PySpark implementation of the MapReduce "Connected Component Finder" algorithm
This project was developed as the final project for the Systems and Paradigms for Big Data class of Master IASD, a PSL Research University master's programme. It is a Spark implementation of the MapReduce algorithm described in *CCF: Fast and Scalable Connected Component Computation in MapReduce*.
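In outline, each CCF iteration has a map phase that emits every known (node, neighbour) pair in both directions, and a reduce phase that links each node and its neighbours to the smallest node id in the group; a dedup pass removes duplicates, and the loop stops once an iteration produces no new pairs. The sketch below is a minimal PySpark rendition of that scheme: all names are illustrative rather than taken from this repository (the actual implementation is in `ccf_pyspark.py`), and a subtract-based fixed-point test stands in for the new-pair counter used in the paper.

```python
# Minimal sketch of the CCF iteration; illustrative only, see ccf_pyspark.py.
from pyspark import SparkContext

sc = SparkContext(appName="ccf-sketch")

def ccf_iterate(pairs):
    # Map phase: emit every known pair in both directions so each node
    # sees all of its currently known neighbours as reduce input.
    bidirectional = pairs.flatMap(lambda p: [p, (p[1], p[0])])

    def link_to_min(kv):
        node, neighbours = kv[0], list(kv[1])
        smallest = min(neighbours)
        if smallest >= node:
            return []  # this node is already the smallest id it has seen
        # Reduce phase: connect the node and all its other neighbours
        # to the smallest id in the group.
        return [(node, smallest)] + [(n, smallest) for n in neighbours if n != smallest]

    # distinct() plays the role of the separate CCF-Dedup job.
    return bidirectional.groupByKey().flatMap(link_to_min).distinct()

# Toy graph with two components, {1, 2, 3} and {4, 5}.
pairs = sc.parallelize([(1, 2), (2, 3), (4, 5)])
while True:
    new_pairs = ccf_iterate(pairs)
    # Fixed point: stop once an iteration yields no pair we did not have.
    if new_pairs.subtract(pairs).isEmpty():
        break
    pairs = new_pairs

print(sorted(new_pairs.collect()))  # [(2, 1), (3, 1), (5, 4)]
```

At convergence, every node is paired with the smallest node id of its connected component; the minimum node of each component serves as the component identifier and does not itself appear as a key.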
You need to have Spark installed on your computer. Alternatively, you can run the project on Databricks.
To download the graphs described in the report, run in a terminal:
```bash
wget -P /tmp http://snap.stanford.edu/data/web-Google.txt.gz
gunzip /tmp/web-Google.txt.gz
wget -P /tmp http://snap.stanford.edu/data/cit-HepTh.txt.gz
gunzip /tmp/cit-HepTh.txt.gz
```
The graphs are now in your `/tmp` directory. You can move them to any directory you want, but don't forget to change the paths in `main.py` accordingly.
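For reference, here is a minimal sketch of parsing one of these files into an RDD of node pairs, assuming the standard SNAP edge-list format: `#`-prefixed header comments followed by tab-separated integer node ids. The actual loading code lives in `main.py`; point the path at wherever you stored the file.

```python
# Illustrative only; the real loading logic is in main.py.
from pyspark import SparkContext

sc = SparkContext(appName="load-graph")

edges = (sc.textFile("/tmp/web-Google.txt")
           .filter(lambda line: not line.startswith("#"))  # drop SNAP header comments
           .map(lambda line: tuple(int(x) for x in line.split("\t"))))

print(edges.take(3))
```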
If you are running on Databricks, execute the following in a Python notebook once you have downloaded the graphs:

```python
dbutils.fs.mv("file:/tmp/web-Google.txt", "dbfs:/FileStore/tables/web-Google.txt")
dbutils.fs.mv("file:/tmp/cit-HepTh.txt", "dbfs:/FileStore/tables/cit-HepTh.txt")
```

This moves the files to DBFS, the Databricks distributed filesystem.
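You can sanity-check the move from the same notebook by listing the target DBFS directory:

```python
# Verify the files landed on DBFS (Databricks notebook only).
display(dbutils.fs.ls("dbfs:/FileStore/tables/"))
```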
Copy the code from `ccf_pyspark.py` and `main.py` into the notebook, and update the paths in `main.py` to point to the `dbfs:/FileStore/tables/` locations.
To run the project locally, execute the following in a terminal:

```bash
spark-submit main.py --method [METHOD] --graph [GRAPH] --show [SHOW]
```

All arguments are optional. For an explanation of each argument, run `python main.py -h`.
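For orientation, a command line like the one above could be produced by `argparse` definitions along the following lines. This is only a sketch: the authoritative choices, defaults, and help texts are those defined in `main.py` and printed by `python main.py -h`.

```python
# Illustrative sketch only; see main.py for the actual argument definitions.
import argparse

parser = argparse.ArgumentParser(description="Run CCF connected-component computation.")
parser.add_argument("--method", help="which CCF variant to run")
parser.add_argument("--graph", help="which input graph to process")
parser.add_argument("--show", help="whether to display the computed components")
args = parser.parse_args()
print(args.method, args.graph, args.show)
```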
The report for this project can be found in the `/doc` directory of the repository.
- Stanford Network Analysis Project (SNAP): a C++ library for graph mining and analytics, with open access to a large number of graphs.
- CCF: Fast and Scalable Connected Component Computation in MapReduce.
- For an introduction to Secondary Sorting in MapReduce: *Data-Intensive Text Processing with MapReduce*, Chapter 3.
- For another (and more detailed) PySpark implementation of Secondary Sorting: Spark Secondary Sort.