
Connected Component Finder in PySpark

PySpark implementation of the MapReduce "Connected Component Finder" algorithm

Project description

This project is the final project for the Systems and Paradigms for Big Data class of Master IASD, a PSL Research University master's programme. It consists of a PySpark implementation of the MapReduce algorithm described in CCF: Fast and Scalable Connected Component Computation in MapReduce.
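To illustrate the algorithm's logic (this is a pure-Python sketch of the map/reduce rounds described in the CCF paper, not the repository's PySpark code), one CCF-Iterate round groups each edge in both directions, attaches every node to the smallest node seen among itself and its neighbours, and counts newly created pairs; the driver iterates until no new pair is produced:

```python
# Pure-Python sketch of the CCF algorithm from "CCF: Fast and Scalable
# Connected Component Computation in MapReduce".
# Illustration of the map/reduce logic only -- NOT the repository's code.
from collections import defaultdict

def ccf_iterate(pairs):
    """One CCF-Iterate round over (node, node) pairs.

    Returns (new_pairs, new_pair_count). The driver loops until
    new_pair_count == 0, i.e. component labels have converged.
    """
    # Map phase: emit each pair in both directions, grouped by key.
    grouped = defaultdict(list)
    for a, b in pairs:
        grouped[a].append(b)
        grouped[b].append(a)

    # Reduce phase: attach every key to the smallest node seen among
    # its neighbours, and count newly created pairs.
    out, new_count = [], 0
    for key, values in grouped.items():
        smallest = min(values)
        if smallest < key:
            out.append((key, smallest))        # key -> current minimum
            for v in values:
                if v != smallest:
                    out.append((v, smallest))  # propagate the minimum
                    new_count += 1             # the paper's "new pair" counter
    return out, new_count

def connected_components(edges):
    """Run CCF-Iterate to a fixed point, deduplicating between rounds."""
    pairs = list(edges)
    while True:
        pairs, new_count = ccf_iterate(pairs)
        pairs = list(set(pairs))               # CCF-Dedup between rounds
        if new_count == 0:
            break
    return sorted(set(pairs))                  # (node, component_label) pairs

print(connected_components([(1, 2), (2, 3), (4, 5)]))
# → [(2, 1), (3, 1), (5, 4)]  (labels are each component's smallest node)
```

In the repository's Spark version the grouping and reduction map naturally onto RDD transformations, and the new-pair counter becomes an accumulator checked by the driver between iterations.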

Running the code on your machine

You need Spark installed on your machine. Alternatively, you can run the code on Databricks.

Downloading the graphs

To download the graphs described in the report, run the following in a terminal:

wget -P /tmp http://snap.stanford.edu/data/web-Google.txt.gz
gunzip /tmp/web-Google.txt.gz
wget -P /tmp http://snap.stanford.edu/data/cit-HepTh.txt.gz
gunzip /tmp/cit-HepTh.txt.gz

The graphs are now in your /tmp directory. You can move them to any directory you like, but don't forget to update the paths in main.py accordingly.
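The SNAP files downloaded above are plain-text edge lists: lines starting with # are comments, and each remaining line holds a tab-separated source/target node pair. main.py is expected to handle this parsing; purely as an illustration (not the repository's code):

```python
def parse_snap_edges(lines):
    """Parse SNAP edge-list lines into (src, dst) integer pairs,
    skipping blank lines and the '#' comment header."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src, dst = line.split()
        edges.append((int(src), int(dst)))
    return edges

# Illustrative sample in the SNAP format (made-up node IDs).
sample = [
    "# Directed graph: web-Google.txt",
    "# FromNodeId\tToNodeId",
    "0\t11342",
    "0\t824020",
]
print(parse_snap_edges(sample))  # → [(0, 11342), (0, 824020)]
```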

On Databricks

Once you have downloaded the graphs, execute the following in a Python notebook:

dbutils.fs.mv("file:/tmp/web-Google.txt", "dbfs:/FileStore/tables/web-Google.txt")  
dbutils.fs.mv("file:/tmp/cit-HepTh.txt", "dbfs:/FileStore/tables/cit-HepTh.txt")  

This will move the files to the distributed filesystem.

Usage

On Databricks

Copy the code from ccf_pyspark.py and main.py into a notebook, and adapt the code from main.py accordingly (e.g., the graph file paths).

On your computer

Execute the following in a terminal:

spark-submit main.py --method [METHOD] --graph [GRAPH] --show [SHOW]

All arguments are optional. For an explanation of the arguments, run python main.py -h.

Report

The report for this project can be found in the /doc directory of the repository.

References

CCF: Fast and Scalable Connected Component Computation in MapReduce (the paper this project implements).
