PySpark implementation of the MapReduce "Connected Component Finder" algorithm
This project was developed as the final project for the Systems and Paradigms for Big Data class of Master IASD, a PSL Research University master's programme. It is a Spark implementation of the MapReduce algorithm described in *CCF: Fast and Scalable Connected Component Computation in MapReduce*.
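In outline, each CCF iteration has a map phase that emits every known (node, neighbour) pair in both directions, and a reduce phase that links each node and its neighbours to the smallest node id in the group; a dedup pass removes duplicates, and the loop stops once an iteration produces no new pairs. The sketch below is a minimal PySpark rendition of that scheme: all names are illustrative rather than taken from this repository (the actual implementation is in `ccf_pyspark.py`), and a subtract-based fixed-point test stands in for the new-pair counter used in the paper.

```python
# Minimal sketch of the CCF iteration; illustrative only, see ccf_pyspark.py.
from pyspark import SparkContext

sc = SparkContext(appName="ccf-sketch")

def ccf_iterate(pairs):
    # Map phase: emit every known pair in both directions so each node
    # sees all of its currently known neighbours as reduce input.
    bidirectional = pairs.flatMap(lambda p: [p, (p[1], p[0])])

    def link_to_min(kv):
        node, neighbours = kv[0], list(kv[1])
        smallest = min(neighbours)
        if smallest >= node:
            return []  # this node is already the smallest id it has seen
        # Reduce phase: connect the node and all its other neighbours
        # to the smallest id in the group.
        return [(node, smallest)] + [(n, smallest) for n in neighbours if n != smallest]

    # distinct() plays the role of the separate CCF-Dedup job.
    return bidirectional.groupByKey().flatMap(link_to_min).distinct()

# Toy graph with two components, {1, 2, 3} and {4, 5}.
pairs = sc.parallelize([(1, 2), (2, 3), (4, 5)])
while True:
    new_pairs = ccf_iterate(pairs)
    # Fixed point: stop once an iteration yields no pair we did not have.
    if new_pairs.subtract(pairs).isEmpty():
        break
    pairs = new_pairs

print(sorted(new_pairs.collect()))  # [(2, 1), (3, 1), (5, 4)]
```

At convergence, every node is paired with the smallest node id of its connected component; the minimum node of each component serves as the component identifier and does not itself appear as a key.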
You need to have Spark installed on your computer. Alternatively, you can run the project on Databricks.
To download the graphs described in the report, run in a terminal:
```bash
wget -P /tmp http://snap.stanford.edu/data/web-Google.txt.gz
gunzip /tmp/web-Google.txt.gz
wget -P /tmp http://snap.stanford.edu/data/cit-HepTh.txt.gz
gunzip /tmp/cit-HepTh.txt.gz
```
The graphs are now in your `/tmp` directory. You can move them to any directory you want, but don't forget to change the paths in `main.py` accordingly.
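For reference, here is a minimal sketch of parsing one of these files into an RDD of node pairs, assuming the standard SNAP edge-list format: `#`-prefixed header comments followed by tab-separated integer node ids. The actual loading code lives in `main.py`; point the path at wherever you stored the file.

```python
# Illustrative only; the real loading logic is in main.py.
from pyspark import SparkContext

sc = SparkContext(appName="load-graph")

edges = (sc.textFile("/tmp/web-Google.txt")
           .filter(lambda line: not line.startswith("#"))  # drop SNAP header comments
           .map(lambda line: tuple(int(x) for x in line.split("\t"))))

print(edges.take(3))
```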
If you are running on Databricks, execute the following in a Python notebook once you have downloaded the graphs:

```python
dbutils.fs.mv("file:/tmp/web-Google.txt", "dbfs:/FileStore/tables/web-Google.txt")
dbutils.fs.mv("file:/tmp/cit-HepTh.txt", "dbfs:/FileStore/tables/cit-HepTh.txt")
```

This moves the files to DBFS, the Databricks distributed filesystem.
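You can sanity-check the move from the same notebook by listing the target DBFS directory:

```python
# Verify the files landed on DBFS (Databricks notebook only).
display(dbutils.fs.ls("dbfs:/FileStore/tables/"))
```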
Copy the code from `ccf_pyspark.py` and `main.py` into the notebook, and update the paths in `main.py` to point to the `dbfs:/FileStore/tables/` locations.
To run the project locally, execute the following in a terminal:

```bash
spark-submit main.py --method [METHOD] --graph [GRAPH] --show [SHOW]
```

All arguments are optional. For an explanation of each argument, run `python main.py -h`.
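For orientation, a command line like the one above could be produced by `argparse` definitions along the following lines. This is only a sketch: the authoritative choices, defaults, and help texts are those defined in `main.py` and printed by `python main.py -h`.

```python
# Illustrative sketch only; see main.py for the actual argument definitions.
import argparse

parser = argparse.ArgumentParser(description="Run CCF connected-component computation.")
parser.add_argument("--method", help="which CCF variant to run")
parser.add_argument("--graph", help="which input graph to process")
parser.add_argument("--show", help="whether to display the computed components")
args = parser.parse_args()
print(args.method, args.graph, args.show)
```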
The report for this project can be found in the `/doc` directory of the repository.
- Stanford Network Analysis Project (SNAP): a C++ library for graph mining and analytics, with open access to a large number of graphs.
- CCF: Fast and Scalable Connected Component Computation in MapReduce.
- For an introduction to Secondary Sorting in MapReduce: *Data-Intensive Text Processing with MapReduce*, Chapter 3.
- For another (and more detailed) PySpark implementation of Secondary Sorting: Spark Secondary Sort.