jcomeauictx/bigdata

brutally simple Python "big data" processing using CSV files without any framework

I did this out of frustration with attempting to use Spark on Amazon EMR.

all the time I lost attempting to get the cluster to work with my Spark scripts is apparently not going to be paid. so, on my own time, in about 48 hours, I hacked together something that does the same thing as the Pandas scripts whose functionality I was trying to duplicate. it does the job in a fraction of the code, and it's better in one respect: it weeds out duplicated rows in the input files.
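
to give a feel for the framework-free approach, here's a minimal sketch of the deduplication idea (this is my illustration, not the repo's actual code; all names are invented): stream the CSV rows and drop exact duplicates using nothing but the standard library.

```python
#!/usr/bin/env python3
'''hypothetical sketch: stream CSV rows, skipping exact duplicates.

usage: python3 dedup.py < input.csv > output.csv
'''
import csv
import sys

def dedup(infile=sys.stdin, outfile=sys.stdout):
    '''copy CSV rows from infile to outfile, dropping repeats'''
    writer = csv.writer(outfile)
    seen = set()  # every unique row stays in RAM; hash rows if memory is tight
    for row in csv.reader(infile):
        key = tuple(row)  # csv rows are lists; tuples are hashable
        if key not in seen:
            seen.add(key)
            writer.writerow(row)

if __name__ == '__main__':
    dedup()
```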

it's not fast enough yet, but as long as the tables you're joining against are small enough to fit easily in RAM, it works fine. and by piping the output through `aws s3 cp`, you don't need to store it locally: you can stream it directly to an S3 bucket.
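
here's a hypothetical sketch of that pattern (again, my illustration, not the repo's code; filenames, column positions, and the bucket are made up): the small table is loaded into a dict, the big table streams through stdin, and stdout pipes straight to S3 via `aws s3 cp -`, which reads from stdin.

```python
#!/usr/bin/env python3
'''hypothetical sketch: join a streamed CSV against a small in-RAM table.

usage: python3 join.py small.csv < big.csv | \
           aws s3 cp - s3://mybucket/joined.csv
'''
import csv
import sys

def load_lookup(filename, keycol=0):
    '''read the small table entirely into RAM, keyed on keycol'''
    with open(filename, newline='') as infile:
        return {row[keycol]: row for row in csv.reader(infile)}

def join(lookup, keycol=0, infile=sys.stdin, outfile=sys.stdout):
    '''stream the big table, appending the matching lookup columns'''
    writer = csv.writer(outfile)
    for row in csv.reader(infile):
        match = lookup.get(row[keycol])
        if match is not None:  # inner join: unmatched rows are dropped
            writer.writerow(row + match[1:])

if __name__ == '__main__':
    join(load_lookup(sys.argv[1]))
```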

here it is in action, creating a 35 GB spreadsheet:

[screenshot of top display while the job runs]
