Spark-lean

Spark-lean, an interactive PySpark-based Data Cleaning Library

Features

Data versioning
Missing value detection
Text cleaning
Featurization
String matching
Anomaly detection

Installation

pip install Spark-lean

Idea

Spark-lean is a toolkit we built for cleaning and pre-processing large-scale datasets. Drawing on our own experience with different data cleaning libraries, we designed a structure that makes the process more interactive and user-friendly. It minimizes the operations that users need to perform while still providing them with essential information.

Assumptions

When designing this library, we made a few assumptions about the use cases:

Large-scale data
Data-frame structure (a helper function that converts .json files to .csv is provided)
Single dataset
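
As an illustration of the data-frame assumption, here is a minimal sketch of a JSON-to-CSV conversion using only the Python standard library. It is not the helper shipped with Spark-lean: the function name `json_lines_to_csv` is hypothetical, and the sketch assumes the input holds one flat JSON object per line.

```python
import csv
import json


def json_lines_to_csv(json_path, csv_path):
    """Convert a JSON Lines file (one flat object per line) to CSV.

    Hypothetical helper for illustration only; not part of Spark-lean.
    Returns the number of data rows written.
    """
    # First pass: collect every key, in first-seen order, to build the header.
    fieldnames = []
    seen = set()
    with open(json_path, encoding="utf-8") as src:
        for line in src:
            if line.strip():
                for key in json.loads(line):
                    if key not in seen:
                        seen.add(key)
                        fieldnames.append(key)

    # Second pass: stream the records out so the file is never fully in memory.
    rows = 0
    with open(json_path, encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames, restval="")
        writer.writeheader()
        for line in src:
            if line.strip():
                writer.writerow(json.loads(line))
                rows += 1
    return rows
```

Reading the file twice keeps memory use flat, which matters for large inputs. Keys missing from a record are written as empty cells.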

How to use

Dependencies

Please make sure that you have PySpark installed and have run it successfully on Python 3.4+.

Initialization

Start a SparkContext
Read the data
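
The two steps above can be sketched with plain PySpark as follows. This uses only standard PySpark calls and shows none of Spark-lean's own API. The function names, the application name, the `local[*]` master, and the file path in the usage comment are assumptions made for illustration.

```python
def build_session(app_name="spark-lean-demo", master="local[*]"):
    """Start (or reuse) a Spark session; the SparkContext is available on it."""
    # Imported inside the function so this module loads even without Spark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)  # assumed name, pick your own
        .master(master)     # assumed local mode; use your cluster URL instead
        .getOrCreate()
    )


def load_csv(spark, path):
    """Read a CSV file with a header row into a Spark DataFrame."""
    return spark.read.csv(path, header=True, inferSchema=True)


# Example usage (the path is a placeholder):
#   spark = build_session()
#   sc = spark.sparkContext          # the underlying SparkContext
#   df = load_csv(spark, "data.csv")
#   df.printSchema()
```

`inferSchema=True` costs an extra pass over the data, so on very large files it may be preferable to pass an explicit schema.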
