Spark-lean, an interactive PySpark-based Data Cleaning Library
- Data versioning
- Missing value detection
- Text cleaning
- Featurization
- String matching
- Anomaly detection
pip install Spark-lean
Spark-lean is a toolkit we built for cleaning and pre-processing large-scale datasets. Drawing on our own experience with other data cleaning libraries, we designed a structure that makes the process more interactive and user-friendly: it minimizes the operations users need to perform while still showing them the essential information.
When designing this library, we made a few assumptions about the use cases:
- Large-scale Data
- Data-frame structure (a helper function that converts .json files to .csv is provided)
- Single dataset
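To illustrate the kind of .json to .csv conversion mentioned in the assumptions above, here is a minimal standard-library sketch. It is not the helper shipped with Spark-lean: the function name `json_lines_to_csv` is hypothetical, and it assumes the input holds one flat JSON object per line and is small enough to fit in memory.

```python
import csv
import json


def json_lines_to_csv(json_path, csv_path):
    """Convert a file with one flat JSON object per line into a CSV file.

    Hypothetical stand-in for illustration; not Spark-lean's own helper.
    Returns the number of records written.
    """
    # Parse every non-blank line as a JSON object.
    with open(json_path, encoding="utf-8") as src:
        records = [json.loads(line) for line in src if line.strip()]

    # Collect the union of keys, preserving first-seen order, for the header.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    # Keys missing from a record are written as empty cells (DictWriter default).
    with open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)

    return len(records)
```

For genuinely large inputs, a distributed reader would be a better fit than loading every record into memory as this sketch does.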
Please make sure that you have PySpark installed and that it runs successfully on Python 3.4+.
- Start SparkContext
- Read data
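The two steps above can be sketched with plain PySpark calls. This uses only the standard PySpark API, not any Spark-lean function; the helper names `start_spark` and `read_csv`, the application name, the `local[*]` master, and the `data.csv` path are assumptions made for illustration.

```python
def start_spark(app_name="spark-lean-demo", master="local[*]"):
    """Start (or reuse) a Spark session and return it with its SparkContext."""
    # Imported inside the function so this file can be loaded
    # even on a machine where PySpark is not installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName(app_name)   # hypothetical application name
        .master(master)      # run locally on all available cores
        .getOrCreate()
    )
    return spark, spark.sparkContext


def read_csv(spark, path):
    """Read a CSV file with a header row into a Spark DataFrame."""
    # inferSchema makes Spark scan the data to guess column types.
    return spark.read.csv(path, header=True, inferSchema=True)


# Example usage (requires PySpark and a Java runtime):
#   spark, sc = start_spark()
#   df = read_csv(spark, "data.csv")   # hypothetical path
#   df.printSchema()
#   spark.stop()
```

`SparkSession` is the usual entry point in Spark 2.0 and later, and the underlying `SparkContext` remains available through `spark.sparkContext`.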