allenlsj/Spark-lean

 
 


Spark-lean

Spark-lean, an interactive PySpark-based Data Cleaning Library

Features

  • Data versioning
  • Missing value detection
  • Text cleaning
  • Featurization
  • String matching
  • Anomaly detection

Installation

pip install Spark-lean

Idea

Spark-lean is a toolkit we built for cleaning and pre-processing large-scale datasets. Drawing on our own experience with other data cleaning libraries, we designed a structure that makes the process more interactive and user-friendly: it minimizes the operations users need to perform while still giving them the essential information.

Assumptions

When we were designing this library, we made a few assumptions about the use-cases:

  • Large-scale Data
  • Data-frame structure (a helper function that converts .json files to .csv is provided)
  • Single dataset
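
To illustrate the kind of .json to .csv conversion mentioned above, here is a minimal sketch using only the Python standard library. It is not Spark-lean's own helper, whose name and signature are not shown in this README. The function name `json_to_csv` is hypothetical, and the sketch assumes the JSON file holds a list of flat (non-nested) objects.

```python
import csv
import json


def json_to_csv(json_path, csv_path):
    """Convert a JSON file holding a list of flat objects into a CSV file.

    Hypothetical helper for illustration; not part of Spark-lean's API.
    """
    with open(json_path, encoding="utf-8") as src:
        records = json.load(src)

    # Build the header from the union of keys, in first-seen order,
    # so records with differing keys still fit one table.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(csv_path, "w", newline="", encoding="utf-8") as dst:
        # restval fills in an empty cell when a record lacks a column.
        writer = csv.DictWriter(dst, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(records)

    return fieldnames
```

For datasets that are already too large for a single machine, reading the JSON directly with Spark may be a better fit than converting it first.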

How to use

Dependencies

Please make sure that PySpark is installed and runs successfully on Python 3.4+.
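
One way to check these prerequisites before going further is sketched below. The function name `check_environment` is hypothetical and not part of Spark-lean. It only verifies that the interpreter version is high enough and that a `pyspark` package can be found, not that a Spark job actually runs.

```python
import importlib.util
import sys


def check_environment(min_version=(3, 4)):
    """Return (python_ok, pyspark_found) for the current interpreter.

    Hypothetical helper for illustration; not part of Spark-lean's API.
    """
    # sys.version_info compares element-wise against a plain tuple.
    python_ok = sys.version_info >= min_version
    # find_spec locates the package without importing it.
    pyspark_found = importlib.util.find_spec("pyspark") is not None
    return python_ok, pyspark_found


if __name__ == "__main__":
    python_ok, pyspark_found = check_environment()
    print("Python version OK:", python_ok)
    print("PySpark found:", pyspark_found)
```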

Initialization

  • start SparkContext
  • read data
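
The two steps above can be sketched with plain PySpark as follows. Spark-lean's own entry points are not shown in this section, so the sketch uses only standard PySpark calls. The function name `start_spark_and_read`, the application name, the local master setting, and the choice of a CSV file with a header row are all assumptions made for illustration.

```python
def start_spark_and_read(csv_path, app_name="spark-lean-demo"):
    """Start a local Spark session and load a CSV file as a DataFrame.

    Hypothetical helper for illustration; not part of Spark-lean's API.
    """
    # Imported here so the function can be defined even where PySpark
    # is not installed; the import only happens when it is called.
    from pyspark.sql import SparkSession

    # Step 1: start Spark. The SparkContext is available from the session.
    spark = (
        SparkSession.builder
        .master("local[*]")  # assumption: run locally on all cores
        .appName(app_name)
        .getOrCreate()
    )
    sc = spark.sparkContext

    # Step 2: read the data. header/inferSchema assume a CSV file whose
    # first row holds column names.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    return spark, sc, df
```

Calling `start_spark_and_read("data.csv")` (with a path of your own) returns the session, its `SparkContext`, and the loaded DataFrame. Call `spark.stop()` when you are done.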
