Lazy-cleaner is a python package that cleans your dataframe and that includes correcting data types and class labeling techniques and filling the Null values with more than one method saving the time of coding exeplicit methods.
- saving time.
- preprocessing data like a pro.
- user friendly for begineer Data learners.
To install Lazy-cleaner Package:
- Make sure python3 and pip is installed in your machine.
- Use the command line and type:
pip install Lazy-cleaner
orpip3 install Lazy-cleaner
The most simple format of using Lazy-cleaner is as follow:
from Lazy_cleaner.Lazy_cleaner import stat
from Lazy_cleaner.Lazy_cleaner import fillnan
from Lazy_cleaner.Lazy_cleaner import label
from Lazy_cleaner.Lazy_cleaner import Clean_Data
import pandas as pd
Data = pd.read_csv("FileName.csv")
fill = fillnan()
New_Data = fill.simpleimputer(Data)
fill the Nan values by Linear Regression or by Random Forest.
also consider filling by median manually before using this method
- messy_df -> data frame you want to work on.
- metric -> model you want to work with (Linear Reg , Random forest).
- colnames-> list of column names that contains Null vlaues.
Returns : the cleaned dataframe after filling the Null values and a list of the missing values
calculate Ordinary Least Squares to know which features are important and make overview of the regression model.
- df -> the dataframe.
- label_column -> the Target column Name in string format.
Returns : summary report about the features and their p value
calculate the percentage of Nan values in the whole dataset
- df -> dataframe
returns: the percentage of Nan values in the whole dataset
this function fill the Null values with simple techniques like (mean,mode) depending on the data type of the column
also consider filling by median manually before using this method
- df -> dataframe
returns: the dataframe after editing
this function compines between the simple imputation , the linear regression imputation and random forests imputation as in object columns it will impute with the mode and for any numerical number it will impute with linear regression or random forest
also consider filling by median manually before using this method
- df->dataframe
- metric ->"Linear Regression" or "Random Forests" models to fill you numeric nan values with
returns: the dataframe after editing
fill nan values by the next value
- df -> dataframe
- columns -> a list of columns you want to fill
returns: the dataframe after editing
this one fill nan values by the previous value
- df -> dataframe
- columns -> a list of columns you want to fill
returns: the dataframe after editing
change from and object datatype column into categorie datatype column
- df-> dataframe
returns: the dataframe after editing
replace objects by how frequent they are in a certain column
- df -> dataframe
- column -> column you want to apply this method on
returns: the dataframe after editing
reduce memory usage of the dataframe.
- props-> dataset you want to reduce
returns: the dataframe after editing and Nan values list
if there is simillar words but different format due to a data entry you can apply this function to calculate similarity percentage and then change them
- df -> dataframe
- column -> column to edit
- string_to_match -> string you want to match and replace
- min_ratio -> minimum probability to excange
returns: the dataframe after editing
drop the columns if the missing values exceed 60% of it
- df-> dataframe
- thresh->percentage of the missing threshold to Delete above
returns: the dataframe after editing
apply Log function to the features.
- df -> dataframe you want to apply log func to
returns: dataset after applying log
removes outliers with two techniques z-score and IQR ("Inter-Quartile Range").
- df -> dataframe
- type_o -> type of the method you want to choose
Returns : the dataframe after removing the outliers
normalize columns by your choice in a dataframe
- df -> dataframe
- columns -> list of columns to normalize its values
returns: the dataframe after editing
encode and decode any string format
- decode -> if you want to encode assign as False if decode assign True
- typeo -> the format you want to encode to or decode from
returns: the result after encoding or decoding