This study focuses on improving visualization techniques throughout the EDA and feature engineering process that precedes model development. It uses the Google Play Store dataset, which contains information about apps across different categories.
In Kaggle notebooks, this dataset is generally used to predict the number of installs of an app from the given features. The focus of this study, however, is not on developing a prediction model but on the techniques and details of preprocessing, one of the most important stages of model development. Visualization techniques are especially helpful here: the information extracted at this stage guides what we can expect from the model and which features are likely to matter most for predicting the target.
This study does not describe the dataset in detail, but it provides all the techniques and code for data transformation, descriptive analysis, and visualization, so you can apply the same techniques and perspectives before any model development process. The dataset contains both categorical and numeric values, so you can see how to deal with both kinds of features.
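As a minimal sketch of handling both kinds of features, the snippet below splits a DataFrame into numeric and categorical columns with pandas `select_dtypes` and summarizes each group. The sample rows and column names are illustrative assumptions, not values taken from the notebooks.

```python
import pandas as pd

# Hypothetical sample rows; the real columns and values may differ.
df = pd.DataFrame({
    "Category": ["GAME", "TOOLS", "GAME"],
    "Type": ["Free", "Paid", "Free"],
    "Rating": [4.5, 3.9, 4.1],
    "Reviews": [1200, 85, 430],
})

# Split column names by dtype so each group gets a suitable summary.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Numeric features: summary statistics such as mean and quartiles.
print(df[numeric_cols].describe())

# Categorical features: frequency of each level.
for col in categorical_cols:
    print(df[col].value_counts())
```

Note that a raw CSV may store numeric information as text, in which case such columns land in the categorical group until they are converted.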
I hope this notebook will be a good resource for preprocessing and exploratory data analysis with visualization techniques.
The dataset used in this study was obtained from Kaggle; you can reach it from this link. Only 'googleplaystore.csv' is used for this study. The dataset is also available in the dataset folder. It includes 13 features; you can find the details in the data transformation notebook.
The transformed dataset produced in the first phase has also been uploaded.
To install the dependencies needed to run the notebooks, you can use Anaconda. Once Anaconda is installed, run:
$ conda env create -f environment.yml
data-transformation.ipynb includes all data cleaning and transformation steps.
eda-visualization.ipynb includes all visualization techniques for univariate and bivariate analysis.
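The two phases described above can be sketched as follows: a few cleaning steps that turn text-encoded numbers into numeric columns, followed by one univariate and one bivariate plot. This is a minimal illustration under stated assumptions, not the notebooks' actual code; the sample rows, the string formats of `Installs` and `Price`, and the median fill for missing ratings are all assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical raw rows; values and formats are illustrative assumptions.
raw = pd.DataFrame({
    "Category": ["GAME", "TOOLS", "GAME", "FAMILY"],
    "Rating": [4.5, 3.9, None, 4.1],
    "Installs": ["10,000+", "500+", "1,000,000+", "50,000+"],
    "Price": ["0", "$4.99", "0", "$0.99"],
})

clean = raw.copy()

# Strip thousands separators and the trailing "+" before converting to int.
clean["Installs"] = (
    clean["Installs"].str.replace(r"[,+]", "", regex=True).astype(int)
)

# Strip the currency symbol before converting to float.
clean["Price"] = clean["Price"].str.replace("$", "", regex=False).astype(float)

# Assumed imputation choice: fill missing ratings with the median.
clean["Rating"] = clean["Rating"].fillna(clean["Rating"].median())

fig, (ax_uni, ax_bi) = plt.subplots(1, 2, figsize=(10, 4))

# Univariate: distribution of a single numeric feature.
ax_uni.hist(clean["Rating"], bins=5)
ax_uni.set_xlabel("Rating")
ax_uni.set_ylabel("Number of apps")

# Bivariate: relationship between two numeric features.
ax_bi.scatter(clean["Rating"], clean["Installs"])
ax_bi.set_yscale("log")  # install counts span several orders of magnitude
ax_bi.set_xlabel("Rating")
ax_bi.set_ylabel("Installs (log scale)")

fig.tight_layout()
```

In a notebook the figure renders inline; in a script, call `plt.show()` or `fig.savefig(...)` afterwards.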
Several resources helped throughout this study, especially the Exploratory Data Analysis with Python Cookbook by Ayodele Oluleye, which shaped how we approach visualizing the data. It is a strongly recommended resource. Other resources:
- Data Science — A comprehensive analysis on “Google Play Store Apps” dataset from Kaggle Blog
- Kaggle Useful Notebook
If you want to contribute, please send a pull request. All contributions are welcome!
Please check this repository for updates, and feel free to open issues or send pull requests.