Data analysis with Pandas, NumPy, Matplotlib & Seaborn.
This project analyses a publicly available movie dataset from https://www.kaggle.com/beyjin/movies-1990-to-2017 and uses Python tools such as Pandas to get some initial insights into the data. It then cleans, transforms and saves a new version of the dataset in a better structure, with storage in a database in mind.
-
- initial_insights.ipynb
- clean_datasets.ipynb
- cleaned_datasets_grouped.ipynb
- Raw & Cleaned Datasets
There are 3 files, which you can go through in this exact order:
-
initial_insights.ipynb
A first look at the raw datasets, to find insights that help us understand the data we will be processing and to get an overview of how the datasets should be structured, as if we were going to store the data in a database.
Note: insights and conclusions can be found in the Jupyter notebook.
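As a rough illustration of what such a first look can involve, here is a minimal sketch. It is not taken from the notebook: the `raw` DataFrame and its column names are hypothetical stand-ins for one of the raw files, and `first_look` is a helper invented for this example.

```python
import pandas as pd

# Hypothetical stand-in for one raw file; the real column names may differ.
raw = pd.DataFrame(
    {
        "title": ["Movie A", "Movie B", "Movie C"],
        "year": ["1995", "2003", None],
        "genres": ["Drama|Crime", "Comedy", "Action|Comedy"],
    }
)


def first_look(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column: dtype, number of missing and distinct values."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing": df.isna().sum(),
            "unique": df.nunique(),
        }
    )


summary = first_look(raw)
print(raw.shape)  # (rows, columns)
print(summary)
```

A per-column summary like this makes it easier to spot columns stored with the wrong type or packing several values into one field, which is useful when deciding how to split the data into tables.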
-
clean_datasets.ipynb
Here we go through the whole process: standardizing the data types, extracting columns that should go in a different dataset, and saving the cleaned datasets.
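The three steps could look roughly like the sketch below. It only illustrates the idea and is not the notebook's actual code: the columns (`movie_id`, `title`, `year`, `genres`) and the output file names are assumptions, and the files are written to a temporary folder rather than the project's output folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical raw data; the real dataset's columns may differ.
raw = pd.DataFrame(
    {
        "movie_id": [1, 2, 3],
        "title": [" Movie A", "Movie B ", "Movie C"],
        "year": ["1995", "2003", "n/a"],
        "genres": ["Drama|Crime", "Comedy", "Action|Comedy"],
    }
)

# 1. Standardize data types: trim text, turn year into a nullable integer.
movies = raw[["movie_id", "title", "year"]].copy()
movies["title"] = movies["title"].str.strip()
movies["year"] = pd.to_numeric(movies["year"], errors="coerce").astype("Int64")

# 2. Extract the multi-valued column into its own dataset, one row per genre.
genres = (
    raw[["movie_id", "genres"]]
    .assign(genre=lambda d: d["genres"].str.split("|"))
    .explode("genre")
    .drop(columns="genres")
    .reset_index(drop=True)
)

# 3. Save the cleaned datasets (a temporary folder is used in this sketch).
out_dir = Path(tempfile.mkdtemp())
movies.to_csv(out_dir / "movies.csv", index=False)
genres.to_csv(out_dir / "movie_genres.csv", index=False)
```

Splitting the multi-valued column into its own dataset keeps one value per cell, which is closer to how the data would be laid out in database tables.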
-
cleaned_datasets_grouped.ipynb
Here we take the cleaned datasets and join them all together into one single, large dataset.
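A join of this kind might be sketched as follows. The three small DataFrames and the `movie_id` key are hypothetical and stand in for the cleaned datasets; the notebook may join on different columns.

```python
import pandas as pd

# Hypothetical cleaned datasets sharing a movie_id key.
movies = pd.DataFrame(
    {"movie_id": [1, 2, 3], "title": ["Movie A", "Movie B", "Movie C"]}
)
ratings = pd.DataFrame({"movie_id": [1, 2, 3], "rating": [8.1, 6.4, 7.0]})
genres = pd.DataFrame(
    {"movie_id": [1, 1, 2], "genre": ["Drama", "Crime", "Comedy"]}
)

# Left joins keep every movie, even when a related dataset has no match.
# validate= raises an error if the key relationship is not the expected one.
grouped = (
    movies
    .merge(ratings, on="movie_id", how="left", validate="one_to_one")
    .merge(genres, on="movie_id", how="left", validate="one_to_many")
)
print(grouped)
```

Note that joining a one-to-many dataset repeats the movie row once per match, so the combined dataset has more rows than the movies dataset alone.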
-
Raw & Cleaned Datasets
- The original datasets (raw) are located in the folder
orignal_datasets/
- The generated output datasets (cleaned) will be located in the folder
output/
Technology Stack | Role
---|---
Python | Language
Pandas | Data Analysis & Manipulation
NumPy | Data Computing
Matplotlib | Data Visualization
Seaborn | Data Visualization
Get in touch -> fantaso