This repository contains the project I developed during the second Data Science Challenge promoted by Alura, a Brazilian online platform for technology courses.
The purpose of these challenges is to deepen students' knowledge of Data Science through practical exercises. In this edition, PySpark is used for the entire ETL process, to build a regression model that prices real estate, and to create a real estate recommender.
🪧 Vitrine.Dev | |
---|---|
✨ Project Name | Alura Data Science Challenge II |
🏷️ Technologies | Python |
🚀 Libraries | PySpark, zipfile, Seaborn and Matplotlib (pyplot) |
🔥 Challenge | https://www.alura.com.br/challenges/data-science-2/ |
Objectives: Build a regression model to price real estate and create a real estate recommender.
Data: The data is available here and the data dictionary here.
Structure: The challenge is divided into 3 parts: ETL (Extract, Transform and Load), Creating Regression Models, and Building a Real Estate Recommender.
The first part of the project is dedicated to the ETL process: extracting the data in JSON format into Python, transforming/cleaning it, and loading it in CSV and Parquet file formats. All of these activities were performed with the PySpark library.
The data was also translated into English at this stage.
All activities performed are documented in this notebook.
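Below is a minimal sketch of this ETL flow, assuming a SparkSession and a zipped JSON dataset; the file paths and column names (e.g. "price") are illustrative, not the project's actual schema.

```python
import zipfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl").getOrCreate()

# Extract the raw JSON from the zip archive so Spark can read it.
with zipfile.ZipFile("data.zip") as zf:
    zf.extractall("data/")

# Extract: load the JSON file into a Spark DataFrame.
raw = spark.read.json("data/dataset.json")

# Transform: keep only rows with a price and rename a column to English (illustrative step).
clean = (raw
         .dropna(subset=["price"])
         .withColumnRenamed("price", "sale_price"))

# Load: persist the result in both CSV and Parquet formats.
clean.write.mode("overwrite").option("header", True).csv("output/real_estate_csv")
clean.write.mode("overwrite").parquet("output/real_estate_parquet")
```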
The second part of the project is dedicated to:
- Processing the data from the first notebook, "1 - Extract, Transform and Load", for use with regression models (see the preprocessing sketch after this list):
  - Treating null and NaN values;
  - Treating missing data in the zone columns;
  - Transforming categorical columns into binary (0/1) columns;
  - Removing useless columns;
  - Saving the DataFrame to a Parquet file.
- Creating the models (see the modelling sketch below):
  - Vectorizing the data (VectorAssembler);
  - Creating 4 models (Linear Regression, Decision Tree Regressor, Random Forest Regressor and Gradient-Boosted Tree Regressor);
  - Optimizing the best model;
  - Cross-validation and hyperparameter testing.
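A minimal preprocessing sketch, assuming the Parquet output of the first notebook; the column names ("zone", "sale_price") are hypothetical stand-ins for the real schema.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("preprocessing").getOrCreate()
df = spark.read.parquet("output/real_estate_parquet")

# Treat nulls/NaN: fill missing numeric values with 0 and drop rows without the target.
df = df.fillna(0).dropna(subset=["sale_price"])

# Treat missing zone values: replace empty strings with an explicit label.
df = df.withColumn(
    "zone",
    F.when(F.col("zone") == "", "unknown").otherwise(F.col("zone"))
)

# Turn the categorical column into binary 0/1 columns (simple one-hot via when/otherwise).
zones = [row["zone"] for row in df.select("zone").distinct().collect()]
for z in zones:
    df = df.withColumn(f"zone_{z}", F.when(F.col("zone") == z, 1).otherwise(0))

# Remove the column that is no longer useful and save the result.
df = df.drop("zone")
df.write.mode("overwrite").parquet("output/real_estate_model_ready")
```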
All activities performed are documented in this notebook.
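And a minimal sketch of the modelling steps (vectorization, the four regressors, and cross-validation), assuming a numeric, model-ready DataFrame; the feature selection and parameter grid are illustrative, not the tuned values used in the notebook.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import (LinearRegression, DecisionTreeRegressor,
                                   RandomForestRegressor, GBTRegressor)
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("models").getOrCreate()
df = spark.read.parquet("output/real_estate_model_ready")

# Vectorize: combine the feature columns into a single "features" vector.
features = [c for c in df.columns if c != "sale_price"]
assembler = VectorAssembler(inputCols=features, outputCol="features")
data = assembler.transform(df).select("features", "sale_price")
train, test = data.randomSplit([0.8, 0.2], seed=42)

evaluator = RegressionEvaluator(labelCol="sale_price", predictionCol="prediction",
                                metricName="rmse")

# Fit the four candidate models and compare their RMSE on the test split.
models = {
    "linear": LinearRegression(labelCol="sale_price"),
    "decision_tree": DecisionTreeRegressor(labelCol="sale_price"),
    "random_forest": RandomForestRegressor(labelCol="sale_price"),
    "gbt": GBTRegressor(labelCol="sale_price"),
}
for name, model in models.items():
    fitted = model.fit(train)
    print(name, evaluator.evaluate(fitted.transform(test)))

# Optimize the best model with cross-validation over a small hyperparameter grid.
rf = RandomForestRegressor(labelCol="sale_price")
grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [20, 50])
        .addGrid(rf.maxDepth, [5, 10])
        .build())
cv = CrossValidator(estimator=rf, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)
best_model = cv.fit(train).bestModel
```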
Working ...