This repository contains the project I developed during the second Data Science Challenge promoted by Alura, a Brazilian online platform for technology courses.
The purpose of these challenges is to deepen students' knowledge of Data Science through practical exercises. In this edition, PySpark is used for the entire ETL process, to build a regression model that prices real estate, and to create a real estate recommender.
🪧 Vitrine.Dev | |
---|---|
✨ Project Name | Alura Data Science Challenge II |
🏷️ Technologies | Python |
🚀 Libraries | PySpark, zipfile, Seaborn and Matplotlib (pyplot) |
🔥 Challenge | https://www.alura.com.br/challenges/data-science-2/ |
Objectives: Build a regression model to price real estate and create a real estate recommender.
Data: The data is available here and the data dictionary here.
Structure: The challenge is divided into 3 parts: ETL (Extract, Transform and Load), Creating Regression Models, and Building a Real Estate Recommender.
The first part of the project is dedicated to the ETL process: extracting the data in JSON format into Python, transforming/cleaning it, and loading it in CSV and Parquet file formats. All of these activities were performed with the PySpark library.
The data was also translated into English at this stage.
All activities performed are documented in this notebook.
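Below is a minimal sketch of this ETL flow, assuming a SparkSession and a zipped JSON dataset; the file paths and column names (e.g. "price") are illustrative, not the project's actual schema.

```python
import zipfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl").getOrCreate()

# Extract the raw JSON from the zip archive so Spark can read it.
with zipfile.ZipFile("data.zip") as zf:
    zf.extractall("data/")

# Extract: load the JSON file into a Spark DataFrame.
raw = spark.read.json("data/dataset.json")

# Transform: keep only rows with a price and rename a column to English (illustrative step).
clean = (raw
         .dropna(subset=["price"])
         .withColumnRenamed("price", "sale_price"))

# Load: persist the result in both CSV and Parquet formats.
clean.write.mode("overwrite").option("header", True).csv("output/real_estate_csv")
clean.write.mode("overwrite").parquet("output/real_estate_parquet")
```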
The second part of the project is dedicated to:
- Processing the data from the first notebook, "1 - Extract, Transform and Load", for use with regression models (see the preprocessing sketch after this list):
  - Treating null and NaN values;
  - Treating missing data in the zone columns;
  - Transforming categorical columns into binary (0/1) columns;
  - Removing useless columns;
  - Saving the DataFrame to a Parquet file.
- Creating the models (see the modelling sketch below):
  - Vectorizing the data (VectorAssembler);
  - Creating 4 models (Linear Regression, Decision Tree Regressor, Random Forest Regressor and Gradient-Boosted Tree Regressor);
  - Optimizing the best model;
  - Cross-validation and hyperparameter testing.
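A minimal preprocessing sketch, assuming the Parquet output of the first notebook; the column names ("zone", "sale_price") are hypothetical stand-ins for the real schema.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("preprocessing").getOrCreate()
df = spark.read.parquet("output/real_estate_parquet")

# Treat nulls/NaN: fill missing numeric values with 0 and drop rows without the target.
df = df.fillna(0).dropna(subset=["sale_price"])

# Treat missing zone values: replace empty strings with an explicit label.
df = df.withColumn(
    "zone",
    F.when(F.col("zone") == "", "unknown").otherwise(F.col("zone"))
)

# Turn the categorical column into binary 0/1 columns (simple one-hot via when/otherwise).
zones = [row["zone"] for row in df.select("zone").distinct().collect()]
for z in zones:
    df = df.withColumn(f"zone_{z}", F.when(F.col("zone") == z, 1).otherwise(0))

# Remove the column that is no longer useful and save the result.
df = df.drop("zone")
df.write.mode("overwrite").parquet("output/real_estate_model_ready")
```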
All activities performed are documented in this notebook.
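And a minimal sketch of the modelling steps (vectorization, the four regressors, and cross-validation), assuming a numeric, model-ready DataFrame; the feature selection and parameter grid are illustrative, not the tuned values used in the notebook.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import (LinearRegression, DecisionTreeRegressor,
                                   RandomForestRegressor, GBTRegressor)
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("models").getOrCreate()
df = spark.read.parquet("output/real_estate_model_ready")

# Vectorize: combine the feature columns into a single "features" vector.
features = [c for c in df.columns if c != "sale_price"]
assembler = VectorAssembler(inputCols=features, outputCol="features")
data = assembler.transform(df).select("features", "sale_price")
train, test = data.randomSplit([0.8, 0.2], seed=42)

evaluator = RegressionEvaluator(labelCol="sale_price", predictionCol="prediction",
                                metricName="rmse")

# Fit the four candidate models and compare their RMSE on the test split.
models = {
    "linear": LinearRegression(labelCol="sale_price"),
    "decision_tree": DecisionTreeRegressor(labelCol="sale_price"),
    "random_forest": RandomForestRegressor(labelCol="sale_price"),
    "gbt": GBTRegressor(labelCol="sale_price"),
}
for name, model in models.items():
    fitted = model.fit(train)
    print(name, evaluator.evaluate(fitted.transform(test)))

# Optimize the best model with cross-validation over a small hyperparameter grid.
rf = RandomForestRegressor(labelCol="sale_price")
grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [20, 50])
        .addGrid(rf.maxDepth, [5, 10])
        .build())
cv = CrossValidator(estimator=rf, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)
best_model = cv.fit(train).bestModel
```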
Working ...