This project analyzes house prices in Madrid, Spain, using Python and several machine learning libraries. It assumes a basic understanding of data analysis and machine learning concepts. To install and use it, follow these steps:
- Create a Python environment using your preferred method (e.g. `conda` or `virtualenv`).
- Activate the environment and navigate to the project directory.
- Install the required packages using `pip` and the `requirements.txt` file:
pip install -r requirements.txt
- Install the `utils` module by running the following command from the project directory:
pip install -e src/
- Start a JupyterLab server by running the following command:
jupyter lab
Alternatively, with the `ipykernel` package installed, you can select the environment's kernel directly inside VS Code.
- Navigate to the `notebooks` directory and open the desired notebook.
- Execute the cells in the notebook to preprocess the data, perform exploratory data analysis, and build and evaluate machine learning models.
- The data is stored in the `data` directory, which contains four subfolders:
  - `raw`: contains the raw training and testing data in CSV format.
  - `processed`: contains the processed data in CSV format.
  - `models`: contains the trained machine learning models as pickle files, along with performance metrics as JSON files.
  - `submission`: contains the submission files in CSV format.
- The `src` directory contains a Python module with the necessary `sklearn` transformers for ETL and utility functions.
- The `notebooks` directory contains the notebooks that walk through every step of the Madrid house price analysis.
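The storage convention described above (one pickle file per trained model, with its metrics in a JSON file of the same base name) could be implemented along these lines. This is a minimal sketch using only the standard library. The helper names `save_model_artifacts` and `load_model_artifacts`, the optional `metrics_dir` argument, and the `rmse` key in the usage note are illustrative assumptions, not names taken from the project.

```python
import json
import pickle
from pathlib import Path


def save_model_artifacts(model, metrics, name, model_dir, metrics_dir=None):
    """Write `<name>.pkl` and `<name>.json` (hypothetical helper).

    If `metrics_dir` is omitted, the JSON file is stored next to the model.
    """
    model_dir = Path(model_dir)
    metrics_dir = Path(metrics_dir) if metrics_dir is not None else model_dir
    model_dir.mkdir(parents=True, exist_ok=True)
    metrics_dir.mkdir(parents=True, exist_ok=True)

    model_path = model_dir / f"{name}.pkl"
    metrics_path = metrics_dir / f"{name}.json"

    # Serialize the trained model; any picklable object works here.
    with model_path.open("wb") as fh:
        pickle.dump(model, fh)

    # Store the metrics under the same base name as the model.
    with metrics_path.open("w", encoding="utf-8") as fh:
        json.dump(metrics, fh, indent=2)

    return model_path, metrics_path


def load_model_artifacts(name, model_dir, metrics_dir=None):
    """Read back the model and metrics written by `save_model_artifacts`."""
    model_dir = Path(model_dir)
    metrics_dir = Path(metrics_dir) if metrics_dir is not None else model_dir

    with (model_dir / f"{name}.pkl").open("rb") as fh:
        model = pickle.load(fh)
    with (metrics_dir / f"{name}.json").open(encoding="utf-8") as fh:
        metrics = json.load(fh)

    return model, metrics
```

With a fitted estimator, a call might look like `save_model_artifacts(estimator, {"rmse": rmse}, "model_1", "data/models")`. Only load pickle files from sources you trust, since unpickling can execute arbitrary code.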
house_price_analysis/
├── data/
│   ├── raw/
│   │   ├── train.csv
│   │   └── predict.csv
│   ├── processed/
│   │   ├── train.csv
│   │   └── test.csv
│   ├── models/
│   │   └── model_1.pkl
│   ├── metrics/
│   │   └── model_1.json
│   └── submission/
│       ├── submission_1.csv
│       └── submission_2.csv
├── src/
│   ├── utils/
│   │   ├── transformers.py
│   │   ├── paths.py
│   │   ├── functions.py
│   │   └── __init__.py
│   ├── pyproject.toml
│   ├── setup.cfg
│   └── setup.py
└── notebooks/
    ├── 01_EDA.ipynb
    └── 02_Modeling.ipynb
This directory structure shows the organization of the project. The `data` directory contains the raw and processed data, as well as the models and submission files. The `src` directory contains the Python module with the necessary transformers and utility functions. The `notebooks` directory contains the notebooks that walk through every step of the Madrid house price analysis.
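A module such as `paths.py` could keep these locations in one place so notebooks do not hard-code them. The following is a sketch under stated assumptions: the function name `project_paths` and the dictionary keys are hypothetical, and the actual contents of `paths.py` are not shown in this README.

```python
from pathlib import Path


def project_paths(root):
    """Map short names to the directories in the tree above (hypothetical helper)."""
    root = Path(root)
    data = root / "data"
    return {
        "raw": data / "raw",
        "processed": data / "processed",
        "models": data / "models",
        "metrics": data / "metrics",
        "submission": data / "submission",
        "src": root / "src",
        "notebooks": root / "notebooks",
    }


# Example: build the path to the raw training file.
paths = project_paths("house_price_analysis")
train_csv = paths["raw"] / "train.csv"
```

Returning `Path` objects rather than strings lets callers join further components with `/` regardless of the operating system.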
The data used for this project comes from the Kaggle competition "Machine Learning Avanzado I - Hands-on". It is split into two files, `train.csv` and `predict.csv`. The `train.csv` file contains the training data, including the target variable `buy_price_by_area`. The `predict.csv` file contains the data to predict for the submission and does not include the target variable. The goal of the project is to predict `buy_price_by_area` for the houses in `predict.csv`.
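Before modeling, it can help to confirm that the two files match this description: the target appears in `train.csv` and is absent from `predict.csv`. The sketch below reads only the header rows with the standard library. The function names are hypothetical, and it assumes comma-delimited, UTF-8 encoded files with a header row.

```python
import csv

TARGET = "buy_price_by_area"  # target column named in the text above


def read_header(csv_path):
    """Return the column names from the first row of a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        return next(csv.reader(fh))


def shared_feature_columns(train_path, predict_path, target=TARGET):
    """Return the columns present in both files, excluding the target.

    Raises ValueError if the target is missing from the training file
    or unexpectedly present in the prediction file.
    """
    train_cols = read_header(train_path)
    predict_cols = read_header(predict_path)

    if target not in train_cols:
        raise ValueError(f"{target!r} not found in {train_path}")
    if target in predict_cols:
        raise ValueError(f"{target!r} should not appear in {predict_path}")

    # Keep the training file's column order for the shared features.
    return [col for col in train_cols if col != target and col in predict_cols]
```

A call such as `shared_feature_columns("data/raw/train.csv", "data/raw/predict.csv")` would then give the feature columns a model can rely on at prediction time.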