This repository was created to manage a couple of assignments for the Large Scale Methods course of the M.Sc. in Data Science at ITAM. The main goal was to refactor and improve a single Jupyter notebook built for the Kaggle house price prediction competition. The refactoring consisted of splitting the notebook into Python modules and functions, while the improvements involved applying programming best practices and the recommendations of the PEP8 style guide. The code was also moved toward a production style by improving its structure and by adding virtual environments and a Docker definition to deploy the application.
All the scripts were developed using Python 3.9.13 as the programming language, together with the following libraries: `pandas` and `numpy` for data manipulation, `json` for reading external data, `seaborn` and `matplotlib` for data visualization, `scikit-learn` for data processing and machine learning models, `typing` for type annotations, `pylint` and `flake8` for PEP8 style checking, and `pytest` for testing functions and modules.
To reproduce the environment described above, an `environments.yaml` file was created at the root of the project for use with `conda`. The following command should be executed in order to re-create the `house-pricing` conda environment:

```
conda env create --file environments.yaml
```
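Once the environment is created, activate it before running any of the scripts:

```
conda activate house-pricing
```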
The data and its dictionary are available for free under the Data tab of the Kaggle competition.
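Alternatively, the files can be fetched with the official Kaggle CLI (assuming it is installed and configured with an API token; the slug below is the identifier of the House Prices competition on Kaggle):

```
kaggle competitions download -c house-prices-advanced-regression-techniques
```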
The `app` folder contains all the code and required configurations to execute the application. Under this folder there are several single-purpose folders, organized as follows:
- `config`: basic configurations used to execute the main code. A `config-sample.ini` file was added as an example of the content that the actual `config.ini` file should have.
- `data`: should contain all the .csv files used in the execution of the code (`train.csv` and `test.csv`).
- `images`: stores the resulting plots after executing the EDA functions over the train dataset.
- `logs`: stores the logs and messages of the current execution of the code.
- `msc`: contains a single JSON file with the columns' names and their encoding levels.
- `results`: stores the resulting predictions generated with the trained model.
- `src`: composed of 3 core Python scripts:
  - `cleaning.py`: contains the functions that fill and impute NA values based on the needs of the problem.
  - `eda.py`: contains the functions that perform the EDA and store the resulting plots.
  - `preprocessing.py`: contains the functions that generate the encoding of the variables and create pre-defined interactions.
- `test`: contains the modules used to test each function of the modules defined under the `src` folder.
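Putting the structure together, the layout under `app/` is roughly the following (a sketch; the comments only list the files mentioned in this README):

```
app/
├── config/    # config-sample.ini, config.ini
├── data/      # train.csv, test.csv
├── images/    # EDA plots
├── logs/      # execution logs
├── msc/       # map_encoders.json (encoding levels)
├── results/   # model predictions
├── src/       # cleaning.py, eda.py, preprocessing.py
├── test/      # unit tests for the src modules
└── main.py
```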
In addition to those folders, a `main.py` file is located under the `app` folder. This file contains all the steps (described below) that are executed, from reading the data to generating the plots and the final predictions.
By running `python main.py <num_max_leaf_nodes>` from the terminal you can execute the process and generate the plots and the predictions under the respective folders. The `<num_max_leaf_nodes>` argument is required for the training process of the random forest model. An example of how to execute the code is as follows:

```
python main.py 250
```
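As a rough sketch of how the entry point might retrieve the argument and set up logging (an illustrative assumption, not a copy of the actual `main.py`; the log file name is hypothetical):

```python
import logging
import sys

# Retrieve the required CLI argument: the maximum number of leaf
# nodes for the random forest model (e.g. "python main.py 250").
num_max_leaf_nodes = int(sys.argv[1])

# Basic logging configuration; messages end up under the logs folder.
# The file name "execution.log" is a placeholder.
logging.basicConfig(
    filename="logs/execution.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logging.info("Using num_max_leaf_nodes=%d", num_max_leaf_nodes)
```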
When executed, `main.py` performs the following steps:

- The `num_max_leaf_nodes` argument is retrieved and the basic logging configuration is loaded.
- The config values are loaded from the `config.ini` file by using the `ConfigValues` class.
- The train and test datasets are read, and the Ids are extracted and stored in an independent variable.
- A couple of plots are generated by the EDA process and stored in the `images` folder. The first one is a heatmap of null values, which represents the presence or absence of values for one or more variables in each record of the train dataset. The second one is a mix of plots that makes explicit the dispersion and distribution of the values for some variables.
- A pipeline is executed for each dataset (see the sketch after this list). The pipeline is composed of:
  - Drop the unwanted columns.
  - Fill the 'NA' values of specific variables using a custom value.
  - Fill the 'NA' values of the rest of the variables based on the type of each variable: numerical or categorical.
  - Encode a predefined set of variables using the ordinal encoder operation and the `map_encoders.json` file.
  - Encode a predefined set of categorical variables using the label encoder operation.
  - Create a predefined set of variables by interactions between existing variables.
  - Drop the variables that are not used for the final prediction.
- Set the input and output variables to perform the training process and the predictions.
- Create a Random Forest Regressor model and train it using the train dataset and the input and output variables.
- Calculate a list of average scores for the model by using the cross-validation method.
- Finally, calculate the predictions for the test dataset and save them into the `results` folder.
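A condensed sketch of the pipeline and training steps follows. It is hedged: the column names, the assumed shape of `map_encoders.json`, and the output file name are illustrative assumptions, not the repository's actual code.

```python
import json

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder

NUM_MAX_LEAF_NODES = 250  # in the real code this comes from the CLI argument


def run_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative version of the per-dataset pipeline."""
    # 1) Drop unwanted columns (names here are placeholders).
    df = df.drop(columns=["Alley", "PoolQC"], errors="ignore")
    # 2) Fill the NAs of specific variables with a custom value.
    if "FireplaceQu" in df:
        df["FireplaceQu"] = df["FireplaceQu"].fillna("None")
    # 3) Fill the remaining NAs depending on the variable type.
    for col in df.columns:
        if df[col].dtype == "object":
            df[col] = df[col].fillna(df[col].mode()[0])
        else:
            df[col] = df[col].fillna(df[col].median())
    # 4) Ordinal encoding driven by the JSON mapping file.
    with open("msc/map_encoders.json", encoding="utf-8") as fh:
        encoders = json.load(fh)  # assumed shape: {"column": {"level": code}}
    for col, mapping in encoders.items():
        if col in df:
            df[col] = df[col].map(mapping)
    # 5) Label-encode the remaining categorical variables.
    for col in df.select_dtypes(include="object").columns:
        df[col] = LabelEncoder().fit_transform(df[col])
    # 6) Create interaction variables (placeholder example).
    df["OverallGrade"] = df["OverallQual"] * df["OverallCond"]
    return df


train = run_pipeline(pd.read_csv("data/train.csv"))
X, y = train.drop(columns=["SalePrice"]), train["SalePrice"]

# Train the random forest using the CLI argument and report CV scores.
model = RandomForestRegressor(max_leaf_nodes=NUM_MAX_LEAF_NODES, random_state=0)
model.fit(X, y)
print("Average CV score:", cross_val_score(model, X, y, cv=5).mean())

# Predict on the test dataset and store the results.
test = run_pipeline(pd.read_csv("data/test.csv"))
pd.DataFrame({"SalePrice": model.predict(test[X.columns])}).to_csv(
    "results/predictions.csv", index=False
)
```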
Two Dockerfiles were created to execute the application on a single Docker container. If you want to deploy the application in a separate container, the `Dockerfile` file should be used. If you want to deploy the application mounting the disk, the `Dockerfile.mount` file should be used.
No matter which option you choose, the Dockerfile to use has to be named `Dockerfile` and must be the only file with that name, so be careful to rename the Dockerfile that you do not want to use.
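For reference, a minimal sketch of what the standalone `Dockerfile` might contain (an assumption based on the project layout and the conda environment described above, not the repository's actual file):

```dockerfile
FROM continuumio/miniconda3

WORKDIR /app

# Re-create the conda environment defined at the root of the repo.
COPY environments.yaml .
RUN conda env create --file environments.yaml

# Copy the application code.
COPY app/ .

# Run the pipeline inside the environment with a sample argument.
CMD ["conda", "run", "-n", "house-pricing", "python", "main.py", "250"]
```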
The commands to deploy the application in a separate container are the following:

- `docker build -t house-pricing .` (builds the Docker image and tags it as `house-pricing`).
- `docker run -it --rm house-pricing` (runs the application in a separate container and removes the container when the code finishes).
The commands to deploy the application mounting the disk are the following:

- `docker build -t house-pricing .` (builds the Docker image and tags it as `house-pricing`).
- `docker run -it --rm -v <absolute>/<path>/<to>/<repo>/house-pricing-refactoring/app:/app house-pricing` (runs the application mounting the local `app` folder into the container to execute the code, and removes the container when the code finishes).