GROUP NAME : Group 14
Date : January 14th, 2023
Real estate contributes significantly to greenhouse gas (GHG) emissions, yet investments in insulation work still fall far short of their targets.
The purpose of the project is to use Machine Learning to predict the energy consumption of buildings. A better understanding of how insulation work affects the energy consumption of buildings will support energy transition investments.
The project scope will include the Machine Learning regression task and the business pitch.
From a technical standpoint, the project therefore aims to preprocess the data to facilitate learning, and to design and train a model capable of predicting the target values.
From a business standpoint, the project also aims to prepare a business pitch that leverages the ML model and estimates energy and cost savings.
However, while a proper ML pipeline will be designed using Python libraries, the corresponding application will not be developed during the hackathon.
The repository containing the code for preprocessing, training and inference will be uploaded to GitLab, alongside this document.
| First name | Last name | Year of studies & Profile | School | Skills | Roles/Tasks |
|---|---|---|---|---|---|
| Agathe | MINARO | Master 2 ENSAE | ENSAE | Data Science | Modeling, Training |
| Guillaume | PHILIPPE | Master 2 Data Science | IP Paris | Data Science, Computer Science | Training, Testing |
| Huu Tan | MAI | Master 2 Telecom Paris | Telecom Paris | Data Engineering/Science | Preprocessing, Refactoring, Writing |
| Max Cameron | WU | Master 2 Data Science | IP Paris | Data Science | Feature Engineering, Finetuning |
| Yassine | HARGANE | Master 2 HEC Paris | HEC Paris | Business Engineering | Business |
The team held a meeting every morning to discuss progress and next steps. Regarding the code, we used a Git repository for version control and collaborative work.
On the first day, most of the team's effort was directed towards preprocessing. Max and Tan worked on feature engineering, and the whole team contributed to preprocessing. Tan further refactored the preprocessing pipeline, while Guillaume performed the exploratory data analysis and wrote a script to easily train and compare models. Yassine started working on the business pitch.
On the second day, Agathe and Max continued to train and fine-tune models. Tan and Guillaume worked on the report. Afterwards, the whole team worked on the business pitch.
At the start of the challenge, the team spent time on the exploratory data analysis (EDA).
We noticed that the quality of the data was not optimal. For instance, some features were missing for a large number of observations, and some continuous features contained outliers. Some categorical features were also poorly encoded.
Following this analysis, we agreed that the preprocessing step would be crucial for the quality of the predictive model.
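As an illustration, the sketch below shows the kind of checks we ran during EDA; the file and column names are placeholders rather than the exact challenge features.

```python
import pandas as pd

# Illustrative EDA checks; file and column names are placeholders.
df = pd.read_csv("train.csv")

# Share of missing values per feature
print(df.isna().mean().sort_values(ascending=False).head(10))

# Spread of a continuous feature, to spot outliers
print(df["energy_consumption"].describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.99]))

# Cardinality of categorical features, to spot poor encodings
print(df.select_dtypes(include="object").nunique().sort_values(ascending=False))
```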
A lot of time and effort was put in data preprocessing.
Many problems were encountered because of inconsistencies in the sklearn library that prevented us from recovering feature names after column transformations in pipelines. Since the team did not have time to patch the sklearn classes and fix these inconsistencies, we decided to edit the datasets manually and engineer the features outside of pipelines.
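A minimal sketch of this manual feature engineering, assuming generic column types (the exact features of the challenge dataset differ):

```python
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Manual feature engineering applied outside sklearn pipelines,
    so that feature names stay readable after transformation."""
    out = df.copy()
    # Impute numeric features with the median and keep a missingness flag
    for col in out.select_dtypes(include="number").columns:
        out[f"{col}_was_missing"] = out[col].isna().astype(int)
        out[col] = out[col].fillna(out[col].median())
    # One-hot encode categorical features with explicit column names
    cat_cols = out.select_dtypes(include="object").columns
    return pd.get_dummies(out, columns=cat_cols, dummy_na=True)

# Placeholder file names; test columns are aligned on the training columns
train_fe = engineer_features(pd.read_csv("train.csv"))
test_fe = engineer_features(pd.read_csv("test.csv"))
test_fe = test_fe.reindex(columns=train_fe.columns, fill_value=0)
```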
The training data was truncated to remove outliers and to keep samples representative of the majority of the data. The test data was not truncated.
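For illustration, the truncation could be implemented as a simple quantile filter on the target; the column name and cut-offs below are assumptions, not the exact thresholds used.

```python
# Drop training rows with extreme target values; the test set is left untouched.
low, high = train_fe["target"].quantile([0.01, 0.99])
train_fe = train_fe[train_fe["target"].between(low, high)]
```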
Likewise, since our group emphasized interpretability for our model, we refrained from using dimensionality reduction methods such as PCA.
To preface this section, we would like to emphasize that we completely excluded models with little interpretability, such as neural networks. Our group therefore prioritized models that allow us to visualize feature importance, such as linear models and forest models.
We started by training a linear model, a decision tree and a random forest as baselines. We finally opted for an XGBoost model, a gradient-boosted tree ensemble known to perform well on tabular data.
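A sketch of the comparison, assuming `X_train`, `y_train`, `X_val`, `y_val` are the preprocessed splits from the previous steps; the hyperparameters are illustrative.

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import explained_variance_score
from xgboost import XGBRegressor

# Baselines plus the gradient-boosted model we finally kept
models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "xgboost": XGBRegressor(n_estimators=300, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    score = explained_variance_score(y_val, model.predict(X_val))
    print(f"{name}: explained variance = {score:.3f}")
```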
We then fine-tuned the model by performing a grid search over the hyperparameters, and used cross-validation to assess the quality of the model. The metric used to evaluate the model is the explained variance score.
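The grid search can be sketched as follows; the grid values and fold count are examples, not the exact configuration used during the hackathon.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Example grid over the main XGBoost hyperparameters
param_grid = {
    "n_estimators": [200, 400, 800],
    "max_depth": [4, 6, 8],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    XGBRegressor(random_state=0),
    param_grid,
    scoring="explained_variance",  # the metric we report
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```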
At inference time, the data will be preprocessed using our single pipeline and the trained model will be loaded.
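A hypothetical inference script could look like this; the file names are placeholders and `engineer_features` refers to the preprocessing sketch above.

```python
import joblib
import pandas as pd

model = joblib.load("xgboost_model.joblib")   # model saved after fine-tuning
new_data = pd.read_csv("new_buildings.csv")   # raw data submitted by a user
features = engineer_features(new_data)        # same preprocessing as at training time
predictions = model.predict(features)
```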
We plan to deploy the AI solution as Software as a Service (SaaS) on a cloud platform. This will allow us to scale the solution easily and make it available to a large number of users. We could deploy it both as an API, so that users can easily request the service programmatically, and as a platform with a user-friendly interface.
Since the model is explainable, users could also view the most important features that drive the energy consumption of their buildings.
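For instance, assuming `model` is the fitted XGBoost regressor and `features` the preprocessed input from the previous sketch, the most influential features could be surfaced like this:

```python
import pandas as pd

# Rank features by their importance in the fitted model
importances = pd.Series(model.feature_importances_, index=features.columns)
print(importances.sort_values(ascending=False).head(10))
```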
The AI solution can be used by private and public companies as well as by private individuals.
Furthermore, in the case where we could receive new data, we could retrain the model to keep it up to date.
The group tried to limit the models to classic Machine Learning approaches (no Deep Learning) and to work with relatively lightweight models whenever possible.
The energy consumption of GPUs is notoriously high, so we decided against using GPUs for training our models. Moreover, we shortened training time by using a smaller but better dataset.
We also slept and ate locally :)
During these intense 48 hours, we were able to develop a model that predicts the energy consumption of buildings with a good explained variance score.
We have also been able to prepare a business pitch that leverages the model to estimate energy and cost savings. We have focused on performance and explainability to enable our future customers to anticipate, identify and solve their problems.