This project classifies the decay signatures of events measured at the Large Hadron Collider at CERN, using logistic regression to predict whether each event is the signature of a Higgs boson or not.
The problem was part of a machine learning challenge hosted on AICrowd. Our team, pasta-balalaika, reached position 50 out of 307 on the leaderboard, with an F1 score of 0.74 and an accuracy of 0.82. The project was also completed as an assignment for the EPFL course CS-433 Machine Learning.
The Higgs boson is an elementary particle in the Standard Model of particle physics that explains why other particles have mass. Its discovery at the Large Hadron Collider at CERN was announced in March 2013.
In this project, we applied machine learning techniques to actual CERN particle accelerator data to recreate the process of “discovering” the Higgs particle. Physicists at CERN smash protons into one another at high speeds to generate even smaller particles as by-products of the collisions. Rarely, these collisions can produce a Higgs boson. Since the Higgs boson decays rapidly into other particles, scientists don’t observe it directly, but rather measure its “decay signature”, or the products that result from its decay process.
Since many decay signatures look similar, we estimated the likelihood that a given event's signature was produced by a Higgs boson (signal) rather than by some other process or particle (background). To do this, we implemented a preprocessing pipeline and several binary classification techniques, and compared their performance using hyperparameter tuning and cross-validation.
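As a minimal sketch of the core technique (the project's actual implementations live in implementations.py; the function names and hyperparameters below are illustrative, not the exact ones used), L2-regularized logistic regression trained by gradient descent looks like this:

```python
import numpy as np

def sigmoid(z):
    # Numerically stable logistic function
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def logistic_regression_gd(y, tx, max_iters=1000, gamma=0.1, lambda_=0.0):
    """Train (optionally L2-regularized) logistic regression by gradient descent.

    y  : labels in {0, 1}, shape (N,)
    tx : feature matrix, shape (N, D)
    """
    w = np.zeros(tx.shape[1])
    for _ in range(max_iters):
        pred = sigmoid(tx @ w)                              # P(signal | features)
        grad = tx.T @ (pred - y) / len(y) + 2 * lambda_ * w  # averaged gradient + L2 term
        w -= gamma * grad
    return w

def predict(tx, w):
    # Classify as signal (1) vs. background (0) at the 0.5 threshold
    return (sigmoid(tx @ w) >= 0.5).astype(int)
```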
Download this repository as a zip file and extract it into a folder. The easiest way to run the code is to install the Anaconda 3 distribution (available for Windows, macOS and Linux). To do so, follow the guidelines on the official website (select the Python 3 version): https://www.anaconda.com/download/
The required package versions are specified in the requirements.txt file; you can install them by running the following commands in the Anaconda Prompt (anaconda3):
cd *THE_FOLDER_PATH_WHERE_YOU_DOWNLOADED_AND_EXTRACTED_THIS_REPOSITORY*
conda install --file requirements.txt
Download the training and testing datasets here (logging in to AICrowd might be required to download them).
Then run run.py with the following command to train and test the model:
python run.py
- experiments/experiments_models.ipynb: Jupyter notebook containing our cross-validation and hyperparameter experiments with different models (a minimal cross-validation sketch follows this list)
- experiments/experiments_preprocesing.ipynb: Jupyter notebook containing our experiments with different preprocessing techniques (a preprocessing sketch follows this list)
- experiments/generate_graphs.ipynb: notebook that generates the graphs for the report
- helper.py: helper functions used for setting up our experiments
- implementations.py: the 6 required functions, plus additional minimization algorithms and the corresponding loss functions
- metrics.py: our implementations of different metrics
- preprocessing.py: methods for preprocessing the data
- report.pdf: the project report
- run.py: code for reproducing our best submission file
- utils.py: miscellaneous functions, e.g. for loading and splitting the data
- requirements.txt: package requirements for running the code
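The model experiments select hyperparameters with k-fold cross-validation. As a hedged sketch of the idea (not the notebooks' actual code; train_fn and score_fn are illustrative placeholders for a training routine and a metric such as the F1 score):

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=1):
    # Shuffle the sample indices once, then split them into k roughly equal folds
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def cross_validate(y, tx, train_fn, score_fn, k=5):
    """Average a validation score over k train/validation splits."""
    folds = k_fold_indices(len(y), k)
    scores = []
    for i in range(k):
        va = folds[i]                                                # validation fold
        tr = np.concatenate([folds[j] for j in range(k) if j != i])  # remaining folds
        w = train_fn(y[tr], tx[tr])
        scores.append(score_fn(y[va], tx[va], w))
    return np.mean(scores)
```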
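For preprocessing, the challenge data marks undefined measurements with the sentinel value -999.0; a typical pipeline replaces these with column medians and standardizes each feature. The actual choices are in preprocessing.py and the report; the sketch below uses illustrative names:

```python
import numpy as np

def clean_and_standardize(tx, sentinel=-999.0):
    """Replace sentinel values with column medians, then standardize features.

    Returns the cleaned matrix and the (mean, std) statistics so the same
    transformation can be reapplied to the test set.
    """
    tx = tx.copy()
    for j in range(tx.shape[1]):
        col = tx[:, j]
        mask = col == sentinel
        if mask.any() and not mask.all():
            col[mask] = np.median(col[~mask])   # impute with the observed median
    mean, std = tx.mean(axis=0), tx.std(axis=0)
    std[std == 0] = 1.0                         # guard against constant columns
    return (tx - mean) / std, mean, std
```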
For further details on the implementation choices and the experiments, please read the report.pdf file.
Python, PyTorch, Matplotlib, Jupyter notebooks. Machine learning, logistic regression, analysis of the impact of different preprocessing techniques on training, shallow modelling, plotting the experiments, ensuring reproducibility.