This project served as my MSc thesis at the Indian Institute of Technology, Delhi, under the supervision of Professor Abhishek M. Iyer, Department of Physics. The objective was to utilize ML techniques to analyze particle collisions at the Large Hadron Collider and detect anomalies.
The project focuses on the classification of simulated image data known as Jet images, which represent the collimated sprays of particles produced in these collisions. The task was to apply several machine learning classification techniques to categorize the Jet images into two distinct classes: Signal and Background events.
An example of a Jet image is given below:
The first ML algorithm employed was Principal Component Analysis (PCA), a dimensionality reduction technique modeled on the Eigenfaces method of face recognition. Each image is projected onto a small set of principal components, and images are then compared by their proximity in this reduced feature space. The accuracy achieved on the test set was 98.04%.
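A minimal sketch of this approach, assuming the Jet images have been flattened into row vectors in arrays `X_train`/`X_test` with binary labels `y_train`/`y_test` (these names, the component count, and the nearest-neighbour classifier are illustrative assumptions, not the project's exact pipeline):

```python
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def pca_classify(X_train, y_train, X_test, y_test, n_components=50):
    # Project images onto the leading principal components ("eigenjets"),
    # analogous to the eigenfaces decomposition used in face recognition.
    pca = PCA(n_components=n_components, whiten=True).fit(X_train)
    Z_train = pca.transform(X_train)
    Z_test = pca.transform(X_test)

    # Classify each test image by its proximity to training images
    # in the reduced feature space.
    clf = KNeighborsClassifier(n_neighbors=5).fit(Z_train, y_train)
    return accuracy_score(y_test, clf.predict(Z_test))
```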
We then moved to Convolutional Neural Networks (CNNs), applying deep learning to the image analysis. The CNN architecture we used is shown in the image below:
We trained the algorithm on two different datasets:
- Dataset 1: This was a balanced dataset of 100 images of each class, resulting in a total of 200 images of dimensions (480, 640, 3). The accuracy achieved by the CNN model on the test set after training for 40 epochs was 77.4%. The ROC curve is shown below.
- Dataset 2: This was also a balanced dataset of 200 images of each class, resulting in a total of 400 images of dimensions (369, 369, 3). The accuracy achieved by the CNN model on the test set after training for 38 epochs was 92.5%. The ROC curve is shown below.
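A minimal Keras sketch of a CNN of this kind for binary Signal/Background classification; the layer sizes and filter counts below are illustrative assumptions, not the architecture pictured above:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_jet_cnn(input_shape=(369, 369, 3)):
    # Two convolution/pooling stages followed by a small dense head.
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # Signal vs. Background probability
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_jet_cnn()
# model.fit(X_train, y_train, epochs=38, validation_data=(X_val, y_val))
# The ROC curve can then be drawn from model.predict(X_test)
# using sklearn.metrics.roc_curve.
```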
To further increase the classification accuracy, the project turned to improved pre-processing of the images. Reconstruction of the images from the recorded data was formulated as an optimization problem and solved with the gradient descent algorithm.
This method was implemented using two Python libraries: NumPy and Autograd. The results were compared with another pre-processing algorithm, Wiener filter recovery. The mean Signal and Background images after applying these methods are shown below:
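A minimal sketch of the Autograd-based reconstruction, assuming a simple least-squares objective ||A(x) - b||² for a known, differentiable forward operator A; the actual forward model and any regularization used in the project may differ:

```python
import autograd.numpy as np
from autograd import grad

def reconstruct(forward, measured, x0, lr=1e-3, n_steps=500):
    # Data-fidelity objective: squared error between the forward model
    # applied to the current estimate and the recorded data.
    def loss(x):
        residual = forward(x) - measured
        return np.sum(residual ** 2)

    dloss = grad(loss)          # Autograd supplies the gradient automatically
    x = x0.copy()
    for _ in range(n_steps):
        x = x - lr * dloss(x)   # plain gradient-descent update
    return x
```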
To quantify the separation of the classes after pre-processing, the Bhattacharyya distance was calculated between the mean Signal and mean Background images (see the sketch after this list). The results were:
- Wiener: 0.279
- Autograd: 0.332
- NumPy: 0.151
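A sketch of the distance computation, treating each non-negative mean image as a discrete probability distribution over pixels; the variable names are illustrative:

```python
import numpy as np

def bhattacharyya_distance(img_a, img_b):
    # Normalize each image so its pixel intensities sum to one.
    p = img_a.ravel() / img_a.sum()
    q = img_b.ravel() / img_b.sum()
    bc = np.sum(np.sqrt(p * q))   # Bhattacharyya coefficient
    return -np.log(bc)            # Bhattacharyya distance

# d = bhattacharyya_distance(mean_signal, mean_background)
```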
To further assess the effectiveness of the pre-processing, the pre-processed images were run through the CNN to determine the change in accuracy. The unprocessed images gave an accuracy of 41.2%; the processed images gave the following accuracies:
- Autograd: 76.2%
- NumPy: 62.5%
It was therefore concluded that the Autograd-based pre-processing was the most useful for classification.
To conclude the project, the algorithms were applied to real Large Hadron Collider data available on the CERN website: Photon data and Electron data. The two classes in this dataset correspond to electrons and photons, real-life counterparts of the signal and background events. The dataset contained 40,000 images of dimensions (32, 32, 2), with the two channels representing Energy and Time. An example image from this dataset is given below:
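A hedged sketch of how one such two-channel image might be displayed, assuming the dataset has been loaded into an array `X` of shape (N, 32, 32, 2); the variable names are illustrative, not those used in the notebook:

```python
import matplotlib.pyplot as plt

def show_hit_image(x):
    # x has shape (32, 32, 2): channel 0 holds Energy deposits, channel 1 hit Time.
    fig, axes = plt.subplots(1, 2, figsize=(8, 4))
    for ax, channel, title in zip(axes, range(2), ["Energy", "Time"]):
        im = ax.imshow(x[:, :, channel], cmap="viridis")
        ax.set_title(title)
        fig.colorbar(im, ax=ax)
    plt.show()

# show_hit_image(X[0])
```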
The Jupyter Notebook for this portion of the project shows three CNN models developed for this dataset. The model which gave the best accuracy had the following architecture:
The accuracy achieved was 67.3%.
- Jet-Images - Deep Learning Edition, L. de Oliveira, M. Kagan, L. Mackey, B. Nachman, A. Schwartzman (2015)
- Eigenfaces for Face Recognition, Matthew Turk and Alex Pentland (1991)
- Large Hadron Collider, CERN
- Collider Physics Lecture I, Sreerup Raychaudhuri, TIFR
- Fourier Optics and Computational Imaging, Kedar Khare
- Introduction to Elementary Particles (3rd Edition), David Griffiths