Author: Andrés Felipe Duque Bran
This repository contains a collection of scripts and modules for preprocessing, training, and evaluating an anomaly detection model applied to particle physics, specifically to the LHC Olympics 2020 challenge. The main components are data preprocessing, autoencoder training, and anomaly detection using BumpHunter. The external dataset is prepared via a Bash script, and several Python scripts handle specific analysis tasks.
- Python 3.9 or higher
- Git
- Clone the repository:
    git clone https://github.com/afduquebr/Autoencoder.git
    cd Autoencoder
- Prepare the external dataset:
  Note: The setup_dataset.sh script should be run only once, immediately after cloning the repository. It clones, preprocesses, and sets up the necessary datasets from the LHC Olympics challenge.
    chmod +x setup_dataset.sh
    ./setup_dataset.sh
  This script clones the dataset repository into the directory ../LHCO-Dataset, sets up a virtual environment, modifies the files affected by deprecated dependencies, and performs data clustering.
- Install Python dependencies:
  Create and activate a virtual environment, then install the dependencies:
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
- Preprocessing: The Preprocessor class defined in preprocessing.py standardizes and prepares the data for training. This class is already used by the scripts for training, testing, and applying the algorithm.
- Training: Define and train the autoencoder model using the AutoEncoder class. Use train.py to run the training pipeline. To train the model with 0.5% of RnD signal:
    python3 train.py --dataset sig --anomaly 0.5
- Testing and Evaluation: The test.py script evaluates the model's performance on the test set. It calculates reconstruction errors, plots histograms, and computes metrics such as ROC curves and signal efficiencies. To test the model with 0.5% of RnD signal:
    python3 test.py --dataset sig --anomaly 0.5
- Histogram Plotting: The hist.py script generates histograms of the dataset and signal before and after they pass through the model. To obtain the histograms with 0.5% of RnD signal:
    python3 hist.py --dataset sig --anomaly 0.5
- Anomaly Detection: The apply.py script runs the BumpHunter algorithm to identify significant deviations in the mass spectrum, which would indicate potential new physics signals. To apply BumpHunter to the model with 0.5% of RnD signal:
    python3 apply.py --dataset sig --anomaly 0.5
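
To give a feel for the idea behind the bump-hunting step, here is a minimal, simplified sketch in plain Python. It is not the code in apply.py and not the API of any BumpHunter library: the function names `poisson_tail` and `scan_for_bump`, the window widths, and the toy histograms are all assumptions made for illustration. The sketch slides windows of several widths over a binned mass spectrum and keeps the window whose excess over the background has the smallest local Poisson p-value. The full algorithm additionally calibrates a global p-value with background-only pseudo-experiments, which is omitted here.

```python
import math


def poisson_tail(observed, expected):
    """Local p-value P(N >= observed) for N ~ Poisson(expected).

    Simplification (assumption): only excesses count as bump candidates,
    so any window with observed <= expected gets a p-value of 1.
    """
    if observed <= expected:
        return 1.0
    if expected <= 0:
        return 0.0
    # First term of the upper tail, computed in log space to limit underflow.
    log_term = -expected + observed * math.log(expected) - math.lgamma(observed + 1)
    term = math.exp(log_term)
    total = term
    k = observed
    # Terms shrink monotonically because k > expected, so the loop terminates.
    while term > total * 1e-16:
        k += 1
        term *= expected / k
        total += term
    return total


def scan_for_bump(data_counts, bkg_counts, min_width=1, max_width=3):
    """Return (p_value, start_bin, stop_bin) of the most significant excess.

    The window covers bins start_bin .. stop_bin - 1.
    """
    best = None
    n_bins = len(data_counts)
    for width in range(min_width, max_width + 1):
        for start in range(n_bins - width + 1):
            stop = start + width
            observed = sum(data_counts[start:stop])
            expected = sum(bkg_counts[start:stop])
            p_value = poisson_tail(observed, expected)
            if best is None or p_value < best[0]:
                best = (p_value, start, stop)
    return best


# Toy binned mass spectra (made-up numbers): the data contain an excess
# over the falling background in bins 3 and 4.
background = [100, 80, 60, 45, 30, 20, 12, 8]
data = [98, 83, 58, 70, 52, 19, 13, 7]

p_local, first_bin, end_bin = scan_for_bump(data, background)
print(f"Most significant excess in bins [{first_bin}, {end_bin}) "
      f"with local p-value {p_local:.2e}")
```

In a real analysis the background histogram would come from a reference sample or a fit, and the local p-value would be converted into a global significance to account for the number of windows tested.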
Autoencoder
│
├── autoencoder.py # Autoencoder model and loss functions
├── preprocessing.py # Data preprocessing utilities
├── main.py # Script for specifying signal insertion
├── train.py # Main script for training the autoencoder
├── test.py # Script for testing and generating results
├── apply.py # Script for the BumpHunter analysis
├── setup_dataset.sh # Bash script to prepare LHCO datasets
├── README.md # This README file
├── requirements.txt # Python dependencies
├── figs/ # Training, testing and BH analysis figures
├── models/ # Directory containing trained models
└── selection/ # Directory containing feature selection files
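
As a rough illustration of the evaluation that test.py is described as performing, the sketch below scores events by reconstruction error and builds a ROC curve from those scores. It is an assumption-laden example, not the repository's code: the mean squared error is assumed as the anomaly score (the actual loss functions live in autoencoder.py), the helper names `reconstruction_error`, `roc_points`, and `area_under_curve` are hypothetical, and the scores are made-up numbers.

```python
def reconstruction_error(event, reconstruction):
    """Mean squared error between an event and its autoencoder output."""
    return sum((a - b) ** 2 for a, b in zip(event, reconstruction)) / len(event)


def roc_points(scores, labels):
    """Return (background efficiency, signal efficiency) pairs.

    labels are 1 for signal and 0 for background. Events are accepted in
    order of decreasing score. Assumes distinct scores and that both
    classes are present.
    """
    n_signal = sum(labels)
    n_background = len(labels) - n_signal
    points = [(0.0, 0.0)]
    accepted_signal = accepted_background = 0
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    for _, label in ranked:
        if label:
            accepted_signal += 1
        else:
            accepted_background += 1
        points.append((accepted_background / n_background,
                       accepted_signal / n_signal))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up anomaly scores: signal events tend to be reconstructed worse.
background_scores = [0.10, 0.20, 0.15, 0.30]
signal_scores = [0.80, 0.25, 0.90]
scores = background_scores + signal_scores
labels = [0] * len(background_scores) + [1] * len(signal_scores)

curve = roc_points(scores, labels)
print("ROC points:", curve)
print("AUC:", area_under_curve(curve))
```

Each point on the curve corresponds to one threshold on the reconstruction error: the signal efficiency is the fraction of signal events above the threshold, and the background efficiency is the fraction of background events above it.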
This work was developed as an internship project for the Master 2 programme in Fundamental Physics and Applications (Universe and Particles track) at the Université Clermont Auvergne, in collaboration with the Laboratoire de Physique de Clermont Auvergne. It was carried out under the supervision of professors Julien Donini and Samuel Calvet.