SparkDLTrigger

This repository contains the code and notebooks accompanying the blog article Machine Learning Pipelines for High Energy Physics Using Apache Spark with BigDL and Analytics Zoo.
Principal author of the notebooks: Matteo.Migliorini@cern.ch
Contacts: Luca.Canali@cern.ch; Matteo.Migliorini@cern.ch; Riccardo.Castellotti@cern.ch
Acknowledgements: Viktor Khristenko, Maria Girone, Maurizio Pierini, Thong Nguyen, Members of the Hadoop and Spark service at CERN
Intel for BigDL and Analytics Zoo consultancy: Jiao (Jennie) Wang and Sajan Govindan

Physics use case

Event data collected from the particle detector (the CMS experiment) contains different types of event topologies of interest. A particle classifier built with neural networks can be used as an event filter, improving on the state of the art in accuracy.
This work reproduces the findings of the paper Topology classification with deep learning to improve real-time event selection at the LHC, using tools from the Big Data ecosystem, notably Apache Spark and BigDL/Analytics Zoo.

Physics use case for the particle classifier

Data pipeline

Data pipelines are of paramount importance to making machine learning projects successful: they integrate the multiple components and APIs used for data processing across the entire data chain. A good data pipeline implementation can accelerate and improve the productivity of the work around the core machine learning tasks. The four steps of the pipeline we built are listed below; a short sketch of each step follows the list:

  • Data Ingestion: where we read data in ROOT format from the CERN EOS storage system into a Spark DataFrame and save the results as a table stored in Apache Parquet files
  • Feature Engineering and Event Selection: where the Parquet files containing all the event details produced in Data Ingestion are filtered, and datasets with new features are produced
  • Parameter Tuning: where the best set of hyperparameters for each model architecture is found by performing a grid search
  • Training: where the best models found in the previous step are trained on the entire dataset.

Machine learning data pipeline
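The Data Ingestion step can be sketched as follows. This is a minimal sketch, assuming the spark-root data source (Maven coordinates org.diana-hep:spark-root_2.11) is available on the cluster; the EOS input path and output path are hypothetical placeholders, not the ones used in the notebooks.

```python
from pyspark.sql import SparkSession

# Start a Spark session with the spark-root package on the classpath
# (assumed coordinates; the version may differ in your environment)
spark = (SparkSession.builder
         .appName("DataIngestion")
         .config("spark.jars.packages", "org.diana-hep:spark-root_2.11:0.1.16")
         .getOrCreate())

# Read the detector events from ROOT format into a Spark DataFrame
df = (spark.read
      .format("org.dianahep.sparkroot")
      .load("root://eosuser.cern.ch//eos/project/example/events.root"))  # hypothetical path

# Save the events as a table stored in Apache Parquet files
df.write.mode("overwrite").parquet("hdfs:///user/example/events.parquet")  # hypothetical path
```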
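The Feature Engineering and Event Selection step filters the ingested events and derives new columns. A minimal sketch with an illustrative schema; the column names (nMuons, muon_pt, lead_muon_pt) and paths are assumptions, not the actual ones used in the notebooks:

```python
from pyspark.sql import functions as F

# Load the events produced by the Data Ingestion step (hypothetical path)
events = spark.read.parquet("hdfs:///user/example/events.parquet")

# Event selection: keep only events passing a simple quality cut
selected = events.filter(F.col("nMuons") >= 1)

# Feature engineering: derive a new feature from the raw event details
features = selected.withColumn("lead_muon_pt", F.element_at("muon_pt", 1))

# Save the filtered dataset with the new features (hypothetical path)
features.write.mode("overwrite").parquet("hdfs:///user/example/features.parquet")
```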
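The Parameter Tuning step performs a grid search over candidate hyperparameters. A minimal sketch: train_and_evaluate is a hypothetical helper standing in for a full train/validate cycle, and the parameter names and values are illustrative.

```python
import itertools

def train_and_evaluate(learning_rate, batch_size):
    # Hypothetical helper: train a candidate model with the given
    # hyperparameters and return its validation AUC (stubbed here).
    return 0.0

grid = {"learning_rate": [0.001, 0.01], "batch_size": [64, 128]}

# Try every combination in the grid and keep the best-scoring one
best_auc, best_params = float("-inf"), None
for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    auc = train_and_evaluate(**params)
    if auc > best_auc:
        best_auc, best_params = auc, params

print("best hyperparameters:", best_params, "validation AUC:", best_auc)
```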
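The Training step trains the selected models at scale on Spark. A minimal sketch using the Analytics Zoo Keras-style API; the toy data, layer sizes (14 input features, 3 output classes) and hyperparameters are illustrative, and the actual model architectures are defined in the notebooks.

```python
import numpy as np
from zoo.common.nncontext import init_nncontext
from zoo.pipeline.api.keras.models import Sequential
from zoo.pipeline.api.keras.layers import Dense
from bigdl.util.common import Sample

sc = init_nncontext("Training")

# Toy stand-in for the real feature dataset: an RDD of BigDL Samples with
# 14 random features and a zero-based integer class label in {0, 1, 2}
train_rdd = sc.parallelize(range(256)).map(
    lambda i: Sample.from_ndarray(np.random.rand(14), np.array(float(i % 3))))

# A small fully connected classifier (illustrative architecture)
model = Sequential()
model.add(Dense(50, activation="relu", input_shape=(14,)))
model.add(Dense(3, activation="softmax"))

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Distributed training on the Spark cluster
model.fit(train_rdd, batch_size=128, nb_epoch=2)
```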

Results

The results of the DL model training are satisfactory and match the results of the original research paper.

Training loss convergence, ROC and AUC

Additional info
