forked from JuliaHEP/UnROOT.jl
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* Add skeleton
* Add open journal PDF generator GHA
* Fix path to source file
* Set output path
* summary and statement of need
* Minor cosmetics
* Add comparison intro and uproot
* Add uproot reference
* add functionality
* add performance citation
* small additions
* for ci
* Cite pandas and numpy
* Add ROOT citation
* Cleanup and some additions
* Minor updates and additions
* Update references
* Cleanup and additions
* Update conclusions with Jerry's comments
* Fix typo
* Add NanoAOD refernce
* Add reference to CMS/NanoAOD
* Fix pandas reference
* Add spaces between words and references
* Fix typo

Co-authored-by: Jerry Ling <proton@jling.dev>
Co-authored-by: Nick Amin <amin.nj@gmail.com>
1 parent bbbe3d3, commit a844fd9
Showing 4 changed files with 361 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
on: [push] | ||
|
||
jobs: | ||
  paper: | ||
    runs-on: ubuntu-latest | ||
    name: Paper Draft | ||
    steps: | ||
      - name: Checkout | ||
        uses: actions/checkout@v2 | ||
      - name: Build draft PDF | ||
        uses: openjournals/openjournals-draft-action@master | ||
        with: | ||
          journal: joss | ||
          # This should be the path to the paper within your repo. | ||
          paper-path: paper/paper.md | ||
      - name: Upload | ||
        uses: actions/upload-artifact@v1 | ||
        with: | ||
          name: paper | ||
          # This is the output path where Pandoc will write the compiled | ||
          # PDF. Note, this should be the same directory as the input | ||
          # paper.md | ||
          path: paper/paper.pdf |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
MIT License | ||
|
||
Copyright (c) 2020 Tamas Gal, Jerry Ling and Nick Amin | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining a copy | ||
of this software and associated documentation files (the "Software"), to deal | ||
in the Software without restriction, including without limitation the rights | ||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
copies of the Software, and to permit persons to whom the Software is | ||
furnished to do so, subject to the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be included in all | ||
copies or substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||
SOFTWARE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,143 @@ | ||
@article{Julia, | ||
author = {Bezanson, Jeff. and Edelman, Alan. and Karpinski, Stefan. and Shah, Viral B.}, | ||
title = {Julia: A Fresh Approach to Numerical Computing}, | ||
journal = {SIAM Review}, | ||
volume = {59}, | ||
number = {1}, | ||
pages = {65-98}, | ||
year = {2017}, | ||
doi = {10.1137/141000671}, | ||
URL = { https://doi.org/10.1137/141000671}, | ||
eprint = {https://doi.org/10.1137/141000671} | ||
} | ||
|
||
@article{JuliaPerformance, | ||
title={Performance of {Julia} for High Energy Physics Analyses}, | ||
author={Stanitzki, Marcel and Strube, Jan}, | ||
journal={Computing and Software for Big Science}, | ||
volume={5}, | ||
number={1}, | ||
pages={1--11}, | ||
year={2021}, | ||
publisher={Springer} | ||
} | ||
|
||
@software{jim_pivarski_2021_5539722, | ||
author = {Jim Pivarski and | ||
Henry Schreiner and | ||
Nicholas Smith and | ||
Chris Burr and | ||
Dmitry Kalinkin and | ||
Giordon Stark and | ||
Nikolai Hartmann and | ||
Doug Davis and | ||
Ryunosuke O'Neil and | ||
Andrzej Novak and | ||
Ben Greiner and | ||
Beojan Stanislaus and | ||
ChristopheRappold and | ||
Cosmin Deaconu and | ||
Daniel Cervenkov and | ||
Jonas Rübenach and | ||
Josh Bendavid and | ||
Kilian Lieret and | ||
Michele Peresano and | ||
Raymond Ehlers and | ||
Ruggero Turra and | ||
Tamas Gal and | ||
Alexander Held}, | ||
title = {scikit-hep/uproot4: 4.1.3}, | ||
month = sep, | ||
year = 2021, | ||
publisher = {Zenodo}, | ||
version = {4.1.3}, | ||
doi = {10.5281/zenodo.5539722}, | ||
url = {https://doi.org/10.5281/zenodo.5539722} | ||
} | ||
@software{pivarski_jim_2018_6522027, | ||
author = {Pivarski, Jim and | ||
Osborne, Ianna and | ||
Ifrim, Ioana and | ||
Schreiner, Henry and | ||
Hollands, Angus and | ||
Biswas, Anish and | ||
Das, Pratyush and | ||
Roy Choudhury, Santam and | ||
Smith, Nicholas}, | ||
title = {Awkward Array}, | ||
month = oct, | ||
year = 2018, | ||
note = {If you use this software, please cite it as below.}, | ||
publisher = {Zenodo}, | ||
version = {1.9.0rc4}, | ||
doi = {10.5281/zenodo.6522027}, | ||
url = {https://doi.org/10.5281/zenodo.6522027} | ||
} | ||
@Article{harris2020array, | ||
title = {Array programming with {NumPy}}, | ||
author = {Charles R. Harris and K. Jarrod Millman and St{\'{e}}fan J. | ||
van der Walt and Ralf Gommers and Pauli Virtanen and David | ||
Cournapeau and Eric Wieser and Julian Taylor and Sebastian | ||
Berg and Nathaniel J. Smith and Robert Kern and Matti Picus | ||
and Stephan Hoyer and Marten H. van Kerkwijk and Matthew | ||
Brett and Allan Haldane and Jaime Fern{\'{a}}ndez del | ||
R{\'{i}}o and Mark Wiebe and Pearu Peterson and Pierre | ||
G{\'{e}}rard-Marchant and Kevin Sheppard and Tyler Reddy and | ||
Warren Weckesser and Hameer Abbasi and Christoph Gohlke and | ||
Travis E. Oliphant}, | ||
year = {2020}, | ||
month = sep, | ||
journal = {Nature}, | ||
volume = {585}, | ||
number = {7825}, | ||
pages = {357--362}, | ||
doi = {10.1038/s41586-020-2649-2}, | ||
publisher = {Springer Science and Business Media {LLC}}, | ||
url = {https://doi.org/10.1038/s41586-020-2649-2} | ||
} | ||
@software{reback2020pandas, | ||
author = {{The pandas development team}}, | ||
title = {Pandas}, | ||
month = feb, | ||
year = 2020, | ||
publisher = {Zenodo}, | ||
version = {latest}, | ||
doi = {10.5281/zenodo.3509134}, | ||
url = {https://doi.org/10.5281/zenodo.3509134} | ||
} | ||
@article{Brun:1997pa, | ||
author = "Brun, R. and Rademakers, F.", | ||
editor = "Werlen, M. and Perret-Gallix, D.", | ||
title = "{ROOT: An object oriented data analysis framework}", | ||
doi = "10.1016/S0168-9002(97)00048-X", | ||
journal = "Nucl. Instrum. Meth. A", | ||
volume = "389", | ||
pages = "81--86", | ||
year = "1997" | ||
} | ||
|
||
@article{Adri_n_Mart_nez_2016, | ||
doi = {10.1088/0954-3899/43/8/084001}, | ||
url = {https://doi.org/10.1088%2F0954-3899%2F43%2F8%2F084001}, | ||
year = 2016, | ||
month = {jun}, | ||
publisher = {{IOP} Publishing}, | ||
volume = {43}, | ||
number = {8}, | ||
pages = {084001}, | ||
author = {S Adri{\'{a} | ||
}n-Mart{\'{\i}}nez and M Ageron and F Aharonian and S Aiello and A Albert and F Ameli and E Anassontzis and M Andre and G Androulakis and M Anghinolfi and G Anton and M Ardid and T Avgitas and G Barbarino and E Barbarito and B Baret and J Barrios-Mart{\'{\i}} and B Belhorma and A Belias and E Berbee and A van den Berg and V Bertin and S Beurthey and V van Beveren and N Beverini and S Biagi and A Biagioni and M Billault and M Bond{\`{\i}} and R Bormuth and B Bouhadef and G Bourlis and S Bourret and C Boutonnet and M Bouwhuis and C Bozza and R Bruijn and J Brunner and E Buis and J Busto and G Cacopardo and L Caillat and M Calamai and D Calvo and A Capone and L Caramete and S Cecchini and S Celli and C Champion and R Cherkaoui El Moursli and S Cherubini and T Chiarusi and M Circella and L Classen and R Cocimano and J A B Coelho and A Coleiro and S Colonges and R Coniglione and M Cordelli and A Cosquer and P Coyle and A Creusot and G Cuttone and A D'Amico and G De Bonis and G De Rosa and C De Sio and F Di Capua and I Di Palma and A F D{\'{\i}}az Garc{\'{\i}}a and C Distefano and C Donzaud and D Dornic and Q Dorosti-Hasankiadeh and E Drakopoulou and D Drouhin and L Drury and M Durocher and T Eberl and S Eichie and D van Eijk and I El Bojaddaini and N El Khayati and D Elsaesser and A Enzenhöfer and F Fassi and P Favali and P Fermani and G Ferrara and C Filippidis and G Frascadore and L A Fusco and T Gal and S Galat{\`{a}} and F Garufi and P Gay and M Gebyehu and V Giordano and N Gizani and R Gracia and K Graf and T Gr{\'{e}}goire and G Grella and R Habel and S Hallmann and H van Haren and S Harissopulos and T Heid and A Heijboer and E Heine and S Henry and J J Hern{\'{a}}ndez-Rey and M Hevinga and J Hofestädt and C M F Hugon and G Illuminati and C W James and P Jansweijer and M Jongen and M de Jong and M Kadler and O Kalekin and A Kappes and U F Katz and P Keller and G Kieft and D Kie{\ss}ling and E N Koffeman and P Kooijman and A Kouchner and V Kulikovskiy and R Lahmann 
and P Lamare and A Leisos and E Leonora and M Lindsey Clark and A Liolios and C D Llorens Alvarez and D Lo Presti and H Löhner and A Lonardo and M Lotze and S Loucatos and E Maccioni and K Mannheim and A Margiotta and A Marinelli and O Mari{\c{s}} and C Markou and J A Mart{\'{\i}}nez-Mora and A Martini and R Mele and K W Melis and T Michael and P Migliozzi and E Migneco and P Mijakowski and A Miraglia and C M Mollo and M Mongelli and M Morganti and A Moussa and P Musico and M Musumeci and S Navas and C A Nicolau and I Olcina and C Olivetto and A Orlando and A Papaikonomou and R Papaleo and G E P{\u{a}}v{\u{a}}la{\c{s}} and H Peek and C Pellegrino and C Perrina and M Pfutzner and P Piattelli and K Pikounis and G E Poma and V Popa and T Pradier and F Pratolongo and G Pühlhofer and S Pulvirenti and L Quinn and C Racca and F Raffaelli and N Randazzo and P Rapidis and P Razis and D Real and L Resvanis and J Reubelt and G Riccobene and C Rossi and A Rovelli and M Salda{\~{n}}a and I Salvadori and D F E Samtleben and A S{\'{a}}nchez Garc{\'{\i}}a and A S{\'{a}}nchez Losa and M Sanguineti and A Santangelo and D Santonocito and P Sapienza and F Schimmel and J Schmelling and V Sciacca and M Sedita and T Seitz and I Sgura and F Simeone and I Siotis and V Sipala and B Spisso and M Spurio and G Stavropoulos and J Steijger and S M Stellacci and D Stransky and M Taiuti and Y Tayalati and D T{\'{e}}zier and S Theraube and L Thompson and P Timmer and C Tönnis and L Trasatti and A Trovato and A Tsirigotis and S Tzamarias and E Tzamariudaki and B Vallage and V Van Elewyck and J Vermeulen and P Vicini and S Viola and D Vivolo and M Volkert and G Voulgaris and L Wiggers and J Wilms and E de Wolf and K Zachariadou and J D Zornoza and J Z{\'{u}}{\~{n}}iga}, | ||
title = {Letter of intent for {KM}3NeT 2.0}, | ||
journal = {Journal of Physics G: Nuclear and Particle Physics} | ||
} | ||
@article{Ehataht:2020ebp, | ||
author = {Ehat\"aht, Karl}, | ||
editor = "Doglioni, C. and Kim, D. and Stewart, G. A. and Silvestris, L. and Jackson, P. and Kamleh, W.", | ||
collaboration = "CMS", | ||
title = "{NANOAOD: a new compact event data format in CMS}", | ||
doi = "10.1051/epjconf/202024506002", | ||
journal = "EPJ Web Conf.", | ||
volume = "245", | ||
pages = "06002", | ||
year = "2020" | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,174 @@ | ||
--- | ||
title: 'UnROOT: an I/O library for the CERN ROOT file format written in Julia' | ||
tags: | ||
- Julia | ||
- HEP | ||
authors: | ||
- name: Tamás Gál | ||
orcid: 0000-0001-7821-8673 | ||
affiliation: "1, 2" | ||
- name: Jerry (Jiahong) Ling | ||
orcid: 0000-0002-3359-0380 | ||
affiliation: "3" | ||
- name: Nick Amin | ||
orcid: 0000-0003-2560-0013 | ||
affiliation: "4" | ||
affiliations: | ||
- name: Erlangen Centre for Astroparticle Physics | ||
index: 1 | ||
- name: Friedrich-Alexander-Universität Erlangen-Nürnberg | ||
index: 2 | ||
- name: Harvard University | ||
index: 3 | ||
- name: University of California, Santa Barbara | ||
index: 4 | ||
date: 08 October 2021 | ||
bibliography: paper.bib | ||
--- | ||
# Summary | ||
`UnROOT.jl` is a pure Julia implementation of CERN ROOT [@Brun:1997pa] file I/O | ||
(`.root`) that is fast, memory-efficient, and composes well with Julia's | ||
high-performance iteration, array, and multi-threading interfaces. | ||
|
||
# Statement of need | ||
The High-Energy Physics (HEP) community has been troubled by the two-language | ||
problem for a long time. Physicists often start prototyping with a | ||
`Python` front-end which glues to a `C/C++/Fortran` back-end. Sooner or later they | ||
hit a task which is extremely hard to express in the columnar (i.e. | ||
"vectorized") style normally handled with libraries like | ||
`numpy` [@harris2020array] or `pandas` [@reback2020pandas]. This usually leads to | ||
either writing `C++` kernels and interfacing them with `Python`, or porting the | ||
prototype to `C++` and maintaining two code bases including the wrapper | ||
code. Both options are engineering challenges for physicists, who usually have | ||
little or no background in software engineering. Many steps of this process are | ||
critical, such as identifying bottlenecks and creating an architecture which is | ||
performant and maintainable while still being user-friendly and | ||
logically structured. Using a `Python` front-end and dancing across language | ||
barriers also makes it harder to parallelize tasks that are, most of the time, | ||
conceptually trivial. | ||
|
||
`UnROOT.jl` attempts to solve all of the above by choosing Julia, a | ||
high-performance language with simple and expressive syntax [@Julia]. Julia is | ||
designed to solve the two-language problem in general. This has been studied for | ||
HEP specifically as well [@JuliaPerformance]. Analysis software written in Julia | ||
can freely escape to a `for-loop` whenever vectorized-style processing is not | ||
flexible enough, without any performance degradation. At the same time, | ||
`UnROOT.jl` transparently supports multi-threading and multi-processing by | ||
simply providing data structures which are subtypes of `AbstractArray`, the | ||
built-in abstract type for array-like objects. Thanks to multiple dispatch, one | ||
of the main features of Julia, these data structures interface easily with | ||
array routines from other packages. | ||
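
To illustrate why subtyping `AbstractArray` is enough to gain access to generic array routines, here is a minimal sketch that uses only Julia's standard library. The type `LazySquares` is hypothetical and merely stands in for a lazily evaluated column; it is not part of `UnROOT.jl` and says nothing about how `UnROOT.jl` is implemented internally.

```julia
# Hypothetical stand-in for a lazily computed column (NOT an UnROOT.jl type).
# Elements are produced on access instead of being stored in memory.
struct LazySquares <: AbstractVector{Int}
    n::Int
end

# The minimal AbstractArray interface: size and scalar getindex.
Base.size(v::LazySquares) = (v.n,)
Base.IndexStyle(::Type{LazySquares}) = IndexLinear()
function Base.getindex(v::LazySquares, i::Int)
    checkbounds(v, i)   # throws a BoundsError for invalid indices
    return i^2
end

v = LazySquares(5)

# Generic routines written for AbstractArray work without extra methods.
total    = sum(v)             # reduction
evens    = filter(iseven, v)  # filtering, returns a plain Vector{Int}
last_two = v[4:5]             # range indexing via the generic fallback
```

Only two methods had to be defined; everything else is dispatched to generic code written against the `AbstractArray` interface.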
|
||
# Features and Functionality | ||
|
||
The `ROOT` data format is flexible and mostly self-descriptive. Users can define | ||
their own data structures (C++ classes) which derive from `ROOT` classes and | ||
serialise them into directories, trees and branches. The information needed for | ||
deserialisation is written to the output file (hence self-descriptive), but | ||
some basic structures and constants are needed to bootstrap the parsing | ||
process. One of the biggest advantages of the `ROOT` data format is the ability | ||
to store jagged structures like nested arrays of structs with different | ||
sub-array lengths. In high-energy physics, such structures are preferred for | ||
representing e.g. particle interactions and detector responses as signals from | ||
different hardware components, combined into a tree of events. | ||
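
As a rough illustration of what "jagged" means in practice, the following sketch stores sub-arrays of different lengths as one flat buffer plus an offsets array, which is one common columnar layout for such data. The values and the names `content`, `offsets` and `event` are made up for this example; the sketch does not claim to reproduce the on-disk `ROOT` layout or the internals of `UnROOT.jl`.

```julia
# Flat content plus offsets: a common columnar layout for jagged data.
# All values below are invented for illustration.
content = Float32[19.9, 15.3, 7.1, 42.0, 3.3, 8.8]
offsets = [0, 0, 2, 2, 3, 6]   # event i spans offsets[i]+1 : offsets[i+1]

# Return a view of the i-th event's sub-array without copying.
event(i) = view(content, offsets[i]+1:offsets[i+1])

n_events = length(offsets) - 1
lengths  = [length(event(i)) for i in 1:n_events]
```

Events 1 and 3 are empty here, while events 2, 4 and 5 hold two, one and three values respectively, so no padding is needed to store them side by side.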
|
||
`UnROOT.jl` understands the core structure of `ROOT` files, and is able to | ||
decompress and deserialise instances of commonly used `ROOT` classes such as | ||
`TH1`, `TH2`, `TDirectory` and `TTree`. All basic C++ types for `TTree` | ||
branches are supported as well, including their nested variants. Additionally, | ||
`UnROOT.jl` provides a way to hook into the deserialisation process of custom | ||
types where the automatic parsing fails. At the time of writing, `UnROOT.jl` is | ||
already used successfully in the data analysis of the KM3NeT neutrino | ||
telescope [@Adri_n_Mart_nez_2016] and the CMS detector [@Ehataht:2020ebp]. | ||
|
||
Opening and loading a `TTree` lazily -- i.e. without reading the whole data into | ||
memory -- is simple: | ||
|
||
```julia | ||
julia> using UnROOT | ||
|
||
julia> f = ROOTFile("test/samples/NanoAODv5_sample.root") | ||
ROOTFile with 2 entries and 21 streamers. | ||
test/samples/NanoAODv5_sample.root | ||
Events | ||
"run" | ||
"luminosityBlock" | ||
"event" | ||
"HTXS_Higgs_pt" | ||
"HTXS_Higgs_y" | ||
... | ||
|
||
julia> mytree = LazyTree(f, "Events", ["Electron_dxy", "nMuon", r"Muon_(pt|eta)$"]) | ||
Row Electron_dxy nMuon Muon_eta Muon_pt | ||
Vector{Float32} UInt32 Vector{Float32} Vector{Float32} | ||
|
||
1 [0.000371] 0 [] [] | ||
2 [-0.00982] 2 [0.53, 0.229] [19.9, 15.3] | ||
3 [] 0 [] [] | ||
4 [-0.00157] 0 [] [] | ||
... | ||
``` | ||
|
||
As seen in the above example, the entries in the columns are multi-dimensional | ||
and jagged. The `LazyTree` object acts as a table which supports sequential | ||
or parallel iteration, selections and filtering based on ranges or masks, and | ||
operations on whole columns: | ||
|
||
```julia | ||
for event in mytree | ||
    # ... Operate on event | ||
end | ||
|
||
Threads.@threads for event in mytree # multi-threading | ||
    # ... Operate on event | ||
end | ||
|
||
mytree.Muon_pt # a column as a lazy vector of vectors | ||
``` | ||
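
The iteration and masking patterns above can be tried without a `ROOT` file. The sketch below uses a plain vector of vectors as a stand-in for a jagged column; the numbers and the names `muon_pt`, `leading_pt`, `mask` and `selected` are invented, and a `LazyTree` column is only assumed to behave similarly because it is described as an `AbstractArray`.

```julia
# Stand-in for a jagged column: one vector of muon pT values per event.
# (Invented numbers; with UnROOT.jl these would be read from a ROOT file.)
muon_pt = [Float32[], Float32[19.9, 15.3], Float32[],
           Float32[31.2], Float32[5.0, 44.1, 12.7]]

# One output slot per event, so no two threads write to the same element.
leading_pt = fill(NaN32, length(muon_pt))

Threads.@threads for i in eachindex(muon_pt)
    pts = muon_pt[i]
    # Events without muons keep the NaN placeholder.
    isempty(pts) || (leading_pt[i] = maximum(pts))
end

# Selection with a boolean mask: keep events with at least two muons.
mask     = length.(muon_pt) .>= 2
selected = muon_pt[mask]
```

Writing each result into its own preallocated slot avoids data races without locks. The number of threads actually used depends on how Julia was started (for example with the `--threads` command-line option).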
|
||
The `LazyTree` is designed as `<: AbstractArray` which makes it compose well | ||
with the rest of the Julia ecosystem. For example, syntactic loop fusion [^1] and | ||
Query-style tabular manipulations provided by packages like `Query.jl` [^2] work | ||
out of the box, without any additional supporting code. | ||
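
For readers unfamiliar with syntactic loop fusion, here is a small sketch on plain arrays with invented values. It uses the standard relations between transverse momentum, pseudorapidity and momentum; the variable names are illustrative and nothing here is specific to `UnROOT.jl`.

```julia
# Invented example values: transverse momentum and pseudorapidity.
pt  = Float32[19.9, 15.3, 31.2]
eta = Float32[0.53, 0.229, -1.1]

# The dotted calls fuse into a single loop; no temporary array is
# allocated for the intermediate result of sinh.(eta).
pz = pt .* sinh.(eta)

# In-place variant: @. dots every call and the assignment, so the
# result is written directly into the preallocated buffer `p`.
p = similar(pt)
@. p = pt * cosh(eta)
```

Because fusion is a property of the broadcasting syntax itself, it applies to any array type that implements the `AbstractArray` interface.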
|
||
# Comparison with existing software | ||
|
||
This section focuses on the comparison with other existing ROOT I/O solutions | ||
in the Julia universe. One honorable mention, however, is `uproot` | ||
[@jim_pivarski_2021_5539722], a purely Python-based ROOT I/O library which | ||
played (and still plays) an important role in the development of `UnROOT.jl`, as | ||
it is, at the time of writing, the most complete and best documented ROOT I/O | ||
implementation. | ||
|
||
- `UpROOT.jl` is a wrapper for the aforementioned `uproot` Python package and | ||
uses `PyCall.jl` [^3] as a bridge, which means that it relies on `Python` as a | ||
glue language. In addition to that, `uproot` itself utilises the C++ library | ||
`AwkwardArray` [@pivarski_jim_2018_6522027] to efficiently deal with jagged | ||
data in `ROOT` files. Most of the features of `uproot` are available in the | ||
Julia context, but there are intrinsic performance and usability drawbacks due | ||
to the three-language architecture. | ||
|
||
- `ROOT.jl` [^4] is one of the oldest Julia `ROOT` packages. It uses C++ bindings to | ||
directly wrap the `ROOT` framework and therefore is not limited to I/O. | ||
Unfortunately, the `Cxx.jl` [^5] package which is used to generate the C++ glue | ||
code does not support Julia 1.4 or later. Its multi-threading features are also | ||
limited. | ||
|
||
# Conclusion | ||
|
||
`UnROOT.jl` is an important package in high-energy physics and related | ||
scientific fields where the `ROOT` data format is established, since the ability | ||
to read and parse scientific data is certainly the first mandatory step to open | ||
the window to a programming language and its package ecosystem. `UnROOT.jl` has | ||
demonstrated tree processing speeds on par with the `C++` `ROOT` framework in | ||
per-event iteration and with the Python-based `uproot` library in | ||
chunked iteration. | ||
|
||
# References | ||
|
||
|
||
[^1]: https://julialang.org/blog/2017/01/moredots/ | ||
[^2]: https://github.com/queryverse/Query.jl | ||
[^3]: https://github.com/JuliaPy/PyCall.jl | ||
[^4]: https://github.com/JuliaHEP/ROOT.jl | ||
[^5]: https://github.com/JuliaInterop/Cxx.jl |