JOSS Paper #135

Merged
merged 25 commits into from
Jun 3, 2022
23 changes: 23 additions & 0 deletions .github/workflows/draft-pdf.yml
@@ -0,0 +1,23 @@
on: [push]

jobs:
  paper:
    runs-on: ubuntu-latest
    name: Paper Draft
    steps:
      - name: Checkout
        uses: actions/checkout@v2
      - name: Build draft PDF
        uses: openjournals/openjournals-draft-action@master
        with:
          journal: joss
          # This should be the path to the paper within your repo.
          paper-path: paper/paper.md
      - name: Upload
        uses: actions/upload-artifact@v1
        with:
          name: paper
          # This is the output path where Pandoc will write the compiled
          # PDF. Note, this should be the same directory as the input
          # paper.md
          path: paper/paper.pdf
21 changes: 21 additions & 0 deletions paper/LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2020 Tamas Gal, Jerry Ling and Nick Amin

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
143 changes: 143 additions & 0 deletions paper/paper.bib
@@ -0,0 +1,143 @@
@article{Julia,
author = {Bezanson, Jeff and Edelman, Alan and Karpinski, Stefan and Shah, Viral B.},
title = {Julia: A Fresh Approach to Numerical Computing},
journal = {SIAM Review},
volume = {59},
number = {1},
pages = {65-98},
year = {2017},
doi = {10.1137/141000671},
URL = {https://doi.org/10.1137/141000671},
eprint = {https://doi.org/10.1137/141000671}
}

@article{JuliaPerformance,
title={Performance of julia for high energy physics analyses},
author={Stanitzki, Marcel and Strube, Jan},
journal={Computing and Software for Big Science},
volume={5},
number={1},
pages={1--11},
year={2021},
publisher={Springer}
}

@software{jim_pivarski_2021_5539722,
author = {Jim Pivarski and
Henry Schreiner and
Nicholas Smith and
Chris Burr and
Dmitry Kalinkin and
Giordon Stark and
Nikolai Hartmann and
Doug Davis and
Ryunosuke O'Neil and
Andrzej Novak and
Ben Greiner and
Beojan Stanislaus and
ChristopheRappold and
Cosmin Deaconu and
Daniel Cervenkov and
Jonas Rübenach and
Josh Bendavid and
Kilian Lieret and
Michele Peresano and
Raymond Ehlers and
Ruggero Turra and
Tamas Gal and
Alexander Held},
title = {scikit-hep/uproot4: 4.1.3},
month = sep,
year = 2021,
publisher = {Zenodo},
version = {4.1.3},
doi = {10.5281/zenodo.5539722},
url = {https://doi.org/10.5281/zenodo.5539722}
}
@software{pivarski_jim_2018_6522027,
author = {Pivarski, Jim and
Osborne, Ianna and
Ifrim, Ioana and
Schreiner, Henry and
Hollands, Angus and
Biswas, Anish and
Das, Pratyush and
Roy Choudhury, Santam and
Smith, Nicholas},
title = {Awkward Array},
month = oct,
year = 2018,
note = {If you use this software, please cite it as below.},
publisher = {Zenodo},
version = {1.9.0rc4},
doi = {10.5281/zenodo.6522027},
url = {https://doi.org/10.5281/zenodo.6522027}
}
@Article{harris2020array,
title = {Array programming with {NumPy}},
author = {Charles R. Harris and K. Jarrod Millman and St{\'{e}}fan J.
van der Walt and Ralf Gommers and Pauli Virtanen and David
Cournapeau and Eric Wieser and Julian Taylor and Sebastian
Berg and Nathaniel J. Smith and Robert Kern and Matti Picus
and Stephan Hoyer and Marten H. van Kerkwijk and Matthew
Brett and Allan Haldane and Jaime Fern{\'{a}}ndez del
R{\'{i}}o and Mark Wiebe and Pearu Peterson and Pierre
G{\'{e}}rard-Marchant and Kevin Sheppard and Tyler Reddy and
Warren Weckesser and Hameer Abbasi and Christoph Gohlke and
Travis E. Oliphant},
year = {2020},
month = sep,
journal = {Nature},
volume = {585},
number = {7825},
pages = {357--362},
doi = {10.1038/s41586-020-2649-2},
publisher = {Springer Science and Business Media {LLC}},
url = {https://doi.org/10.1038/s41586-020-2649-2}
}
@software{reback2020pandas,
author = {{The pandas development team}},
title = {Pandas},
month = feb,
year = 2020,
publisher = {Zenodo},
version = {latest},
doi = {10.5281/zenodo.3509134},
url = {https://doi.org/10.5281/zenodo.3509134}
}
@article{Brun:1997pa,
author = "Brun, R. and Rademakers, F.",
editor = "Werlen, M. and Perret-Gallix, D.",
title = "{ROOT: An object oriented data analysis framework}",
doi = "10.1016/S0168-9002(97)00048-X",
journal = "Nucl. Instrum. Meth. A",
volume = "389",
pages = "81--86",
year = "1997"
}

@article{Adri_n_Mart_nez_2016,
doi = {10.1088/0954-3899/43/8/084001},
url = {https://doi.org/10.1088%2F0954-3899%2F43%2F8%2F084001},
year = 2016,
month = {jun},
publisher = {{IOP} Publishing},
volume = {43},
number = {8},
pages = {084001},
author = {S Adri{\'{a}
}n-Mart{\'{\i}}nez and M Ageron and F Aharonian and S Aiello and A Albert and F Ameli and E Anassontzis and M Andre and G Androulakis and M Anghinolfi and G Anton and M Ardid and T Avgitas and G Barbarino and E Barbarito and B Baret and J Barrios-Mart{\'{\i}} and B Belhorma and A Belias and E Berbee and A van den Berg and V Bertin and S Beurthey and V van Beveren and N Beverini and S Biagi and A Biagioni and M Billault and M Bond{\`{\i}} and R Bormuth and B Bouhadef and G Bourlis and S Bourret and C Boutonnet and M Bouwhuis and C Bozza and R Bruijn and J Brunner and E Buis and J Busto and G Cacopardo and L Caillat and M Calamai and D Calvo and A Capone and L Caramete and S Cecchini and S Celli and C Champion and R Cherkaoui El Moursli and S Cherubini and T Chiarusi and M Circella and L Classen and R Cocimano and J A B Coelho and A Coleiro and S Colonges and R Coniglione and M Cordelli and A Cosquer and P Coyle and A Creusot and G Cuttone and A D'Amico and G De Bonis and G De Rosa and C De Sio and F Di Capua and I Di Palma and A F D{\'{\i}}az Garc{\'{\i}}a and C Distefano and C Donzaud and D Dornic and Q Dorosti-Hasankiadeh and E Drakopoulou and D Drouhin and L Drury and M Durocher and T Eberl and S Eichie and D van Eijk and I El Bojaddaini and N El Khayati and D Elsaesser and A Enzenhöfer and F Fassi and P Favali and P Fermani and G Ferrara and C Filippidis and G Frascadore and L A Fusco and T Gal and S Galat{\`{a}} and F Garufi and P Gay and M Gebyehu and V Giordano and N Gizani and R Gracia and K Graf and T Gr{\'{e}}goire and G Grella and R Habel and S Hallmann and H van Haren and S Harissopulos and T Heid and A Heijboer and E Heine and S Henry and J J Hern{\'{a}}ndez-Rey and M Hevinga and J Hofestädt and C M F Hugon and G Illuminati and C W James and P Jansweijer and M Jongen and M de Jong and M Kadler and O Kalekin and A Kappes and U F Katz and P Keller and G Kieft and D Kie{\ss}ling and E N Koffeman and P Kooijman and A Kouchner and V Kulikovskiy and R Lahmann 
and P Lamare and A Leisos and E Leonora and M Lindsey Clark and A Liolios and C D Llorens Alvarez and D Lo Presti and H Löhner and A Lonardo and M Lotze and S Loucatos and E Maccioni and K Mannheim and A Margiotta and A Marinelli and O Mari{\c{s}} and C Markou and J A Mart{\'{\i}}nez-Mora and A Martini and R Mele and K W Melis and T Michael and P Migliozzi and E Migneco and P Mijakowski and A Miraglia and C M Mollo and M Mongelli and M Morganti and A Moussa and P Musico and M Musumeci and S Navas and C A Nicolau and I Olcina and C Olivetto and A Orlando and A Papaikonomou and R Papaleo and G E P{\u{a}}v{\u{a}}la{\c{s}} and H Peek and C Pellegrino and C Perrina and M Pfutzner and P Piattelli and K Pikounis and G E Poma and V Popa and T Pradier and F Pratolongo and G Pühlhofer and S Pulvirenti and L Quinn and C Racca and F Raffaelli and N Randazzo and P Rapidis and P Razis and D Real and L Resvanis and J Reubelt and G Riccobene and C Rossi and A Rovelli and M Salda{\~{n}}a and I Salvadori and D F E Samtleben and A S{\'{a}}nchez Garc{\'{\i}}a and A S{\'{a}}nchez Losa and M Sanguineti and A Santangelo and D Santonocito and P Sapienza and F Schimmel and J Schmelling and V Sciacca and M Sedita and T Seitz and I Sgura and F Simeone and I Siotis and V Sipala and B Spisso and M Spurio and G Stavropoulos and J Steijger and S M Stellacci and D Stransky and M Taiuti and Y Tayalati and D T{\'{e}}zier and S Theraube and L Thompson and P Timmer and C Tönnis and L Trasatti and A Trovato and A Tsirigotis and S Tzamarias and E Tzamariudaki and B Vallage and V Van Elewyck and J Vermeulen and P Vicini and S Viola and D Vivolo and M Volkert and G Voulgaris and L Wiggers and J Wilms and E de Wolf and K Zachariadou and J D Zornoza and J Z{\'{u}}{\~{n}}iga},
title = {Letter of intent for {KM}3NeT 2.0},
journal = {Journal of Physics G: Nuclear and Particle Physics}
}
@article{Ehataht:2020ebp,
author = {Ehat\"aht, Karl},
editor = "Doglioni, C. and Kim, D. and Stewart, G. A. and Silvestris, L. and Jackson, P. and Kamleh, W.",
collaboration = "CMS",
title = "{NANOAOD: a new compact event data format in CMS}",
doi = "10.1051/epjconf/202024506002",
journal = "EPJ Web Conf.",
volume = "245",
pages = "06002",
year = "2020"
}
174 changes: 174 additions & 0 deletions paper/paper.md
@@ -0,0 +1,174 @@
---
title: 'UnROOT: an I/O library for the CERN ROOT file format written in Julia'
tags:
  - Julia
  - HEP
authors:
  - name: Tamás Gál
    orcid: 0000-0001-7821-8673
    affiliation: "1, 2"
  - name: Jerry (Jiahong) Ling
    orcid: 0000-0002-3359-0380
    affiliation: "3"
  - name: Nick Amin
    orcid: 0000-0003-2560-0013
    affiliation: "4"
affiliations:
  - name: Erlangen Centre for Astroparticle Physics
    index: 1
  - name: Friedrich-Alexander-Universität Erlangen-Nürnberg
    index: 2
  - name: Harvard University
    index: 3
  - name: University of California, Santa Barbara
    index: 4
date: 08 October 2021
bibliography: paper.bib
---
# Summary
`UnROOT.jl` is a pure Julia implementation of I/O for CERN ROOT [@Brun:1997pa]
files (`.root`) that is fast, memory-efficient, and composes well with Julia's
high-performance iteration, array, and multi-threading interfaces.

# Statement of need
The High-Energy Physics (HEP) community has been troubled by the two-language
problem for a long time. Often, physicists start prototyping with a `Python`
front-end which glues to a `C/C++/Fortran` back-end. Soon they hit a task which
is extremely hard to express in the columnar (i.e. "vectorized") style that is
normally used with libraries like `numpy` [@harris2020array] or `pandas`
[@reback2020pandas]. This usually leads to either writing `C++` kernels and
interfacing them with `Python`, or porting the prototype to `C++` and
maintaining two code bases including the wrapper code. Both options are
engineering challenges for physicists, who usually have little or no background
in software engineering. Many steps of this process are critical, like
identifying bottlenecks and creating an architecture which is both performant
and maintainable while still being user-friendly and logically structured.
Using a `Python` front-end and dancing across language barriers also makes it
harder to parallelize tasks that are, most of the time, conceptually trivial to
parallelize.

`UnROOT.jl` attempts to solve all of the above by choosing Julia, a
high-performance language with simple and expressive syntax [@Julia]. Julia is
designed to solve the two-language problem in general. This has been studied for
HEP specifically as well [@JuliaPerformance]. Analysis software written in Julia
can freely escape to a `for` loop whenever vectorized-style processing is not
flexible enough, without any performance degradation. At the same time,
`UnROOT.jl` transparently supports multi-threading and multi-processing by
simply providing data structures which are subtypes of `AbstractArray`, the
built-in abstract type for array-like objects. Thanks to multiple dispatch, one
of the main features of Julia, this makes it easy to interface with array
routines from other packages.
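
As a minimal sketch of that escape hatch, assume a hypothetical jagged column of
muon transverse momenta stored as a plain `Vector{Vector{Float32}}`. No
`UnROOT.jl` API is involved here, and the names `muon_pt` and `first_above` are
illustrative only. The explicit loop with an early exit is the kind of logic
that is awkward to express in columnar style:

```julia
# Hypothetical jagged column: one vector of muon pT values per event.
muon_pt = [Float32[], Float32[19.9, 15.3], Float32[42.0], Float32[7.1, 55.5, 3.2]]

# Position of the first muon above `threshold` in each event, or 0 if none.
function first_above(events::AbstractVector{<:AbstractVector}, threshold)
    result = zeros(Int, length(events))
    for (i, pts) in enumerate(events)
        for (j, pt) in enumerate(pts)
            if pt > threshold
                result[i] = j
                break          # early exit: stop scanning this event
            end
        end
    end
    return result
end

first_above(muon_pt, 15)
```

Because the signature only asks for an `AbstractVector` of `AbstractVector`s,
the same function would in principle accept any array-like container with that
shape, which is the kind of composability the paragraph above describes.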

# Features and Functionality

The `ROOT` data format is flexible and mostly self-descriptive. Users can define
their own data structures (C++ classes) which derive from `ROOT` classes and
serialise them into directories, trees and branches. The information about the
deserialisation is written to the output file (therefore: self-descriptive) but
there are some basic structures and constants needed to bootstrap the parsing
process. One of the biggest advantages of the `ROOT` data format is the ability
to store jagged structures like nested arrays of structs with different
sub-array lengths. In high-energy physics, such structures are preferred for
representing e.g. particle interactions and detector responses as signals from
different hardware components, combined into a tree of events.
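
To make the notion of a jagged structure concrete, the following toy type
stores all values in one flat buffer together with offsets marking where each
sub-array ends. This is an illustration under our own assumptions (the name
`Jagged` and its fields are hypothetical) and is neither the `ROOT` on-disk
layout nor the `UnROOT.jl` implementation:

```julia
# A toy jagged array: flat content plus offsets delimiting the sub-arrays.
struct Jagged{T} <: AbstractVector{Vector{T}}
    content::Vector{T}
    offsets::Vector{Int}   # length = number of sub-arrays + 1
end

# The two methods needed for read-only access via the AbstractArray interface.
Base.size(j::Jagged) = (length(j.offsets) - 1,)
Base.getindex(j::Jagged, i::Int) = j.content[j.offsets[i]+1:j.offsets[i+1]]

# Three "events" with 1, 0 and 3 entries respectively.
hits = Jagged(Float32[0.5, 1.5, 2.5, 3.5], [0, 1, 1, 4])

length(hits)   # number of events
hits[2]        # the second event has no entries
sum.(hits)     # generic array code works on the custom type
```

Defining `size` and `getindex` is enough for iteration, broadcasting and
printing to work, since those are implemented generically for `AbstractArray`.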

`UnROOT.jl` understands the core structure of `ROOT` files, and is able to
decompress and deserialise instances of commonly used `ROOT` classes such as
`TH1`, `TH2`, `TDirectory` and `TTree`. All basic C++ types for `TTree`
branches are supported as well, including their nested variants. Additionally,
`UnROOT.jl` provides a way to hook into the deserialisation process of custom
types where the automatic parsing fails. At the time of writing, `UnROOT.jl` is
already used successfully in the data analysis of the KM3NeT neutrino
telescope [@Adri_n_Mart_nez_2016] and the CMS detector [@Ehataht:2020ebp].

Opening and loading a `TTree` lazily -- i.e. without reading the whole data into
memory -- is simple:

```julia
julia> using UnROOT

julia> f = ROOTFile("test/samples/NanoAODv5_sample.root")
ROOTFile with 2 entries and 21 streamers.
test/samples/NanoAODv5_sample.root
Events
"run"
"luminosityBlock"
"event"
"HTXS_Higgs_pt"
"HTXS_Higgs_y"
...

julia> mytree = LazyTree(f, "Events", ["Electron_dxy", "nMuon", r"Muon_(pt|eta)$"])
Row Electron_dxy nMuon Muon_eta Muon_pt
Vector{Float32} UInt32 Vector{Float32} Vector{Float32}

1 [0.000371] 0 [] []
2 [-0.00982] 2 [0.53, 0.229] [19.9, 15.3]
3 [] 0 [] []
4 [-0.00157] 0 [] []
...
```

As seen in the above example, the entries in the columns are multi-dimensional
and jagged. The `LazyTree` object acts as a table which supports sequential
or parallel iteration, selections and filtering based on ranges or masks, and
operations on whole columns:

```julia
for event in mytree
    # ... Operate on event
end

Threads.@threads for event in mytree # multi-threading
    # ... Operate on event
end

mytree.Muon_pt # a column as a lazy vector of vectors
```
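
When iterating in parallel, the loop body should avoid writing to shared
state. The sketch below shows one common pattern, with a plain vector of
vectors standing in for a column (the data and the names `muon_pt` and
`n_hard` are made up for illustration): each iteration writes only to its own
output slot, and the reduction happens after the loop.

```julia
using Base.Threads

# Hypothetical stand-in for a jagged column of muon pT values.
muon_pt = [Float32[], Float32[19.9, 15.3], Float32[42.0], Float32[7.1, 55.5, 3.2]]

# One output slot per event, so threads never write to the same location.
n_hard = zeros(Int, length(muon_pt))

@threads for i in eachindex(muon_pt)
    n_hard[i] = count(>(15), muon_pt[i])   # muons with pT above 15
end

total = sum(n_hard)   # serial reduction after the parallel loop
```

Julia has to be started with more than one thread (for example via the
`--threads` command line option) for the loop to actually run in parallel; with
a single thread the code still runs and gives the same result.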

The `LazyTree` is designed as `<: AbstractArray` which makes it compose well
with the rest of the Julia ecosystem. For example, syntactic loop fusion [^1] or
Query-style tabular manipulations provided by packages like `Query.jl` [^2] work
out of the box, without any additional support code.
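
For readers unfamiliar with syntactic loop fusion, here is a short sketch on
ordinary vectors. The values are made up and no `UnROOT.jl` types are involved;
the point is only that a chain of dotted calls is compiled into a single loop
without intermediate arrays:

```julia
# Made-up transverse momenta and pseudorapidities for three particles.
pt  = Float32[19.9, 15.3, 42.0]
eta = Float32[0.53, 0.229, -1.2]

# Dotted calls fuse into one loop: pz = pt * sinh(eta), element by element.
pz = pt .* sinh.(eta)

# Fusion also applies to in-place updates of a preallocated buffer;
# @. adds a dot to every call and operator in the expression.
p = similar(pt)
@. p = sqrt(pt^2 + pz^2)
```

Since broadcasting is defined for any `AbstractArray`, the same dotted syntax
is expected to apply to array-like objects from other packages as well.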

# Comparison with existing software

This section focusses on the comparison with other existing ROOT I/O solutions
in the Julia universe. One honorable mention, however, is `uproot`
[@jim_pivarski_2021_5539722], a purely Python-based ROOT I/O library which
played (and still plays) an important role in the development of `UnROOT.jl`, as
it is at the time of writing the most complete and best documented ROOT I/O
implementation.

- `UpROOT.jl` is a wrapper for the aforementioned `uproot` Python package and
  uses `PyCall.jl` [^3] as a bridge, which means that it relies on `Python` as a
  glue language. In addition to that, `uproot` itself utilises the C++ library
  `AwkwardArray` [@pivarski_jim_2018_6522027] to efficiently deal with jagged
  data in `ROOT` files. Most of the features of `uproot` are available in the
  Julia context, but there are intrinsic performance and usability drawbacks due
  to the three-language architecture.

- `ROOT.jl` [^4] is one of the oldest Julia `ROOT` packages. It uses C++ bindings to
  directly wrap the `ROOT` framework and therefore is not limited to I/O.
  Unfortunately, the `Cxx.jl` [^5] package which is used to generate the C++ glue
  code does not support Julia 1.4 or later. Its multi-threading features are also
  limited.

# Conclusion

`UnROOT.jl` is an important package in high-energy physics and related
scientific fields where the `ROOT` data format is established, since the ability
to read and parse scientific data is certainly the first mandatory step to open
the window to a programming language and its package ecosystem. `UnROOT.jl` has
demonstrated tree processing speeds at the same level as the `C++` `ROOT`
framework in per-event iteration and as the Python-based `uproot` library in
chunked iteration.

# References


[^1]: https://julialang.org/blog/2017/01/moredots/
[^2]: https://github.com/queryverse/Query.jl
[^3]: https://github.com/JuliaPy/PyCall.jl
[^4]: https://github.com/JuliaHEP/ROOT.jl
[^5]: https://github.com/JuliaInterop/Cxx.jl