python bindings for rntuple, implementation of "uproot-cpp" #15

lgray · 2023-07-04T14:12:20Z

Presently, uproot, the pure-python implementation of ROOT I/O, is an extremely effective tool connecting the root file format to the data-science and wider scientific python ecosystems.

However, uproot performs many heavy GIL-bound computations that quickly limit its scaling in multithreaded environments where we want multiple data streams feeding downstream processing code. This forbids interesting compute topologies like large thread-reentrant histogram filling and imposes the small tax of needing to spawn processes, each with its own python interpreter (as opposed to threads sharing a single interpreter), to achieve parallel data processing.
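A toy illustration of the GIL limit described above, using only the standard library. This is not uproot's code: `make_basket` and `decode_basket` are hypothetical names, and the "basket" layout (zlib-compressed big-endian float64 values) is an assumption made for the sketch. The decompression step runs in C and releases the GIL, so threads can overlap there; the pure-Python loop that follows holds the GIL, so threads take turns.

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor


def make_basket(n_values, seed):
    """Build a toy compressed 'basket' of big-endian float64 values (hypothetical layout)."""
    values = [float(seed + i) for i in range(n_values)]
    raw = struct.pack(f">{n_values}d", *values)
    return zlib.compress(raw)


def decode_basket(blob):
    """Decompress one basket and reduce it to a sum of squares."""
    # C code: the zlib module releases the GIL while decompressing,
    # so several threads can run this step concurrently.
    raw = zlib.decompress(blob)
    n = len(raw) // 8
    values = struct.unpack(f">{n}d", raw)
    # Pure-Python loop: this holds the GIL, so threads serialize here.
    return sum(v * v for v in values)


baskets = [make_basket(1000, seed) for seed in range(8)]

# Same work done serially and with a thread pool; the results agree,
# but only the GIL-releasing part can actually run in parallel.
serial = [decode_basket(b) for b in baskets]
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(decode_basket, baskets))
```

The more of `decode_basket` that lives in GIL-releasing compiled code, the closer the thread pool gets to real parallelism without spawning processes.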

Looking to the future: with RNTuple, Feather (which already has a python-bound C++ implementation for this reason), and other similar high-throughput formats, it seems prudent to develop GIL-friendly python packages for these HEP-specific data sources.

Achieving this would require a whole new implementation of uproot (perhaps focusing only on array io at first) with a cython or C(++) backend
RNTuple is being implemented such that its core functionality can be built independent of root and made into python bindings

We should find people interested in pursuing and completing these critical tasks.


henryiii · 2023-07-04T14:17:58Z

FYI, there's also the per-interpreter GIL being introduced in CPython 3.12. That would allow the launching of sub-interpreters each with their own GIL, but without creating separate processes. It doesn't have a Python API in 3.12, but there will be a PyPI package allowing this to be used from Python code. (The current draft of that module is at https://pypi.org/project/interpreters-3-12/)

Don't know if that changes anything here, but something to keep in mind.

lgray · 2023-07-04T14:50:01Z

Thanks - that's good to know, but we'll be needing to deal with people using the previous interpreters for quite some time (basically until numba supports python 3.12).

jpivarski · 2023-07-04T18:28:17Z

I've been in favor of a compiled-but-Python-friendly Uproot for some time, but it's always been too large of a task—this will require dedicated effort and coordination (because I'm assuming more than one developer).

Some questions to ask about such a thing:

1. Perhaps the compiled language should be Julia: UnROOT.jl already exists. Can its Python bindings be developed more?
2. For common use-cases, precompiled is better, and scientific-python/cookie gives us the options of Scikit-Build/pybind11 for C++ and maturin for Rust.
3. We also shouldn't disregard the possibility of doing it in Numba, since that can be partially compiled, partially not, and it has more affinity with Python types, as well as prior expertise among likely developers. In terms of JIT technology, it's no better or worse than the Julia option (it's all LLVM).

The main difference between these three options is which people you want to, or are able to, bring together for this. Option 1 pulls Julia developers more into the Coffea world, option 2 is for people who like blank pages, starting from scratch¹, and option 3 is for pulling it together quickly with the Python + Numba expertise that's already in this area.

¹ Unfortunately, I'm one of those people who like to start things from scratch, and the Rust option appeals to me. But it's more important to pull together things that already have some momentum. If the end result of this is that the Python and Julia HEP tools get more interchangeable, that's probably the best long-term win.

jpivarski · 2023-07-04T21:52:32Z

Oh, I forgot one (or two) more bullet points:

Attempt to do TTree versus jumping right to RNTuple? Or maybe
Only cover NanoAOD-like TTree ("-like" means primitives and dynamically sized arrays of primitives) and RNTuple?
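To make "dynamically sized arrays of primitives" concrete: such a branch can be stored as one flat content buffer plus an offsets buffer, which is also the layout Arrow uses for list arrays. The sketch below uses only the standard library; the helper names `to_offsets_and_content` and `get_event` and the `jet_pt` values are made up for illustration.

```python
from array import array


def to_offsets_and_content(events):
    """Flatten a list of variable-length lists into (offsets, content) buffers."""
    offsets = array("q", [0])   # int64 offsets, one more entry than events
    content = array("d")        # float64 values of all events, back to back
    for values in events:
        content.extend(values)
        offsets.append(len(content))
    return offsets, content


def get_event(offsets, content, i):
    """Recover the values of event i by slicing between consecutive offsets."""
    return content[offsets[i]:offsets[i + 1]].tolist()


# Three events with 2, 0, and 3 jets respectively (illustrative numbers).
jet_pt = [[50.1, 32.0], [], [27.5, 21.0, 20.2]]
offsets, content = to_offsets_and_content(jet_pt)
```

A reader restricted to this case only ever has to produce flat primitive buffers and integer offsets, which is a much smaller target than general TTree streamers.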

lgray · 2023-07-05T01:40:48Z

If UnROOT can drop the GIL then we're mostly good FWIW.

jpivarski · 2023-07-05T14:52:49Z

@tamasgal and @Moelf (Jerry will be attending): we should learn more about the scope of UnROOT's reading (and writing?) capabilities—what data types does it cover?—and how easy it would be to use it in Python. Can we, for instance, read NanoAOD-like TTrees into Awkward Arrays, possibly through Arrow, in a process controlled by Python?

lgray · 2023-07-05T15:15:12Z

I talked this out a bit with @Moelf at CHEP and at zeroth order it seems possible but we both had a lot of questions about GIL-friendliness.

henryiii · 2023-07-05T15:34:40Z

I still think that by the time something is worked out, 3.12 will be out, a 3.12-compatible numba will probably be out, and you might be able to solve this with current uproot + interpreters-3-12, without rewriting much of anything. It might at least be worth testing with interpreters-3-12 and a 3.12 beta now (assuming you could make an interesting test without numba & maybe numpy).

Moelf · 2023-07-05T16:09:01Z

> RNTuple is being implemented such that its core functionality can be built independent of root and made into python bindings

From my limited personal experience around people, I don't think this is happening soon, and regardless, a ~librntupleio.so won't come with writing capability, so I think we better just roll our own.

Regarding what UnROOT.jl can deliver, technology-wise I am optimistic about covering ~100% of reading (at least for the features that currently exist in the RNTuple Spec, https://github.com/root-project/root/blob/master/tree/ntuple/v7/doc/specifications.md). ¹

Having uproot read the RNTuple and then send Arrow batches to Julia is viable (it requires 1x more allocation, no big deal if the computation is non-trivial).
UnROOT.jl cannot write .root files and lacks the infrastructure (for chunk, TKey allocation etc.) that uproot has.

From the analysis-adjacent user perspective, once we move to RNTuple (which will be ~100% compatible with Arrow, logically speaking), I see little need for writing out to .root files if the output flows downstream; in fact, there is a huge Arrow ecosystem ² ³ that people can leverage if they do that.

¹ We already deal with complex RNTuple schema and nanoAOD converted by using ROOT
² https://arrow.apache.org/blog/2023/06/26/our-journey-at-f5-with-apache-arrow-part-2/
³ https://arrow.apache.org/docs/python/api/cuda.html

lgray · 2023-07-05T21:34:24Z

Yeah - just switching to parquet / feather after reading the files in is perfectly viable IMO.

It's just a familiarity thing (and a convenience thing), people love TBrowser.

tamasgal · 2023-07-06T19:54:13Z

The traditional (read: before-RNTuple) ROOT support in UnROOT.jl is mostly limited to primitive types, (multiply nested) std containers and a bit of extra streamer logic for the usual suspects. I am already planning to rewrite the core parser of UnROOT since currently everything is a bit too static. Julia has great metaprogramming features which would allow a much better design, so a next development iteration cycle is definitely due. Custom streamers need a bit too much care right now (unless the branch splitting is high enough). If I only had more time... ;)

Just my two cents: while I recognise all the huge benefits of RNTuple, I guess the transition phase will be fairly long (my first rough guess is that it will easily exceed 5 years) and support for TTree-based formats will be mandatory for a very long time. A tiny example in my environment is KM3NeT, which will definitely not change the low-level data format and will stick to ROOT TTrees for the next 20+ years. We have much more freedom in higher-level formats of course, where we also utilise HDF5 and Arrow-based ones ;) That being said, as Jerry emphasised, writing ROOT files will very likely become more and more obsolete downstream.

Back to the original question from Jim: I find the idea of interfacing UnROOT via Python interesting, but I have very little experience with using Julia in the Python context. A couple of years ago I played around with PyCall.jl to reuse some of our Python libraries in Julia, which was a bit cumbersome due to clashes with Numba JITted functions. As far as I remember, that was the biggest problem, along with a few Cython constructs. Things have evolved since then for sure. The other way around is of course a different story. Anyway, I'll try to free up some time and play around with Julia from within Python, but I am happy if someone else explores that as well.


sudo-panda · 2023-07-07T18:31:38Z

+1

ianna · 2023-07-19T14:38:22Z

+1

nikoladze · 2023-07-21T13:42:50Z

+1

lgray changed the title ~~[L. Gray] python bindings for rntuple, uproot-cpp~~ [L. Gray] python bindings for rntuple, implementation of "uproot-cpp" Jul 4, 2023

lgray changed the title ~~[L. Gray] python bindings for rntuple, implementation of "uproot-cpp"~~ python bindings for rntuple, implementation of "uproot-cpp" Jul 4, 2023

lgray added topical-group Topic for discussion 2023 PyHEP.dev 2023 and removed topical-group Topic for discussion labels Jul 4, 2023

jpivarski added the topical-group Topic for discussion label Jul 7, 2023

redeboer removed the topical-group Topic for discussion label Jul 11, 2023

jpivarski mentioned this issue Aug 9, 2023

JuliaArrays: VectorOfArrays JuliaHEP/AwkwardArray.jl#3

Closed

jpivarski closed this as completed Jan 25, 2024

jpivarski mentioned this issue Aug 1, 2024

Trying to parse the root file using rust and asking some questions. scikit-hep/uproot5#1261

Open
