python bindings for rntuple, implementation of "uproot-cpp" #15

Closed
lgray opened this issue Jul 4, 2023 · 14 comments
Labels
2023 PyHEP.dev 2023

Comments

@lgray
Collaborator

lgray commented Jul 4, 2023

Presently, uproot, the pure-Python implementation of ROOT I/O, is an extremely effective tool connecting the ROOT file format to the data-science and wider scientific Python ecosystems.

However, uproot performs many heavy GIL-bound computations that quickly limit its scaling in multithreaded environments, where we want multiple data streams feeding downstream processing code. This rules out interesting compute topologies like large thread-reentrant histogram filling and imposes the small tax of needing to spawn processes, each with its own Python interpreter (as opposed to threads sharing a single interpreter), to achieve parallel data processing.
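
As a rough illustration of the threads-versus-processes trade-off described above, here is a minimal standard-library sketch. The function `decode_chunk` is a hypothetical stand-in for GIL-bound pure-Python work and is not uproot code; the other helper names are likewise assumptions made for this example.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def decode_chunk(raw: bytes) -> int:
    """Hypothetical GIL-bound work: sum big-endian 4-byte words in pure Python."""
    total = 0
    for i in range(0, len(raw) - 3, 4):
        total += int.from_bytes(raw[i:i + 4], "big")
    return total


def decode_serial(chunks):
    """Baseline: decode every chunk on the calling thread."""
    return [decode_chunk(c) for c in chunks]


def decode_with_threads(chunks, workers=4):
    """Threads share one interpreter, so on a GIL-enabled build only one
    thread runs decode_chunk's bytecode at any given moment."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(decode_chunk, chunks))


def decode_with_processes(chunks, workers=4):
    """Each worker process has its own interpreter and GIL, at the cost of
    process startup and pickling the chunks and results."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(decode_chunk, chunks))
```

Timing `decode_with_threads` against `decode_serial` on a standard (GIL-enabled) CPython build would be expected to show little or no speedup, while `decode_with_processes` can scale with cores once the chunks are large enough to amortize the process overhead. When calling the process variant from a script, keep the call under an `if __name__ == "__main__":` guard.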

Looking to the future: with RNTuple, Feather (which already has a Python-bound C++ implementation for this reason), and other similar high-throughput formats, it seems prudent to develop GIL-friendly Python packages for these HEP-specific data sources.

  • Achieving this would require a whole new implementation of uproot (perhaps focusing only on array io at first) with a cython or C(++) backend
  • RNTuple is being implemented such that its core functionality can be built independent of root and made into python bindings

We should find people interested in pursuing and completing these critical tasks.

@lgray lgray changed the title from "[L. Gray] python bindings for rntuple, uproot-cpp" to "[L. Gray] python bindings for rntuple, implementation of "uproot-cpp"" Jul 4, 2023
@henryiii
Collaborator

henryiii commented Jul 4, 2023

FYI, there's also the per-interpreter GIL being introduced in CPython 3.12. That would allow the launching of sub-interpreters each with their own GIL, but without creating separate processes. It doesn't have a Python API in 3.12, but there will be a PyPI package allowing this to be used from Python code. (The current draft of that module is at https://pypi.org/project/interpreters-3-12/)

Don't know if that changes anything here, but something to keep in mind.

@lgray
Collaborator Author

lgray commented Jul 4, 2023

Thanks - that's good to know, but we'll be needing to deal with people using the previous interpreters for quite some time (basically until numba supports python 3.12).

@lgray lgray changed the title from "[L. Gray] python bindings for rntuple, implementation of "uproot-cpp"" to "python bindings for rntuple, implementation of "uproot-cpp"" Jul 4, 2023
@lgray lgray added topical-group Topic for discussion 2023 PyHEP.dev 2023 and removed topical-group Topic for discussion labels Jul 4, 2023
@jpivarski
Collaborator

I've been in favor of a compiled-but-Python-friendly Uproot for some time, but it's always been too large of a task—this will require dedicated effort and coordination (because I'm assuming more than one developer).

Some questions to ask about such a thing:

  • Perhaps the compiled language should be Julia: UnROOT.jl already exists. Can its Python bindings be developed more?
  • For common use-cases, precompiled is better, and scientific-python/cookie gives us the options of Scikit-Build/pybind11 for C++ and maturin for Rust.
  • We also shouldn't disregard the possibility of doing it in Numba, since that can be partially compiled, partially not, and it has more affinity with Python types, as well as prior expertise among likely developers. In terms of JIT technology, it's no better or worse than the Julia option (it's all LLVM).

The main difference between these three options is which people you want to, or are able to, bring together for this. Option 1 pulls Julia developers more into the Coffea world, option 2 is for people who like blank pages, starting from scratch [1], and option 3 is for pulling it together quickly with the Python + Numba expertise that's already in this area.

Footnotes

  1. Unfortunately, I'm one of those people who likes to start things from scratch, and the Rust option appeals to me. But it's more important to pull together things that already have some momentum. If the end result of this is that the Python and Julia HEP tools get more interchangeable, that's probably the best long-term win.

@jpivarski
Collaborator

Oh, I forgot one (or two) more bullet points:

  • Attempt to do TTree versus jumping right to RNTuple? Or maybe
  • Only cover NanoAOD-like TTree ("-like" means primitives and dynamically sized arrays of primitives) and RNTuple?
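
To make the "NanoAOD-like" shape concrete: a dynamically sized array of primitives is commonly stored as one flat content buffer plus an offsets buffer, which is the layout Awkward Array and Arrow use for list types. The sketch below uses only the standard library; the function names are hypothetical and this is not uproot or UnROOT code.

```python
from array import array


def to_offsets_and_content(events):
    """Flatten per-event lists of floats into (offsets, content) buffers.

    offsets[i]:offsets[i + 1] delimits event i inside the flat content buffer.
    """
    offsets = array("q", [0])  # int64 offsets, always starting at 0
    content = array("f")       # float32 payload, e.g. a per-particle quantity
    for values in events:
        content.extend(values)
        offsets.append(len(content))
    return offsets, content


def event_slice(offsets, content, i):
    """Return the values belonging to event i."""
    return content[offsets[i]:offsets[i + 1]]
```

For example, three events holding two, zero, and one values respectively would give offsets `0, 2, 2, 3` and a content buffer of three floats. A reader restricted to this shape only has to fill two contiguous buffers per branch, which is the kind of work that is straightforward to do in compiled code without holding the GIL.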

@lgray
Collaborator Author

lgray commented Jul 5, 2023

If UnROOT can drop the gil then we're mostly good FWIW.

@jpivarski
Collaborator

@tamasgal and @Moelf (Jerry will be attending): we should learn more about the scope of UnROOT's reading (and writing?) capabilities—what data types does it cover?—and how easy it would be to use it in Python. Can we, for instance, read NanoAOD-like TTrees into Awkward Arrays, possibly through Arrow, in a process controlled by Python?

@lgray
Collaborator Author

lgray commented Jul 5, 2023

I talked this out a bit with @Moelf at CHEP and at zeroth order it seems possible but we both had a lot of questions about GIL-friendliness.

@henryiii
Collaborator

henryiii commented Jul 5, 2023

I still think that by the time something is worked out, 3.12 will be out, a 3.12-compatible numba will probably be out, and you might be able to solve this with current uproot + interpreters-3-12, without rewriting much of anything. It might at least be worth testing with interpreters-3-12 and a 3.12 beta now (assuming you could make an interesting test without numba and maybe numpy).

@Moelf
Collaborator

Moelf commented Jul 5, 2023

> RNTuple is being implemented such that its core functionality can be built independent of root and made into python bindings

From my limited personal experience talking to people, I don't think this is happening soon, and regardless, a ~librntupleio.so won't come with writing capability, so I think we'd better just roll our own.


Regarding what UnROOT.jl can deliver: technology-wise, I am optimistic about covering ~100% of reading (at least for the features that currently exist in the RNTuple spec) [1].

  • Having uproot read RNTuple and then send Arrow batches to Julia is viable (requires 1x more allocation, no big deal if the computation is non-trivial).
  • UnROOT.jl cannot write .root files and lacks the infrastructure (for chunk and TKey allocation, etc.) that uproot has.

From the analysis-adjacent user perspective, once we move to RNTuple (which will be ~100% compatible with Arrow, logically speaking), I see little need for writing out to .root files if the output flows downstream. In fact, there is a huge Arrow ecosystem [2] [3] that people can leverage if they do that.

Footnotes

  1. We already deal with complex RNTuple schema and nanoAOD converted by using ROOT

  2. https://arrow.apache.org/blog/2023/06/26/our-journey-at-f5-with-apache-arrow-part-2/

  3. https://arrow.apache.org/docs/python/api/cuda.html

@lgray
Collaborator Author

lgray commented Jul 5, 2023

Yeah - just switching to parquet / feather after reading the files in is perfectly viable IMO.

It's just a familiarity thing (and a convenience thing), people love TBrowser.

@tamasgal

tamasgal commented Jul 6, 2023 via email

@jpivarski jpivarski added the topical-group Topic for discussion label Jul 7, 2023
@sudo-panda
Collaborator

+1

@redeboer redeboer removed the topical-group Topic for discussion label Jul 11, 2023
@ianna
Collaborator

ianna commented Jul 19, 2023

+1

@nikoladze
Collaborator

+1

9 participants