Working with LazyTree as a DataFrame #73

Closed
peremato opened this issue Aug 27, 2021 · 24 comments

@peremato
Member

I'm new to Julia and want to see its potential for HEP computing. For this, I am playing with a TTree in UnROOT to see how it compares with the equivalent job in ROOT/RDataFrame.
I understand that a LazyTree is not a DataFrame, since in general the full TTree cannot fit in memory; however, it provides the basic functionality of an AbstractDataFrame. Naively, I thought that I could start with a LazyTree and then, once it is heavily reduced, convert it to a DataFrame. But this does not seem to work. Perhaps you can give me some hints.
The code I am trying to execute is as follows:

using UnROOT
f = ROOTFile("/eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root")
events = LazyTree(f, "Events");
df = filter([:nMuon,:Muon_charge] => (n,c) -> n == 2 && c[1] != c[2], events[1:1000000])

and the error I get is:

MethodError: no method matching getindex(::LazyTree{TypedTables.Table{NamedTuple{(:Muon_phi, :nMuon, :Muon_pt, :Muon_eta, :Muon_charge, :Muon_mass), Tuple{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}, UInt32, SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}, SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}, SubArray{Int32, 1, Vector{Int32}, Tuple{UnitRange{Int64}}, true}, SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}}, 1, NamedTuple{(:Muon_phi, :nMuon, :Muon_pt, :Muon_eta, :Muon_charge, :Muon_mass), Tuple{Vector{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}, Vector{UInt32}, Vector{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}, Vector{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}, Vector{SubArray{Int32, 1, Vector{Int32}, Tuple{UnitRange{Int64}}, true}}, Vector{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}}}}}, ::BitVector, ::Colon)
Closest candidates are:
  getindex(::AbstractDataFrame, ::Integer, ::Colon) at /eos/user/m/mato/.julia/packages/DataFrames/vuMM8/src/dataframerow/dataframerow.jl:210
  getindex(::AbstractDataFrame, ::Integer, ::Union{Colon, Regex, AbstractVector{T} where T, All, Between, Cols, InvertedIndex}) at /eos/user/m/mato/.julia/packages/DataFrames/vuMM8/src/dataframerow/dataframerow.jl:208
  getindex(::AbstractDataFrame, ::CartesianIndex{2}) at /eos/user/m/mato/.julia/packages/DataFrames/vuMM8/src/other/broadcasting.jl:3
  ...

Stacktrace:
 [1] #filter#88
   @ ~.julia/packages/DataFrames/vuMM8/src/abstractdataframe/abstractdataframe.jl:1040 [inlined]
 [2] filter(::Pair{Vector{Symbol}, var"#5#6"}, df::LazyTree{TypedTables.Table{NamedTuple{(:Muon_phi, :nMuon, :Muon_pt, :Muon_eta, :Muon_charge, :Muon_mass), Tuple{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}, UInt32, SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}, SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}, SubArray{Int32, 1, Vector{Int32}, Tuple{UnitRange{Int64}}, true}, SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}}, 1, NamedTuple{(:Muon_phi, :nMuon, :Muon_pt, :Muon_eta, :Muon_charge, :Muon_mass), Tuple{Vector{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}, Vector{UInt32}, Vector{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}, Vector{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}, Vector{SubArray{Int32, 1, Vector{Int32}, Tuple{UnitRange{Int64}}, true}}, Vector{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}}}}})
   @ DataFrames ~.julia/packages/DataFrames/vuMM8/src/abstractdataframe/abstractdataframe.jl:1035
 [3] macro expansion
   @ In[5]:3 [inlined]
 [4] top-level scope
   @ timing.jl:210
 [5] eval
   @ ./boot.jl:360 [inlined]
 [6] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094

If instead of passing a LazyTree I construct a DataFrame, then it works nicely. However, I think this means that the full TTree has been copied into memory.

df = filter([:nMuon,:Muon_charge] => (n,c) -> n == 2 && c[1] != c[2], DataFrame(events[1:1000000]))
@Moelf
Member

Moelf commented Aug 27, 2021

actually, if you do copycols=false, it's a lazy DataFrame I believe, but notice the [1:1000000] slice makes a concrete table first. @aminnj has shown me that packages in the ecosystem just work, such as https://github.com/queryverse/Query.jl

julia> @time @from evt in t2 begin
         @where length(evt.Jet_pt) > 6
         @let njets=length(evt.Jet_pt)
         @let njets40=sum(evt.Jet_pt.>40)
         @select {njets=njets, njets40, evt.MET_pt}
         @collect DataFrame
       end
  0.322372 seconds (1.02 M allocations: 57.209 MiB, 99.82% compilation time)
831×3 DataFrame
 Row │ njets  njets40  MET_pt
     │ Int64  Int64    Float32
─────┼──────────────────────────
   1 │     7        1  28.2933
   2 │     7        4  38.6646
   3 │     7        6  75.2459
   4 │     8        4  49.9447
   5 │     9        3  80.1362

@Moelf
Member

Moelf commented Aug 27, 2021

to show that DataFrame can be lazy just fine:

julia> @time const mytree = LazyTree(ROOTFile("./test/samples/NanoAODv5_sample.root"), "Events", r"nMuon");
  0.067813 seconds (296.65 k allocations: 25.716 MiB)

julia> @time DataFrame(mytree; copycols=false);
  0.000018 seconds (20 allocations: 1.375 KiB)

julia> @time df = filter(evt->evt.nMuon==2, DataFrame(mytree; copycols=false));
  0.108670 seconds (383.77 k allocations: 19.429 MiB, 17.67% gc time, 98.96% compilation time)

we could probably delegate filter() to the inner table:

julia> inner = getfield(mytree, :treetable);

julia> filter(evt->evt.nMuon==2, inner)
Table with 2 columns and 175 rows:
      Jet_nMuons                                             nMuon
    ┌─────────────────────────────────────────────────────────────
 1  │ Int32[0, 1, 0, 0, 1, 0, 0]                             2
 2  │ Int32[0, 0, 1, 0, 0, 0, 0, 0, 0]                       2
 3  │ Int32[0, 0, 0, 1, 0, 1, 0]                             2

the second one simply goes through Julia's Base filter (https://github.com/JuliaLang/julia/blob/1bbba21aa258a99d1ecf1168d72d64cb402fd054/base/array.jl#L2523): a first pass collects an index mask, and then the original table is indexed with that mask. Not the most efficient way, but it works.
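A minimal sketch of that delegation, assuming the inner TypedTables.Table really is stored in the :treetable field as above:

# sketch: forward filter() on a LazyTree to the underlying TypedTables.Table
function Base.filter(f, t::LazyTree)
    return filter(f, getfield(t, :treetable))
end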

@Moelf
Member

Moelf commented Aug 27, 2021

however, it provides the basic functionality of an AbstractDataFrame

a bit tangential here, but we don't actually need DataFrames.jl for any "real" functionality, mainly the printing and some interfaces I got tired of hunting down. Later we could drop the DataFrames.jl dependency entirely, since you can always call DataFrame(mytree; copycols=false) to get a DataFrame without blowing up RAM.

I haven't investigated what the performance pitfalls are once you wrap it inside a DataFrame.

@peremato
Member Author

Thanks @Moelf for the various hints. The copycols=false is very useful. I'll have a look at Query.jl. It looks very interesting, although I do not know how to do the reduction (i.e. filling some histograms) without materializing a full Array or DataFrame with all the values that I would like to histogram.

@Moelf
Member

Moelf commented Aug 27, 2021

For filling histograms, you will likely have more than a few to fill anyway, so jamming all the code into a single columnar command is hard to maintain later. A loop with https://github.com/Moelf/FHist.jl will work just fine:

using FHist

h = Hist1D(Int; bins=-3:0.5:3)

for evt in mytree
    if evt.nMuon == 2
        # count the jets that contain at least one muon
        push!(h, count(>(0), evt.Jet_nMuons))
    end
end

If you want to do something "aside" in the middle of a query, you probably need to ask the Query.jl people, since most of the DataFrame community (Julia or Python) doesn't have this need, I guess.

Also, I believe that the LazyBranch behind mytree.nMuon works with a histogram without blowing up memory, but that's probably not too useful anyway.
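For example, a minimal sketch of filling directly from a single branch (the bin range 0:10 is a made-up assumption):

using UnROOT, FHist

# iterate a single LazyBranch lazily, basket by basket
h_nmuon = Hist1D(Int; bins=0:10)
for n in mytree.nMuon
    push!(h_nmuon, n)
end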

@Moelf
Member

Moelf commented Aug 27, 2021

If you have a particular RDataFrame use-case in mind that is nice AND scalable, such that you can do a full-blown analysis with just RDataFrame query-style commands, I'm happy to match the semantics with some utility macros. Otherwise, I feel like RDataFrame is just for quick checks (which are probably easier to do in Julia[1] with the whole DataFrame ecosystem available), and eventually one starts writing the analysis loop anyway.

[1]: albeit a bit slower, but does that matter in the interactive exploration stage of an analysis?

@peremato
Member Author

First, I have set myself the challenge of filling a histogram with all the events in the ROOT file (~65M) without explicitly writing the event loop. The following code blows up if I use the full file:

using UnROOT, Query, FHist, StatsBase, DataFrames

function invariantMass(pt, eta, phi, mass)
   x, y, z = pt .* cos.(phi), pt .* sin.(phi), pt .* sinh.(eta)
   e = sqrt.( x.^2 + y.^2 + z.^2 + mass.^2)
   sqrt(sum(e)^2 - sum(x)^2 - sum(y)^2 - sum(z)^2) 
end

events = LazyTree(ROOTFile("/eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root"), "Events")

@time df = events |>
    #@take(1000000) |>
    @filter(_.nMuon == 2) |>
    @filter(_.Muon_charge[1] != _.Muon_charge[2]) |>
    @map({inv_mass = invariantMass(_.Muon_pt, _.Muon_eta, _.Muon_phi, _.Muon_mass)}) |>
    DataFrame
h = Hist1D(log10.(df.inv_mass), -1.:0.005:2.5)

Second, we are developing RDataFrame precisely so as not to have the explicit event loop, and to ensure that the different selections, reductions (filling histograms), etc. are all executed within a single pass, thus optimizing I/O, with the possibility behind the scenes of further optimization, parallel execution, data caching, etc. I'll provide you with an example of some complexity to see how this could be written in Julia.

@Moelf
Member

Moelf commented Aug 27, 2021

yes, your example makes a copy at the end, because the query produces a collection of NamedTuples.

not having the explicit event loop

I doubt a real-world analysis expressed as an RDataFrame operation chain will be less complex than a loop in Julia. Can you, for example, show how to find the best Z-candidate lepton pair among both muons and electrons (bonus: if each flavor has to pass its own specific quality cut) and fill a histogram that shows a Z peak? I find it difficult to imagine how to do that without loops.

The point is that a for-loop in Julia is trivial: no variable-binding boilerplate, and users can parallelize it however they like, explicitly.
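For example, a minimal sketch of explicit parallelization with one histogram per thread, merged at the end (mass_of is a hypothetical helper, and I'm assuming FHist histograms can be summed with +):

using Base.Threads
using UnROOT, FHist

# one histogram per thread, so the hot loop needs no locking
hists = [Hist1D(Int; bins=-1.0:0.005:2.5) for _ in 1:nthreads()]
@threads for i in eachindex(events)
    evt = events[i]
    evt.nMuon == 2 || continue
    sum(evt.Muon_charge) != 0 && continue
    # mass_of is a hypothetical helper computing the dimuon invariant mass
    unsafe_push!(hists[threadid()], log10(mass_of(evt)))
end
h = reduce(+, hists)  # assuming FHist supports histogram addition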

@Moelf
Member

Moelf commented Aug 27, 2021

btw, there's a CMS NanoAOD file in ./test/samples of this repo; it's easier to have a common test sample so we can copy-paste and run. (Accessing open data turns out to be more complicated than I thought.)

@peremato
Member Author

You can download it from here: http://opendata.web.cern.ch/record/12341
I am using it from SWAN with the Bleeding Edge software stack.

@Moelf
Member

Moelf commented Aug 27, 2021

I would do your example this way:

julia> using UnROOT, LVCyl, FHist

julia> const events = LazyTree(ROOTFile("Run2012BC_DoubleMuParked_Muons.root"), "Events");

julia> const h = Hist1D(Int; bins = -1.:0.005:2.5);

julia> @time for evt in events
           evt.nMuon != 2 && continue
           sum(evt.Muon_charge) != 0 && continue
           lv1, lv2 = LorentzVectorCyl.(evt.Muon_pt, evt.Muon_eta, evt.Muon_phi, evt.Muon_mass)
           inv_mass = fast_mass(lv1, lv2)
           #we know we're only filling from one thread
           unsafe_push!(h, log10(inv_mass))
       end
 24.747783 seconds (24.62 M allocations: 31.011 GiB, 2.13% gc time) #note: this is cumulative allocation

LVCyl is from https://github.com/JuliaHEP/LVCyl.jl, which also provides fast_mass.

julia> length(events)
61540413

julia> ans / 24.74
2.487486378334681e6

2.5 MHz isn't too bad I hope.

@peremato
Member Author

A couple of examples with some complexity using RDataFrame and accessing Open Data.
https://github.com/cms-opendata-analyses/HiggsTauTauNanoAODOutreachAnalysis
https://root.cern/doc/master/df103__NanoAODHiggsAnalysis_8py.html
It would be nice to compare them with a pure Julia equivalent.

@Moelf
Member

Moelf commented Aug 30, 2021

Those are way too long, and I can't quite figure out where some of the functions in the strings come from, for example pt_cuts() in the second link (a .h file somewhere?).

Is it not possible to have a single-objective example? Each of them has too many histograms woven in; an exact comparison will be impossible because there are too many non-event-loop-related costs.

@tamasgal
Member

Indeed, the pure Julia equivalent would be completely different. In the example above, there are many high-level cuts (btw, the string-evaluation-style coding is really awful for clarity) which iterate multiple times over and over again. In Julia, you'd likely do it straightforwardly in a simple loop.

Just to show a dumb example: people doing data reduction in Python tend to work with Pandas DataFrames à la:

>>> df = df[df.a < 10]
>>> df = df[(np.abs(df.b) < 1) & (df.c > 0)]
>>> df = df[df.a + df.b < 0.4]
>>> ...

in this example, several temporary datasets (plus intermediate boolean masks) are created, and there are multiple loops over the same elements over and over. Depending on the data size, this can cause enormous memory overhead, and heavy calculations lead to wasted CPU resources too.

In Julia you'd most likely iterate once and build up your final dataset element-wise, step by step. Clear, concise, and with a low memory footprint and efficient CPU usage.

...and the best thing is: you don't have to learn fancy slicing tricks to circumvent the above-mentioned issues; you just code your algorithm straightforwardly.
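For instance, the pandas-style reduction above collapses to a single pass (keeping the dummy field names a, b and c; rows stands in for any iterable of events):

# one pass over the rows; only surviving rows are ever stored
selected = NamedTuple{(:a, :b, :c)}[]
for row in rows
    row.a < 10                    || continue
    (abs(row.b) < 1 && row.c > 0) || continue
    row.a + row.b < 0.4           || continue
    push!(selected, (a = row.a, b = row.b, c = row.c))
end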

In my opinion, it would be an easy exercise to work through https://root.cern/doc/master/df103__NanoAODHiggsAnalysis_8py.html and implement it in Julia, but it requires time, which probably should be invested by someone who needs it ;)

@peremato
Member Author

Those are way too long, and I can't quite figure out where some of the functions in the strings come from, for example pt_cuts() in the second link (a .h file somewhere?).

The header file is https://github.com/root-project/root/blob/master/tutorials/dataframe/df103_NanoAODHiggsAnalysis_python.h

Is it not possible to have a single-objective example? Each of them has too many histograms woven in; an exact comparison will be impossible because there are too many non-event-loop-related costs.

I understand it is quite a lot of work, but comparing the performance, code clarity, debuggability, concurrency, etc. of an analysis like this, from the ROOT input file to the final set of histograms, is worthwhile in order to give the HEP community convincing evidence of the advantages of Julia versus C++/Python.

@peremato
Member Author

Indeed, the pure Julia equivalent would be completely different. In the example above, there are many high-level cuts (btw, the string-evaluation-style coding is really awful for clarity) which iterate multiple times over and over again. In Julia, you'd likely do it straightforwardly in a simple loop.

In RDataFrame, all the actions (definitions, filters, etc.) are lazy. There is indeed a single event loop over the full analysis. I do agree that the string-evaluation-style coding is really awful for clarity. This is the main reason why Julia could be very attractive, if reasonable performance is preserved.

Just to show a dumb example; people doing data reduction in Python tend to work with Pandas DataFrames à la:

>>> df = df[df.a < 10]
>>> df = df[(np.abs(df.b) < 1) & (df.c > 0)]
>>> df = df[df.a + df.b < 0.4]
>>> ...

in this example, several temporary datasets (plus intermediate boolean masks) are created, and there are multiple loops over the same elements over and over. Depending on the data size, this can cause enormous memory overhead, and heavy calculations lead to wasted CPU resources too.

No, if df is an RDataFrame, then there are neither many loops nor many instantiations of temporary datasets: a single loop and no large temporaries.

In Julia you'd most likely iterate once and build up your final dataset element-wise, step by step. Clear, concise, and with a low memory footprint and efficient CPU usage.
...and the best thing is: you don't have to learn fancy slicing tricks to circumvent the above-mentioned issues; you just code your algorithm straightforwardly.

Yes, this is clear. But what you would like is to be able to partition the loop over many cores, workers, nodes, etc. in a fairly easy or transparent manner.
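A minimal sketch of what that partitioning could look like in Julia (process_chunk is a hypothetical user function returning a histogram, and I'm assuming histograms can be summed):

# split the event range into chunks and process each as a parallel task
chunks = Iterators.partition(1:length(events), 1_000_000)
tasks = [Threads.@spawn process_chunk(events, r) for r in chunks]
h = reduce(+, fetch.(tasks))  # assuming histograms can be summed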

In my opinion, it would be an easy exercise to work through https://root.cern/doc/master/df103__NanoAODHiggsAnalysis_8py.html and implement it in Julia, but it requires time, which probably should be invested by someone who needs it ;)

I do agree, it requires work, and that time should be invested somehow.

@Moelf
Member

Moelf commented Aug 30, 2021

My fear is that, due to the auxiliary code around making plots and histograms and naming things, it won't be clear to readers, because it's too long to hold in (our) short-term memory.

To make a constructive example, I think comparing the double loop:
https://root.cern/doc/master/df103__NanoAODHiggsAnalysis__python_8h_source.html#l00026

would be clear. And then compare two filters, and compare making two histograms. Basically, comparing 20 filters and making 20 histograms doesn't add to the comparison and, IMHO, only makes the examples impossible to comprehend unless readers already know Julia.

@tamasgal
Member

No, if df is an RDataFrame, then there are neither many loops nor many instantiations of temporary datasets: a single loop and no large temporaries.

Ah OK, sorry for my ignorance. This makes the comparison even more interesting :)

@peremato
Member Author

To make a constructive example, I think comparing the double loop:
https://root.cern/doc/master/df103__NanoAODHiggsAnalysis__python_8h_source.html#l00026
would be clear. And then compare two filters, and compare making two histograms.

Yes, I agree it is a good starting point.

@Moelf
Member

Moelf commented Aug 30, 2021

ok, let me try to write a blog-like article comparing a few "kernel" functions like that, building up to a mini full analysis with two cuts and filling two histograms.

@Moelf
Member

Moelf commented Sep 1, 2021

I'm already writing said document and want to put some key observations here for posterity, without referring to the full document. This is based on the example at https://root.cern/doc/master/df103__NanoAODHiggsAnalysis_8py_source.html, the function defined on line 145. If you look closely, lines 151 and 159 correspond to two functions in the https://root.cern/doc/master/df103__NanoAODHiggsAnalysis__python_8h_source.html header file. In that header file, pay attention to lines 41 and 71.

We see that the invariant mass of the muon pairs is computed multiple times. This is potentially expensive, and we can imagine the same situation happening for more expensive variable computations. Now, there's .Define(), but it is awkward to use, especially if your expensive variable is optional and changes from event to event (e.g. it is not clear which two leptons make up the Z).

You may also think we could merge the two "kernel" functions into one; the problem is that you then need to merge everything in between too, because these two .Filter()s are not called right after one another. Basically, you end up writing a bigger and bigger (would-be) for-loop body when you merge kernels for performance reasons.
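For contrast, a minimal sketch of the same shape as a plain Julia loop, where the expensive quantity is computed once and simply stays in scope (find_z_candidate and mass are hypothetical helpers):

for evt in events
    # expensive, optional quantity: computed exactly once per event
    zcand = find_z_candidate(evt)        # hypothetical helper
    zcand === nothing && continue        # plays the role of the first .Filter()
    # ... any other selections in between ...
    m = mass(zcand)                      # hypothetical helper; re-used freely
    abs(m - 91.1876) < 15 || continue    # a later "filter" re-using the result
    push!(h, m)
end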

@peremato
Member Author

peremato commented Sep 2, 2021

I am not sure I am following, although I do see that there is some recalculation that could perhaps be saved between the filtering steps. The .Define() virtually adds a new column to the dataframe and works event by event. It is equivalent to declaring and setting a temporary variable in the event loop, which can be used in a subsequent filter expression. Instead of 'defining' the indices of the leptons that form the Zs, you could also return the masses between them in addition. In C++ this is not as easy as in Julia; you would need to return a std::pair<>. In my opinion this is not too important at this stage.

@Moelf
Member

Moelf commented Sep 2, 2021

It is equivalent to declaring and setting a temporary variable in the event loop, which can be used in a subsequent filter expression

indeed, but consider this pseudo-code:

def filter_kernel1():
    condition1 = calculation...
    if condition1:
        variable1 = ...
    else:
        return False
    return True

def filter_kernel2():
    calculation_with_early_exit...
    return True

def filter_kernel3():
    # re-using variable1 here means making condition1 and variable1
    # two `.Define()`s, or merging kernels 1, 2, 3 into one big blob
    ...

and in filter_kernel3, depending on whether variable1 exists, you want to re-use it. And there are other Filters between these two. Now the difficulty is that you need to "break up" every kernel at places where you have new variables and conditionals (manual SSA?), or you can write them as one kernel function, but only if you also fold everything logically between kernel1 and kernel3 into the same function as well.

My overall impression is that there seems to be not much gain in clarity (string-based, split between C++ and Python) or in functionality/performance (users still need to know the order of the filters, avoid repeating calculations, split intermediate variables into .Define()s, and break up or merge kernel functions accordingly), compared to a for-loop that is as readable as pseudo-code.

@Moelf
Member

Moelf commented Nov 12, 2022

btw, since:

mutable struct LazyBranch{T,J,B} <: AbstractVector{T}
    f::ROOTFile
...

converting a lazy tree to a DataFrame(; copycols=false) should just work; we could maybe have an integration test one day.
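A minimal sketch of such an integration test, against the sample file mentioned earlier in the thread (the exact assertions are assumptions about the intended behavior):

using Test, UnROOT, DataFrames

t = LazyTree(ROOTFile("./test/samples/NanoAODv5_sample.root"), "Events", r"nMuon")
df = DataFrame(t; copycols = false)
@test nrow(df) == length(t)
@test df.nMuon == [evt.nMuon for evt in t]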

@Moelf Moelf closed this as completed Nov 12, 2022