Working with LazyTree as a DataFrame #73
actually, if you do

```julia
julia> @time @from evt in t2 begin
           @where length(evt.Jet_pt) > 6
           @let njets = length(evt.Jet_pt)
           @let njets40 = sum(evt.Jet_pt .> 40)
           @select {njets = njets, njets40, evt.MET_pt}
           @collect DataFrame
       end
  0.322372 seconds (1.02 M allocations: 57.209 MiB, 99.82% compilation time)
831×3 DataFrame
 Row │ njets  njets40  MET_pt
     │ Int64  Int64    Float32
─────┼──────────────────────────
   1 │     7        1  28.2933
   2 │     7        4  38.6646
   3 │     7        6  75.2459
   4 │     8        4  49.9447
   5 │     9        3  80.1362
```
to show that:

```julia
julia> @time const mytree = LazyTree(ROOTFile("./test/samples/NanoAODv5_sample.root"), "Events", r"nMuon");
  0.067813 seconds (296.65 k allocations: 25.716 MiB)

julia> @time DataFrame(mytree; copycols=false);
  0.000018 seconds (20 allocations: 1.375 KiB)

julia> @time df = filter(evt -> evt.nMuon == 2, DataFrame(mytree; copycols=false));
  0.108670 seconds (383.77 k allocations: 19.429 MiB, 17.67% gc time, 98.96% compilation time)
```

we could probably delegate:

```julia
julia> inner = getfield(t, :treetable);

julia> filter(evt -> evt.nMuon == 2, inner)
Table with 2 columns and 175 rows:
     Jet_nMuons                         nMuon
   ┌─────────────────────────────────────────
 1 │ Int32[0, 1, 0, 0, 1, 0, 0]         2
 2 │ Int32[0, 0, 1, 0, 0, 0, 0, 0, 0]   2
 3 │ Int32[0, 0, 0, 1, 0, 1, 0]         2
```

The second one simply goes through Julia's Base `filter` (https://github.com/JuliaLang/julia/blob/1bbba21aa258a99d1ecf1168d72d64cb402fd054/base/array.jl#L2523): a first pass collects an index mask, and then the original table is indexed with the mask. Not the most efficient way, but it works.
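The mask-then-index pattern described above can be sketched on plain vectors (a minimal illustration; the `nMuon` vector here is made up and stands in for a table column, not a real `LazyBranch`):

```julia
# Hypothetical stand-in for a table column: an ordinary vector.
nMuon = [1, 2, 3, 2, 0, 2]

# First pass: collect a Boolean mask, one entry per row.
mask = map(n -> n == 2, nMuon)

# Second pass: index the original collection with the mask.
selected = nMuon[mask]
```

`selected` now holds only the rows passing the predicate; for a multi-column table, the same mask would be applied to every column.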
a bit tangential here: actually we don't need it. I haven't investigated what the performance pitfall is once you wrap it inside a
Thanks @Moelf for the various hints. The
For filling histograms, you will likely have more than a few histograms to fill anyway, so jamming all the code into one columnar command is hard to maintain later. See https://github.com/Moelf/FHist.jl. A loop will work just fine:

```julia
h = Hist1D(Int; bins = -3:0.5:3)
for evt in mytree
    if evt.nMuon == 2
        push!(h, count(evt.Jet_nMuons))
    end
end
```

If you want to do something "aside" in the middle of a query, you probably need to ask the Query.jl people, since most of the DataFrame community (Julia or Python) doesn't have this need, I guess. Also, I believe that the
If you have a particular RDataFrame use case in mind that is nice AND scalable, such that you can do a full-blown analysis with just RDataFrame query-y commands, I'm happy to match the semantics with some utility macros. Otherwise I feel like RDataFrame is just for quick checks (which are probably easier to do in Julia[1] with the whole DataFrame ecosystem available), and eventually one starts writing the analysis loop anyway.

[1]: albeit slower by a bit, but does that matter in the interactive exploration stage of an analysis?
First, I have the personal challenge of filling a histogram with all the events in the ROOT file (~65M) without explicitly writing the event loop. The following code blows up if I use the full file:

```julia
using UnROOT, Query, FHist, StatsBase, DataFrames

function invariantMass(pt, eta, phi, mass)
    x, y, z = pt .* cos.(phi), pt .* sin.(phi), pt .* sinh.(eta)
    e = sqrt.(x .^ 2 .+ y .^ 2 .+ z .^ 2 .+ mass .^ 2)
    sqrt(sum(e)^2 - sum(x)^2 - sum(y)^2 - sum(z)^2)
end

events = LazyTree(ROOTFile("/eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root"), "Events")

@time df = events |>
    #@take(1000000) |>
    @filter(_.nMuon == 2) |>
    @filter(_.Muon_charge[1] != _.Muon_charge[2]) |>
    @map({inv_mass = invariantMass(_.Muon_pt, _.Muon_eta, _.Muon_phi, _.Muon_mass)}) |>
    DataFrame

h = Hist1D(log10.(df.inv_mass), -1.:0.005:2.5)
```

Second, we are developing RDataFrame exactly for the purpose of not having the explicit event loop, and of ensuring that the different selections, reductions (filling histograms), etc. are all executed within a single pass, thus optimizing I/O, with further optimizations possible in the back: parallel execution, data caching, etc. I'll provide you with an example of some complexity to see how this could be written in Julia.
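As a quick sanity check of the `invariantMass` helper above (a self-contained sketch: the two back-to-back massless particles are invented for illustration, so the system's mass must equal its total energy):

```julia
# Same helper as in the snippet above.
function invariantMass(pt, eta, phi, mass)
    x, y, z = pt .* cos.(phi), pt .* sin.(phi), pt .* sinh.(eta)
    e = sqrt.(x .^ 2 .+ y .^ 2 .+ z .^ 2 .+ mass .^ 2)
    sqrt(sum(e)^2 - sum(x)^2 - sum(y)^2 - sum(z)^2)
end

# Two massless particles, each with pT = 10, back to back in phi at eta = 0:
# the system has E = 20 and (numerically) zero net momentum, so m = 20.
m = invariantMass([10.0, 10.0], [0.0, 0.0], [0.0, Float64(pi)], [0.0, 0.0])
println(m)  # ≈ 20.0
```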
yes, your example makes a copy at the end, because the
I doubt a real-world analysis expressed as an RDataFrame operation chain will be less complex than a loop in Julia. Can you, for example, show how to find the best Z-candidate lepton pair among both muons and electrons (bonus: if each of them has to pass its own family-specific quality cut) and fill a histogram that shows a Z peak? I find it difficult to imagine how to do that without loops. The point is that a for-loop in Julia is trivial: no binding variables etc., and users can parallelize it however they like, explicitly.
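A plain-loop sketch of that "best Z candidate" selection might look as follows. Everything here is invented for illustration: a single already-merged lepton collection with per-lepton `quality` flags standing in for the family-specific cuts, and a massless-lepton pair-mass formula instead of a full four-vector library:

```julia
const MZ = 91.1876  # Z boson mass in GeV (PDG value)

# Invariant mass of two massless leptons from cylindrical coordinates:
# m^2 = 2 pT1 pT2 (cosh(Δη) - cos(Δφ))
function pair_mass(pt1, eta1, phi1, pt2, eta2, phi2)
    sqrt(2 * pt1 * pt2 * (cosh(eta1 - eta2) - cos(phi1 - phi2)))
end

# Return the index pair whose mass is closest to MZ, among
# opposite-charge pairs where both leptons pass their quality cut.
function best_z_pair(pt, eta, phi, charge, quality)
    best = (0, 0)
    best_dm = Inf
    n = length(pt)
    for i in 1:n, j in i+1:n
        quality[i] && quality[j] || continue    # stand-in for family-specific cuts
        charge[i] + charge[j] == 0 || continue  # opposite charge
        m = pair_mass(pt[i], eta[i], phi[i], pt[j], eta[j], phi[j])
        dm = abs(m - MZ)
        if dm < best_dm
            best_dm, best = dm, (i, j)
        end
    end
    return best
end

# One made-up event with three leptons; leptons 1 and 2 are nearly back to back.
best = best_z_pair([40.0, 45.0, 20.0], [0.1, -0.1, 1.0],
                   [0.0, 3.1, 1.5], [1, -1, -1],
                   [true, true, true])
println(best)  # (1, 2): the opposite-charge pair closest to MZ
```

The log10 of the winning pair's mass could then be pushed into a histogram exactly as in the loop examples elsewhere in this thread.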
btw there's a CMS NanoAOD in the
You can download it from here: http://opendata.web.cern.ch/record/12341
I would do your example this way:

```julia
julia> using UnROOT, LVCyl, FHist

julia> const events = LazyTree(ROOTFile("Run2012BC_DoubleMuParked_Muons.root"), "Events");

julia> const h = Hist1D(Int; bins = -1.:0.005:2.5);

julia> @time for evt in events
           evt.nMuon != 2 && continue
           sum(evt.Muon_charge) != 0 && continue
           lv1, lv2 = LorentzVectorCyl.(evt.Muon_pt, evt.Muon_eta, evt.Muon_phi, evt.Muon_mass)
           inv_mass = fast_mass(lv1, lv2)
           # we know we're only filling from one thread
           unsafe_push!(h, log10(inv_mass))
       end
 24.747783 seconds (24.62 M allocations: 31.011 GiB, 2.13% gc time)
```

(note: this is cumulative allocation) `LVCyl` is from https://github.com/JuliaHEP/LVCyl.jl, which also provides the `fast_mass` function.

```julia
julia> length(events)
61540413

julia> ans / 24.74
2.487486378334681e6
```

That's about 2.5 million events per second.
A couple of examples with some complexity using RDataFrame and accessing Open Data.
Those are way too long, and I can't quite figure out where some of the functions in the strings come from, for example. Is it not possible to have a single-objective example? Each of them has too many histograms weaved in, making an exact comparison impossible because there is too much non-event-loop-related cost.
Indeed, the pure Julia equivalent would be completely different. In the example above, there are many high-level cuts (btw. the string-evaluation-style coding is really awful for clarity) which do multiple iterations over and over again. In Julia, you'd likely do it straightforwardly in a simple loop. Just to show a dumb example: people doing data reduction in Python tend to work with pandas,

```python
>>> df = df[df.a < 10]
>>> df = df[(np.abs(df.b) < 1) & (df.c > 0)]
>>> df = df[df.a + df.b < 0.4]
>>> ...
```

In this example, 10 temporary datasets are created and there are multiple loops over the same elements over and over. Depending on the data size, this can cause enormous memory overhead, and heavy calculations lead to wasted CPU resources too. In Julia you'd most likely iterate once and build up your final dataset element-wise, step by step. Clear, concise, with a low memory footprint and efficient CPU usage. ...and the best thing is: you don't have to learn fancy slicing tricks to circumvent the above-mentioned issues, you just code your algorithm straightforwardly. In my opinion, it would be an easy exercise to work through https://root.cern/doc/master/df103__NanoAODHiggsAnalysis_8py.html and implement it in Julia, but it requires time, which should probably be invested by someone who needs it
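The single-pass alternative described in the text could be sketched like this in Julia (a minimal illustration on made-up columns `a`, `b`, `c`; real data would come from a file, not literals):

```julia
# Made-up columns standing in for a dataset.
a = [1.0, -5.0, 12.0, 0.1]
b = [0.5, -0.9, 0.0, 0.2]
c = [1.0, 2.0, 3.0, -1.0]

# One pass over the rows, applying all three cuts at once;
# no intermediate copies of the dataset are materialized.
kept = Int[]
for i in eachindex(a)
    a[i] < 10 || continue
    (abs(b[i]) < 1 && c[i] > 0) || continue
    a[i] + b[i] < 0.4 || continue
    push!(kept, i)
end
```

`kept` holds the surviving row indices; any derived quantity can be accumulated in the same pass instead of (or in addition to) the index list.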
The header file is https://github.com/root-project/root/blob/master/tutorials/dataframe/df103_NanoAODHiggsAnalysis_python.h
I understand it is quite a lot of work, but comparing the performance, code clarity, debuggability, concurrency, etc. of an analysis like this, from the ROOT input file to the final set of histograms, is worth it to have convincing elements for the HEP community on the advantages of Julia versus C++/Python.
In RDataFrame all the actions (definitions, filters, etc.) are lazy. There is indeed a single event loop over the full analysis. I do agree that the string-evaluation-style coding is really awful for clarity. This is the main reason why Julia could be very attractive, if reasonable performance is preserved.
No, if
Yes, this is clear. But what you would like is to be able to partition the loop over many cores, workers, nodes, etc. in a fairly easy or transparent manner.
I do agree, it requires work and it should be invested somehow.
My fear is that, due to the auxiliary code around making plots and histograms and naming things, it won't be clear to readers, because it's too long to hold in (our) short-term memory. To make a constructive example, I think comparing the double loop would be clear. And then compare two filters, and compare making two histograms. Basically, comparing 20 filters and making 20 histograms doesn't add to the comparison and, IMHO, only makes the examples impossible to comprehend unless users already know Julia.
Ah OK, sorry for my ignorance. This makes the comparison even more interesting.
Yes, I do agree it is good to start with.
ok let me try to write a blog-like article and compare a few "kernel" functions like that and build up to a mini full analysis with two cuts and filling two histograms |
I'm writing the said document already and want to put some key observations here for posterity, without referring to the full document. This is based on the example at https://root.cern/doc/master/df103__NanoAODHiggsAnalysis_8py_source.html, the function defined on line 145. If you look closely, lines 151 and 159 correspond to two functions in the https://root.cern/doc/master/df103__NanoAODHiggsAnalysis__python_8h_source.html header file. In this header file, pay attention to line 41 and line 71. We see that the invariant mass of the pairs of muons is computed multiple times. This is potentially expensive, and we can imagine the same situation happening for more expensive variable computations. Now, there's

You may also think we can merge the two "kernel" functions into one; the problem is that now you need to merge everything in between too, because these two
I am not sure I am following, although I do see that there is some recalculation that could perhaps be saved between the filtering steps. The
indeed, but consider this pseudo-code:

```python
def filter_kernel1():
    condition1 = calculation...
    if condition1:
        variable1 = ...
    else:
        return False
    return True

def filter_kernel2():
    calculation_with_early_exit...
    return True

def filter_kernel3():
    # re-using variable1 here means making condition1 and variable1 two `.Define()`s,
    # or merging 1, 2, 3 into a big blob and in
    ...
```

My overall impression is that there seems to be not much gain in terms of clarity (string-based, split between C++ and Python) or functionality/performance (users still need to know the order of filters, not to repeat calculations, and to split into
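In a plain Julia loop the recomputation problem from the pseudo-code above disappears, because intermediate values simply live in local variables across all the "filter" and "define" steps (a minimal sketch; the `events` vector, the `calc` helper, and the `pt` field are all invented for illustration):

```julia
# Hypothetical events: each is a NamedTuple with a `pt` field.
events = [(pt = 5.0,), (pt = 25.0,), (pt = 40.0,)]

# A stand-in for an expensive derived quantity.
calc(evt) = 2 * evt.pt

threshold = 30.0
selected = Float64[]
for evt in events
    variable1 = calc(evt)              # computed exactly once per event...
    variable1 > threshold || continue  # ...used by the "filter" step...
    push!(selected, variable1)         # ...and re-used by the "define" step, for free
end
```

There is no need to promote `condition1` and `variable1` to separate `.Define()` nodes just so later kernels can see them; ordinary lexical scoping does the caching.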
btw, since:

```julia
mutable struct LazyBranch{T,J,B} <: AbstractVector{T}
    f::ROOTFile
    ...
```

converting a lazy tree to a
I'm new to Julia, wanting to see its potential for HEP computing. For this, I am playing with a TTree with UnROOT to see how it compares with the equivalent job in ROOT/RDataFrame.

I understand that a `LazyTree` is not a `DataFrame`, since in general you cannot fit the full `TTree` in memory; however, it provides the basic functionality of an `AbstractDataFrame`. Naively I thought that I could initially use a `LazyTree` and then, once it is heavily reduced, convert it to a `DataFrame`. But this seems not to be working. Perhaps you can give me some hints.

The code I am trying to execute is as follows:

and the error I get is:

If instead of passing a `LazyTree` I construct a `DataFrame`, then it works nicely. However it means, I think, that I have copied the full TTree in memory.