[skip-ci][ntuple] specifications.md: document `RNTupleCardinality` #12127

jalopezg-git · 2023-01-27T10:26:46Z

See also the former PR that introduces this projected field: #12008.

Checklist:

updated the docs (if necessary)

jblomer

Many thanks! See comments inline.

tree/ntuple/v7/doc/specifications.md

Moelf · 2023-01-27T13:58:36Z

Tangentially, what's the fundamental reason we need a special interpretation of the same offset array instead of just having a new column for tracking the counter? Can we really not spare one more UInt16 or even UInt8 per event to track it?

(Notice, by using Cardinality, the total counter is restricted to typemax(UInt32) = 4294967295 -> a restriction on average counter * number of events in cluster)

jblomer · 2023-01-28T12:16:29Z

> Tangentially, what's the fundamental reason we need a special interpretation of the same offset array instead of just having a new column for tracking the counter? Can we really not spare one more UInt16 or even UInt8 per event to track it?

The reason is indeed space savings. It's something like 1-4 bytes per collection (uncompressed), so e.g. for nanoAODs with their ~25 collections, my assumption is that not duplicating the counts frees up space for one or a few additional real data columns.

By the way, you don't necessarily need to look at the C++ type to distinguish between offsets and vector lengths. You can reason that a leaf field (role == 0) backed by an [Split]Index[32|64] column should be interpreted as vector lengths and for a collection field (role == 1) it's offsets.
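The rule above can be sketched in a few lines. This is only an illustration, not ROOT code: the function names are hypothetical, and it assumes the index column stores cumulative end offsets that restart at zero at the beginning of each cluster.

```python
from typing import List

ROLE_LEAF = 0        # leaf field (role == 0 in the comment above)
ROLE_COLLECTION = 1  # collection field (role == 1)


def lengths_from_offsets(offsets: List[int]) -> List[int]:
    """Turn cumulative end offsets into per-entry collection sizes."""
    lengths = []
    previous = 0  # assumption: offsets restart at 0 at the start of a cluster
    for end in offsets:
        lengths.append(end - previous)
        previous = end
    return lengths


def interpret_index_column(role: int, offsets: List[int]) -> List[int]:
    """Pick the interpretation of an index column from the field role alone."""
    if role == ROLE_LEAF:
        # Leaf field backed by an index column: read it as vector lengths.
        return lengths_from_offsets(offsets)
    if role == ROLE_COLLECTION:
        # Collection field: the same values are plain offsets.
        return list(offsets)
    raise ValueError(f"unsupported role: {role}")
```

For instance, the offsets `[2, 2, 5, 6]` read through a leaf field give the lengths `[2, 0, 3, 1]`, while a collection field sees them unchanged.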

> (Notice, by using Cardinality, the total counter is restricted to typemax(UInt32) = 4294967295 -> a restriction on average counter * number of events in cluster)

Not sure I see that this is a limitation of the RNTupleCardinality type. It is currently 32bit limited due to the use of Index32 columns, but with [Split]Index64 columns vector lengths can be 64bit, too.
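To put numbers on the two cases, here is a rough sketch. It assumes that the largest value stored is the cumulative element count of a cluster, as the quoted remark suggests; the helper name is hypothetical.

```python
INDEX32_MAX = 2**32 - 1  # largest value an Index32 column can hold
INDEX64_MAX = 2**64 - 1  # largest value an Index64 column can hold


def max_entries_per_cluster(average_length: int, index_max: int) -> int:
    """Largest entry count whose cumulative element count still fits.

    Assumption: the last offset in a cluster equals
    average_length * number_of_entries and must not exceed index_max.
    """
    if average_length <= 0:
        raise ValueError("average_length must be positive")
    return index_max // average_length


# With 10 elements per entry on average:
entries_32 = max_entries_per_cluster(10, INDEX32_MAX)
entries_64 = max_entries_per_cluster(10, INDEX64_MAX)
```

The bound therefore comes from the column width, not from the field type: switching the backing column to 64 bits raises it by a factor of about 2^32.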

Moelf · 2023-01-28T19:28:25Z

> By the way, you don't necessarily need to look at the C++ type to distinguish between offsets and vector lengths. You can reason that a leaf field (role == 0) backed by an [Split]Index[32|64] column

Yes, except that column information is not part of the field information, and (I assumed) the whole point of having fields for types and columns for data is that one can reason about types separately from the primitive data types, which are only a property of the columns.

I thought about this more; it seems to be an outlier of the "leaf field" concept. For almost all leaf fields, you have a column which stores one of the primitive data types, and when you index, you get one element from that column.

Maybe the correct way to think about it is that it's an additional type of field (like vector, union, struct), except it's marked by a combination of role and type instead of the usual "role is enough".

(The "alias" in this context also seems more generalized than in other comparable formats; IIUC, alias usually means you cast the same bytes to a different type (for example, Switch is just a re-interpretation of a UInt64), but here you have to look beyond the current event and do arithmetic -- not a byte re-interpretation. This kind of generalized re-interpretation usually calls for a separate higher-level variation, i.e. field types like vector where we establish "offset" but the "offset" itself is just a plain array.)

jblomer · 2023-01-29T12:35:45Z

> I thought about this more; it seems to be an outlier of the "leaf field" concept. For almost all leaf fields, you have a column which stores one of the primitive data types, and when you index, you get one element from that column.

Note that in general, a field (including leaf field) can have [0..$k$] columns. The first column is the "principal column" that is used to identify one element of that field. For instance, the std::string leaf field has an index and a char column.
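As an illustration of a leaf field with two columns, the sketch below rebuilds strings from an index column and a char column. It is not ROOT code: the function name is hypothetical, and it assumes the index column holds cumulative end offsets into the char column, starting from zero.

```python
from typing import List


def decode_strings(offsets: List[int], chars: bytes) -> List[str]:
    """Rebuild one string per entry from an index column and a char column."""
    strings = []
    start = 0  # assumption: the first string starts at byte 0 of the char column
    for end in offsets:
        # The principal (index) column selects the entry; the char column holds the payload.
        strings.append(chars[start:end].decode("utf-8"))
        start = end
    return strings


# Three entries: "hi", an empty string, and "world".
decoded = decode_strings([2, 2, 7], b"hiworld")
```

Here the index column alone already tells how many entries there are and how long each string is; the char column is only touched to read the bytes.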

Moelf · 2023-01-29T13:21:26Z

Indeed! A leaf field with std::string is special because it's the only thing that "splits" in column records (i.e. the string is basically a jagged vector of bytes, but a role=1 jagged vector splits in field records).

Another outlier we will run into someday is _collection0 with only one child field; this would look identical to a vector (because role=1 and each has only one child field).

Overall, I think there's something powerful about dealing with type schemas without dealing with value information (columns). And I think RNTuple's type system has the potential and capacity (add more numbers to role) to make it happen.

See also: root-project#12008

jalopezg-git · 2023-01-30T12:57:29Z

I have reworded the definition for this field, let me know what you think about it.

jalopezg-git added in:Documentation in:RNTuple labels Jan 27, 2023

jalopezg-git requested review from jblomer and pcanal January 27, 2023 10:26

jalopezg-git self-assigned this Jan 27, 2023

jblomer requested changes Jan 27, 2023

Moelf mentioned this pull request Jan 28, 2023

handle RNTupleCardinality field and add test file JuliaHEP/UnROOT.jl#209

Merged

jalopezg-git force-pushed the ntuple-spec-rntuplecardinality branch from 3d98de4 to b8240d4 Compare January 30, 2023 12:37

[skip-ci][ntuple] specifications.md: document RNTupleCardinality

59dc9e8

See also: root-project#12008

jalopezg-git force-pushed the ntuple-spec-rntuplecardinality branch from b8240d4 to 59dc9e8 Compare January 30, 2023 12:53

jalopezg-git requested a review from jblomer January 30, 2023 12:54

jblomer approved these changes Feb 2, 2023

jalopezg-git merged commit 24ea0de into root-project:master Feb 2, 2023

jalopezg-git deleted the ntuple-spec-rntuplecardinality branch February 2, 2023 11:31
