[skip-ci][ntuple] specifications.md: document RNTupleCardinality
#12127
Many thanks! See comments inline.
Tangentially, what's the fundamental reason we need a special interpretation of the same (Notice, by using |
The reason is indeed space savings. It's something like 1 to 4 bytes per collection (uncompressed), so e.g. for nanoAODs with their ~25 collections, my assumption is that not duplicating the counts leaves room for one or a few additional real data columns. By the way, you don't necessarily need to look at the C++ type to distinguish between offsets and vector lengths. You can reason that a leaf field (
Not sure I see that this is a limitation of the |
Yes, except that column information is not part of field information, and (I assumed) the whole point of having fields for types and columns for data is that one can reason about types separately from primitive data types, which are only a property of the column. I thought about this more, and this seems to be an outlier of the "leaf field" concept. For almost all leaf fields, you have a column which stores one of the primitive data types, and when you index, you get one element from that column. Maybe the correct way to think about it is that it's an additional type of field (like vector, union, struct), except it's marked by a combination or (the "alias" in this context also seems more generalized than in other comparable formats; IIUC, usually, alias means you cast the same bytes to a different type (for example, Switch is just a re-interpretation of a UInt64), but here you have to look beyond the current event and do arithmetic, not a byte re-interpretation. This kind of generalized re-interpretation usually calls for a separate higher-level variation, i.e. field types like |
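To make the "look beyond the current event and do arithmetic" point concrete, here is a minimal sketch of how a collection size could be derived from an offset column. It assumes the offset column stores, for each entry, the running end offset of that entry's collection within a cluster; `CollectionSize` and `endOffsets` are hypothetical names for illustration, not part of the ROOT API.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not a ROOT function.
// Assumption: endOffsets[i] is the running end offset of the collection
// stored in entry i (offsets restart at 0 at the beginning of a cluster).
std::uint64_t CollectionSize(const std::vector<std::uint64_t> &endOffsets, std::size_t entry)
{
   // The start of entry i's collection is the end offset of entry i-1,
   // so reading one size requires looking at the previous entry as well.
   const std::uint64_t begin = (entry == 0) ? 0 : endOffsets[entry - 1];
   return endOffsets[entry] - begin;
}
```

For example, with end offsets `{2, 2, 5}` the three entries would have sizes 2, 0 and 3. The same stored values serve both as offsets and, after a subtraction, as lengths, which is why no second column of counts is needed.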
Note that in general, a field (including leaf field) can have [0..$k$] columns. The first column is the "principal column" that is used to identify one element of that field. For instance, the |
Indeed! a leaf field with Another outlier we will run into someday is Overall, I think there's something powerful about dealing with type schemas without dealing with value information (columns). And I think RNTuple's type system has the potential and capacity (add more numbers to role) to make it happen. |
I have reworded the definition for this field, let me know what you think about it. |
See also the former PR that introduces this projected field: #12008.