[skip-ci][ntuple] specifications.md: document RNTupleCardinality #12127


jalopezg-git
Collaborator

See also the former PR that introduces this projected field: #12008.

Checklist:

  • updated the docs (if necessary)

Contributor

@jblomer jblomer left a comment


Many thanks! See comments inline.

tree/ntuple/v7/doc/specifications.md (outdated, resolved)
tree/ntuple/v7/doc/specifications.md (outdated, resolved)
@Moelf
Contributor

Moelf commented Jan 27, 2023

Tangentially, what's the fundamental reason we need a special interpretation of the same offset array instead of just having a new column for tracking the counter? Can we really not spare one more UInt16 or even UInt8 per event to track it?

(Note that by using Cardinality, the total count is restricted to typemax(UInt32) = 4294967295, i.e. a restriction on (average count) * (number of events in a cluster).)

@jblomer
Contributor

jblomer commented Jan 28, 2023

> tangentially, what's the fundamental reason we need a special interpretation of the same offset array instead of just having a new column for tracking the counter? can we really not spare one more UInt16 or even UInt8 per event to track it?

The reason is indeed space savings. It's something like 1-4 bytes per collection (uncompressed), so e.g. for nanoAODs with their ~25 collections, my assumption is that not duplicating the counts frees enough space for one or a few additional real data columns.

By the way, you don't necessarily need to look at the C++ type to distinguish between offsets and vector lengths. You can reason that a leaf field (role == 0) backed by a [Split]Index[32|64] column should be interpreted as vector lengths, whereas for a collection field (role == 1) the same column holds offsets.
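
To illustrate the two interpretations, here is a minimal Python sketch. It assumes the index column stores, for each entry, the cumulative end offset of its collection within the cluster; the function name and the sample values are hypothetical and not part of any RNTuple API.

```python
from typing import List


def lengths_from_offsets(offsets: List[int]) -> List[int]:
    """Derive per-entry collection sizes from cumulative end offsets.

    Assumption: offsets[i] is the end offset of entry i's collection,
    counted from the start of the cluster (so entry 0 starts at 0).
    """
    lengths = []
    previous_end = 0
    for end in offsets:
        lengths.append(end - previous_end)
        previous_end = end
    return lengths


# The same column data, read in two ways:
offsets = [2, 2, 5, 9]                 # collection field view: offsets
print(lengths_from_offsets(offsets))   # cardinality view: [2, 0, 3, 4]
```

Note that computing the size of entry i needs the offset of entry i - 1 as well, which is why this is more than a byte-level reinterpretation of a single value.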

> (Notice, by using Cardinality, the total counter is restricted to typemax(UInt32) = 4294967295 -> a restriction on average counter * number of events in cluster)

I'm not sure this is a limitation of the RNTupleCardinality type itself. It is currently limited to 32 bits due to the use of Index32 columns, but with [Split]Index64 columns, vector lengths can be 64-bit, too.
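
The width argument can be made concrete with a small sketch. It assumes that the last offset in a cluster equals the total number of collection elements in that cluster, so the sum of all lengths must fit in the index column's integer width. `fits_index_width` is a hypothetical helper used only for illustration.

```python
from typing import Iterable


def fits_index_width(lengths: Iterable[int], bits: int) -> bool:
    """Check whether the cumulative offset for `lengths` fits in an
    unsigned integer that is `bits` wide (e.g. 32 or 64)."""
    max_offset = (1 << bits) - 1
    return sum(lengths) <= max_offset


# Two very large collections in one cluster: 5e9 elements in total.
lengths = [3_000_000_000, 2_000_000_000]
print(fits_index_width(lengths, 32))  # False: exceeds 4294967295
print(fits_index_width(lengths, 64))  # True
```

Under this assumption the bound applies per cluster, to the product of the average collection size and the number of entries, and it follows from the column width rather than from the field type.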

@Moelf
Contributor

Moelf commented Jan 28, 2023

> By the way, you don't necessarily need to look at the C++ type to distinguish between offsets and vector lengths. You can reason that a leaf field (role == 0) backed by an [Split]Index[32|64] column

Yes, except that column information is not part of the field information, and (I assumed) the whole point of having fields for types and columns for data is that one can reason about types separately from primitive data types, which are only a property of the column.

I thought about this more, and it seems to be an outlier of the "leaf field" concept. For almost all leaf fields, you have a column which stores one of the primitive data types, and when you index, you get one element from that column.

Maybe the correct way to think about it is that it's an additional type of field (like vector, union, struct), except that it's marked by a combination of role and type instead of the usual "role is enough".

(The "alias" in this context also seems more generalized than in other comparable formats. IIUC, an alias usually means you cast the same bytes to a different type (for example, Switch is just a re-interpretation of a UInt64), but here you have to look beyond the current event and do arithmetic, which is not a byte re-interpretation. This kind of generalized re-interpretation usually calls for a separate higher-level variation, i.e. field types like vector, where we establish "offset" but the "offset" itself is just a plain array.)

@jblomer
Contributor

jblomer commented Jan 29, 2023

> I thought about this more, which seems to be an outlier of the "leaf field" concept. For almost all leaf fields, you have a column which stores one of the primitive data types, and when you index, you get one element from that column.

Note that in general, a field (including leaf field) can have [0..$k$] columns. The first column is the "principal column" that is used to identify one element of that field. For instance, the std::string leaf field has an index and a char column.
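
As a rough illustration of a leaf field with two columns, the following Python sketch models a string field as an index column plus a char column. It assumes the index column holds cumulative end offsets into the char column; `read_string` and the sample data are hypothetical and do not reflect the actual ROOT implementation.

```python
from typing import Sequence


def read_string(index_column: Sequence[int], char_column: bytes, i: int) -> str:
    """Read element i of a string field from its two columns.

    Assumption: index_column[i] is the end offset of string i in
    char_column, and string 0 starts at offset 0.
    """
    start = index_column[i - 1] if i > 0 else 0
    end = index_column[i]
    return char_column[start:end].decode("utf-8")


index_column = [5, 5, 10]      # principal column: one value per string
char_column = b"helloworld"    # second column: the concatenated characters
print([read_string(index_column, char_column, i) for i in range(3)])
# ['hello', '', 'world']
```

The principal (index) column has exactly one value per element of the field, which is what lets a reader address element i, while the char column has a variable number of values per element.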

@Moelf
Contributor

Moelf commented Jan 29, 2023

Indeed! A leaf field with std::string is special because it's the only thing that "splits" in column records (i.e. the string is basically a jagged vector of bytes, but a role=1 jagged vector splits in field records).

Another outlier we will run into someday is _collection0 with only one child field; this would look identical to a vector (because both have role=1 and exactly one child field).

Overall, I think there's something powerful about dealing with type schemas without dealing with value information (columns). And I think RNTuple's type system has the potential and capacity (add more numbers to role) to make it happen.

@jalopezg-git
Collaborator Author

I have reworded the definition for this field, let me know what you think about it.

@jalopezg-git jalopezg-git merged commit 24ea0de into root-project:master Feb 2, 2023
@jalopezg-git jalopezg-git deleted the ntuple-spec-rntuplecardinality branch February 2, 2023 11:31
3 participants