-
Notifications
You must be signed in to change notification settings - Fork 177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add lookup tables to StructType #12
Conversation
Can we add a test with a map, list, and nested record? This is the test in Python: https://github.com/apache/iceberg/blob/f967317de564ec4af28d962fa48a26844b71007f/python/tests/test_schema.py#L155-L178 This is a great fixture that we use in Python for these kind of tests: https://github.com/apache/iceberg/blob/master/python/tests/conftest.py#L129-L177 |
How are nested structs typically handled? Are they flattened on creation? Does it make sense to flatten them in the constructor? |
@JanKaul They are not flattened. This is why we use the visitor pattern, so we can easily traverse the structure.
Results in:
Let me know if this answers your question. |
Ah okay, I think I had a different idea for the lookup tables. Generally you could think of different lookup tables, these could be:
I've currently implemented 2 & 3 with the goal to speed up the access of a structfield. I didn't think of nested structs and that you need to obtain the full field name. That opens the question how best to implement this. We could use two lookup tables (field id -> full field name, field name -> storage index) which means that an access by field id has to always perform a field name lookup first. |
Thinking of it, I believe it is fine for now. The nested types are more applicable to a schema, rather than a StructType (they are similar, both have a fields object). |
Agree that it's fine for now, to implement 1 we many need some advanced data structures such as prefix tree, anyway we can postpone the optimization later. |
#[serde(rename = "struct", tag = "type")] | ||
pub struct StructType { | ||
/// Struct fields | ||
fields: Vec<StructField>, | ||
/// Lookup for index by field id | ||
#[serde(skip_serializing)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to say this is why I want an in-memory represents 😢. Since we have already reached a consensus in the previous discussion, I will not initiate a new round here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to force this serialization/deserialization design. If we don't have a consensus on the design, I think we should get back to the discussion. It doesn't make sense to merge something that might get changed later on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per our discussion, if we want to avoid such annotation, we could implement a serializer for StructType
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we want to avoid such annotation, we could implement a serializer for
StructType
.
Currently, it appears that there is no immediate need for this. Let's proceed and observe how the situation develops.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, it appears that there is no immediate need for this. Let's proceed and observe how the situation develops.
Totally agree, for simple cases like this, it doesn't make code more difficult to read, so we can go on with it.
+1. And I think maintaining lookup table of full field name from a field id will cause that renaming of a field will affect multiple place (cmiiw)🤔.
if b be rename, we also need to rename full name in struct posistion and struct a.
But if the full name is only maintained in schema, I guess there is not this problem. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
This PR adds lookup tables for field ids and field names to the StructType. This speeds up the field access for structs with a lot of fields.