-
Notifications
You must be signed in to change notification settings - Fork 790
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
parquet_derive: Match fields by name, support reading selected fields rather than all #6269
Conversation
parquet_derive_test/src/lib.rs
Outdated
@@ -70,12 +70,12 @@ struct APartiallyCompleteRecord { | |||
// If these fields are guaranteed to be valid | |||
// we can load this struct into APartiallyCompleteRecord | |||
#[derive(PartialEq, ParquetRecordWriter, Debug)] | |||
struct APartiallyOptionalRecord { | |||
struct AnOptionalRecord { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you rename this structure and the fields? It seems like the original name APartiallyOptionalRecord
is more accurate and the names of the fields are more specific
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I will restore the struct's name, but the field names must be changed. This is because the fact that now the columns are matched with field name (not index, as previous), so the constraint is:
the decoded struct must have consistent field name with the column name in parquet
Given that the struct are written by APartiallyCompleteRecord
, maybe_i16
should be renamed back to i16
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @double-free -- the explanation and code look good to me. I am sorry for the delay in reviewing
My only question is related to renaming of the struct
in the tests -- could you double check that and respond to my question?
Thanks again
@@ -206,6 +205,12 @@ pub fn parquet_record_reader(input: proc_macro::TokenStream) -> proc_macro::Toke | |||
|
|||
let mut row_group_reader = row_group_reader; | |||
|
|||
// key: parquet file column name, value: column index | |||
let mut name_to_index = std::collections::HashMap::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure this strategy will work for nested structures. However, given all the tests pass we likely don't have coverage for such structures so I am not sure the existing code would work for it either.
Thus I think this is fine for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, agree this is fine for now. As said in the document:
This does not generate readers or writers for arbitrarily nested structures.
This solution is reasonable under this constraint.
Hi, I have replied to those questions. Please review again.
Column readers are generated according to names instead of indices now. |
It appears from https://github.com/search?q=repo%3Aapache%2Farrow-rs%20%22order%20they%20are%22&type=code this is controlled by arrow-rs/parquet_derive/src/lib.rs Line 149 in 69e5e5f
Since I am working to prepare a release now, I will update the docs |
I have updated the comments and merged up from main. I plan to merge this PR once the CI has completed |
Thanks again @double-free |
No problem, I really appreciate your time, and will continue to contribute once I find reasonable improvements. |
Which issue does this PR close?
Closes #6268 .
Rationale for this change
See details in the issue description.
What changes are included in this PR?
Support reading selected fields from a parquet file.
Are there any user-facing changes?
No API changes, but document needs to be updated:
Column readers are generated by name instead of index now.