Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parquet_derive: Match fields by name, support reading selected fields rather than all #6269

Merged
merged 10 commits into from
Aug 31, 2024

Conversation

double-free
Copy link
Contributor

Which issue does this PR close?

Closes #6268 .

Rationale for this change

See details in the issue description.

What changes are included in this PR?

Support reading selected fields from a parquet file.

Are there any user-facing changes?

No API changes, but document needs to be updated:

Derive flat, simple RecordReader implementations. Works by parsing a struct tagged with #[derive(ParquetRecordReader)] and emitting the correct writing code for each field of the struct. Column readers are generated in the order they are defined.

Column readers are generated by name instead of index now.

@double-free double-free changed the title Yy/read pruned parquet parquet_derive: support reading selected fields from parquet file Aug 18, 2024
@@ -70,12 +70,12 @@ struct APartiallyCompleteRecord {
// If these fields are guaranteed to be valid
// we can load this struct into APartiallyCompleteRecord
#[derive(PartialEq, ParquetRecordWriter, Debug)]
struct APartiallyOptionalRecord {
struct AnOptionalRecord {
Copy link
Contributor

@alamb alamb Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you rename this structure and the fields? It seems like the original name APartiallyOptionalRecord is more accurate and the names of the fields are more specific

Copy link
Contributor Author

@double-free double-free Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I will restore the struct's name, but the field names must be changed. This is because the fact that now the columns are matched with field name (not index, as previous), so the constraint is:

the decoded struct must have consistent field name with the column name in parquet

Given that the struct are written by APartiallyCompleteRecord, maybe_i16 should be renamed back to i16.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @double-free -- the explanation and code look good to me. I am sorry for the delay in reviewing

My only question is related to renaming of the struct in the tests -- could you double check that and respond to my question?

Thanks again

@@ -206,6 +205,12 @@ pub fn parquet_record_reader(input: proc_macro::TokenStream) -> proc_macro::Toke

let mut row_group_reader = row_group_reader;

// key: parquet file column name, value: column index
let mut name_to_index = std::collections::HashMap::new();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this strategy will work for nested structures. However, given all the tests pass we likely don't have coverage for such structures so I am not sure the existing code would work for it either.

Thus I think this is fine for now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, agree this is fine for now. As said in the document:

This does not generate readers or writers for arbitrarily nested structures.

This solution is reasonable under this constraint.

@double-free
Copy link
Contributor Author

Thank you @double-free -- the explanation and code look good to me. I am sorry for the delay in reviewing

My only question is related to renaming of the struct in the tests -- could you double check that and respond to my question?

Thanks again

Hi, I have replied to those questions. Please review again.
Btw, can you point me where to update the document of parquet_derive? If we choose to merge the PR, this statement will be outdated:

Column readers are generated in the order they are defined.

Column readers are generated according to names instead of indices now.

@alamb
Copy link
Contributor

alamb commented Aug 31, 2024

Btw, can you point me where to update the document of parquet_derive? If we choose to merge the PR, this statement will be outdated:

Column readers are generated in the order they are defined.

It appears from https://github.com/search?q=repo%3Aapache%2Farrow-rs%20%22order%20they%20are%22&type=code

this is controlled by

/// are generated in the order they are defined.

Since I am working to prepare a release now, I will update the docs

@alamb alamb added the api-change Changes to the arrow API label Aug 31, 2024
@alamb alamb changed the title parquet_derive: support reading selected fields from parquet file parquet_derive: Match fields by name, support reading selected fields rather than all Aug 31, 2024
@alamb
Copy link
Contributor

alamb commented Aug 31, 2024

I have updated the comments and merged up from main. I plan to merge this PR once the CI has completed

@alamb alamb merged commit 3a1f67f into apache:master Aug 31, 2024
11 checks passed
@alamb
Copy link
Contributor

alamb commented Aug 31, 2024

Thanks again @double-free

@double-free
Copy link
Contributor Author

Thanks again @double-free

No problem, I really appreciate your time, and will continue to contribute once I find reasonable improvements.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API parquet-derive
Projects
None yet
Development

Successfully merging this pull request may close these issues.

parquet_derive: support reading selected columns from parquet file
2 participants