parquet_derive: Match fields by name, support reading selected fields rather than all #6269

double-free · 2024-08-18T14:39:49Z

Which issue does this PR close?

Closes #6268 .

Rationale for this change

See details in the issue description.

What changes are included in this PR?

Support reading selected fields from a parquet file.

Are there any user-facing changes?

No API changes, but the documentation needs to be updated. It currently reads:

Derive flat, simple RecordReader implementations. Works by parsing a struct tagged with #[derive(ParquetRecordReader)] and emitting the correct writing code for each field of the struct. Column readers are generated in the order they are defined.

Column readers are now generated by name instead of by index.
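To make the difference concrete, here is a minimal, std-only sketch of positional versus name-based matching. It does not use the parquet crate or the code the macro actually generates; `resolve_by_position` and `resolve_by_name` are hypothetical helpers, and the column names are made up for illustration.

```rust
use std::collections::HashMap;

/// Positional matching: the i-th struct field reads the i-th file column,
/// regardless of what either one is called.
fn resolve_by_position(fields: &[&str]) -> Vec<usize> {
    (0..fields.len()).collect()
}

/// Name-based matching: each struct field reads the file column that has
/// the same name. Fails on the first field with no matching column.
fn resolve_by_name(columns: &[&str], fields: &[&str]) -> Result<Vec<usize>, String> {
    // key: file column name, value: column index
    let name_to_index: HashMap<&str, usize> = columns
        .iter()
        .enumerate()
        .map(|(idx, name)| (*name, idx))
        .collect();

    fields
        .iter()
        .map(|field| {
            name_to_index
                .get(*field)
                .copied()
                .ok_or_else(|| format!("no column named `{field}`"))
        })
        .collect()
}
```

With file columns `a`, `b`, `c` and a struct that declares only `c` and `a`, positional matching pairs the fields with columns 0 and 1, while name-based matching resolves them to columns 2 and 0. A field with no same-named column is an error in this sketch; how the generated reader reports that case is not covered here.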

alamb · 2024-08-28T11:24:00Z

parquet_derive_test/src/lib.rs

@@ -70,12 +70,12 @@ struct APartiallyCompleteRecord {
 // If these fields are guaranteed to be valid
 // we can load this struct into APartiallyCompleteRecord
 #[derive(PartialEq, ParquetRecordWriter, Debug)]
-struct APartiallyOptionalRecord {
+struct AnOptionalRecord {


Why did you rename this structure and the fields? It seems like the original name APartiallyOptionalRecord is more accurate and the names of the fields are more specific

Yes, I will restore the struct's name, but the field names must change. Columns are now matched by field name (not by index, as before), so there is a constraint:

the decoded struct's field names must match the column names in the parquet file

Since the records are written as APartiallyCompleteRecord, maybe_i16 has to be renamed back to i16.
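The constraint above can be checked with a small sketch like the following. It is an illustration only: `unmatched_fields` is a hypothetical helper, not part of parquet_derive, and apart from `i16` and `maybe_i16` the column names used with it are assumptions.

```rust
use std::collections::HashSet;

/// Returns the struct fields that have no same-named column in the file.
/// An empty result means every field can be matched by name.
fn unmatched_fields<'a>(columns: &[&str], fields: &[&'a str]) -> Vec<&'a str> {
    let available: HashSet<&str> = columns.iter().copied().collect();
    fields
        .iter()
        .copied()
        .filter(|field| !available.contains(*field))
        .collect()
}
```

If the file was written with a column called `i16`, a reader struct whose field is still called `maybe_i16` shows up in the returned list, whereas a field called `i16` does not. That is why the test struct's fields were renamed.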

alamb

Thank you @double-free -- the explanation and code look good to me. I am sorry for the delay in reviewing

My only question is related to renaming of the struct in the tests -- could you double check that and respond to my question?

Thanks again

alamb · 2024-08-28T11:26:49Z

parquet_derive/src/lib.rs

@@ -206,6 +205,12 @@ pub fn parquet_record_reader(input: proc_macro::TokenStream) -> proc_macro::Toke

        let mut row_group_reader = row_group_reader;

+        // key: parquet file column name, value: column index
+        let mut name_to_index = std::collections::HashMap::new();


I am not sure this strategy will work for nested structures. However, given all the tests pass we likely don't have coverage for such structures so I am not sure the existing code would work for it either.

Thus I think this is fine for now

Yes, I agree this is fine for now. As the documentation says:

This does not generate readers or writers for arbitrarily nested structures.

This solution is reasonable under this constraint.
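One way the nested-structure concern could show up, sketched under assumptions: if a map like `name_to_index` is keyed only by a column's leaf name, two nested columns that share a leaf name would collide. The dotted paths and the `index_by_leaf_name` helper below are hypothetical and use only the standard library. They do not reproduce the macro's generated code.

```rust
use std::collections::HashMap;

/// Builds a leaf-name -> column-index map from dotted column paths,
/// using only the last path segment as the key.
fn index_by_leaf_name(column_paths: &[&str]) -> HashMap<String, usize> {
    let mut name_to_index = HashMap::new();
    for (idx, path) in column_paths.iter().copied().enumerate() {
        let leaf = path.rsplit('.').next().unwrap_or(path);
        // A later column with the same leaf name overwrites the earlier entry.
        name_to_index.insert(leaf.to_string(), idx);
    }
    name_to_index
}
```

For a flat schema such as `id`, `name`, every column keeps its own entry. For `id`, `a.value`, `b.value`, the map ends up with a single `value` entry pointing at the last column, so one of the nested columns becomes unreachable by name. Under the documented flat-struct restriction this situation does not arise.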

double-free · 2024-08-28T17:17:21Z

Thank you @double-free -- the explanation and code look good to me. I am sorry for the delay in reviewing

My only question is related to renaming of the struct in the tests -- could you double check that and respond to my question?

Thanks again

Hi, I have replied to those questions. Please review again.
By the way, can you point me to where the parquet_derive documentation should be updated? If the PR is merged, this statement will be outdated:

Column readers are generated in the order they are defined.

Column readers are now generated by name instead of by index.

alamb · 2024-08-31T12:48:58Z

By the way, can you point me to where the parquet_derive documentation should be updated? If the PR is merged, this statement will be outdated:

Column readers are generated in the order they are defined.

It appears from https://github.com/search?q=repo%3Aapache%2Farrow-rs%20%22order%20they%20are%22&type=code that this is controlled by line 149 of arrow-rs/parquet_derive/src/lib.rs (at 69e5e5f):

/// are generated in the order they are defined.

Since I am working to prepare a release now, I will update the docs

alamb · 2024-08-31T12:53:43Z

I have updated the comments and merged up from main. I plan to merge this PR once the CI has completed

alamb · 2024-08-31T13:08:40Z

Thanks again @double-free

double-free · 2024-08-31T13:22:50Z

Thanks again @double-free

No problem, I really appreciate your time, and will continue to contribute once I find reasonable improvements.

Ye Yuan added 3 commits August 18, 2024 00:28

support reading pruned parquet

b767af5

add pruned parquet reading test

f9bb8cf

better unit test

3590f4e

github-actions bot added the parquet-derive label Aug 18, 2024

double-free changed the title ~~Yy/read pruned parquet~~ parquet_derive: support reading selected fields from parquet file Aug 18, 2024

double-free added 4 commits August 18, 2024 22:46

update comments

8337c87

deref instead of clone

5b3f21a

do not panic

4e65cc3

copy integer

950f153

alamb reviewed Aug 28, 2024

alamb approved these changes Aug 28, 2024

restore struct name

83f5a74

alamb added the api-change Changes to the arrow API label Aug 31, 2024

alamb changed the title ~~parquet_derive: support reading selected fields from parquet file~~ parquet_derive: Match fields by name, support reading selected fields rather than all Aug 31, 2024

alamb added 2 commits August 31, 2024 08:53

update comments

9570f64

Merge remote-tracking branch 'apache/master' into yy/read-pruned-parquet

89385cf

alamb merged commit 3a1f67f into apache:master Aug 31, 2024
11 checks passed

alamb mentioned this pull request Aug 31, 2024

parquet_derive: support reading selected columns from parquet file #6268

Closed
