Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add size statistics to ParquetMetaData introduced in PARQUET-2261 #5486

Closed
wants to merge 54 commits into from

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Mar 8, 2024

Which issue does this PR close?

Closes #5022

Rationale for this change

Implements new page and column chunk size statistics introduced in PARQUET-2261

What changes are included in this PR?

Adds the necessary structures from the updated parquet.thrift, and adds the code necessary to populate them.

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 8, 2024
@etseidl
Copy link
Contributor Author

etseidl commented Mar 8, 2024

This is my first foray into Rust programming, so I'm not sure everything is done as idiomatically as possible. Submitting this now to get early feedback on my approach. I'm also wondering how much testing to add, and whether this should have a configuration parameter to turn generating the statistics off.

@alamb
Copy link
Contributor

alamb commented Mar 12, 2024

I triggered the CI checks and will try and get to review this over the next few days if no one beats me to it

@Samrose-Ahmed
Copy link
Contributor

Hey @etseidl are you still working on this, otherwise I had some code in a fork, I can try to move this along.

@alamb
Copy link
Contributor

alamb commented Jun 27, 2024

It seems I never looked at this PR but I just skimmed it and it looks reasonable to me -- I think the biggest thing it is currently lacking is tests

@etseidl
Copy link
Contributor Author

etseidl commented Jun 27, 2024

Yeah, sorry, I got sidetracked by other work...and unit tests are the bane of my existence 😅. Let me see if I can get some added in the next few days.

@etseidl
Copy link
Contributor Author

etseidl commented Jun 27, 2024

@alamb I've added two tests so far. I still need to test the repetition level histogram, but I think I've pushed test_roundtrip() in writer.rs as far as I can. Can you look at what I've done so far and see if it's acceptable?

I'm also not entirely sure how to test the new statistics in the page indexes. For now my story is if the ColumnMetaData::SizeStatistics are correct, then the per-page stats in the page indexes must also be correct.

@etseidl etseidl marked this pull request as ready for review June 28, 2024 15:52
@etseidl
Copy link
Contributor Author

etseidl commented Jul 12, 2024

@alamb I think I've addressed most of your concerns. Please take a look when you have a spare moment. I'm going to set this PR back to draft and begin on splitting into at least 2 more PRs (I've already submitted #6045 for the thrift component).

@etseidl etseidl marked this pull request as draft July 12, 2024 17:24
@alamb
Copy link
Contributor

alamb commented Jul 13, 2024

Certainly. I can submit the format changes first, but I'll keep this open for the time being while I address your comments. Should I open new issues for each PR, or is it sufficient to reference #5022 for all three?

I think using one ticket is just fine. No need to make new tickets

@alamb
Copy link
Contributor

alamb commented Jul 13, 2024

Update: what do you think about doing incremental PRs to a feature branch? #6050

@etseidl
Copy link
Contributor Author

etseidl commented Jul 13, 2024

Update: what do you think about doing incremental PRs to a feature branch? #6050

Sounds good to me. We can all use #6050 to discuss deconfliction in the event of conflicts. And we can afford to be a bit more free wheeling too 😄. For instance, I have a branch for changing the offset_index in ParquetMetaData to a Vec<Vec<ParquetOffsetIndex>>. I almost submitted that PR, but held back because the ripple of changes throughout the parquet code was a bit more than I expected. With the feature branch, I'll feel more comfortable going at a slower rate (add the variable length stats gathering, then change the API, then add histograms, then (perhaps) change the column index API to be consistent with the offset index.

Thank you for helping to coordinate all this ❤️

@etseidl
Copy link
Contributor Author

etseidl commented Jul 26, 2024

Superseded by #6105 and others

@etseidl etseidl closed this Jul 26, 2024
@alamb
Copy link
Contributor

alamb commented Jul 26, 2024

🎉

@etseidl etseidl deleted the size_stats branch July 26, 2024 20:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tracking for Parquet size statistics
3 participants