
Fix page size estimation in Parquet writer #13364

Merged
merged 6 commits into from
May 18, 2023

Conversation

etseidl
Contributor

@etseidl etseidl commented May 16, 2023

Description

Fixes #13250. The page size estimation for dictionary-encoded pages adds a term to estimate overhead bytes for the bit-packed header used when encoding bit-packed literal runs. This term originally used a value of 256, but it's hard to see where that value comes from. This PR changes the value to 8, a possible justification being that the minimum length of a literal run is 8 values. The worst case would be multiple runs of 8, with the required overhead bytes then being num_values/8. This also adds a test that has been verified to fail for values larger than 16 in the problematic term.
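The worst-case reasoning above can be sketched as follows (a rough illustration with a hypothetical function name and shape; the actual cudf formula lives in `max_RLE_page_size()` and has more terms):

```python
# Illustrative sketch only -- NOT the actual cudf max_RLE_page_size() code.
def max_rle_page_size(num_values: int, bit_width: int, run_divisor: int) -> int:
    # bytes needed if every value were bit-packed back to back
    packed_bytes = (num_values * bit_width + 7) // 8
    # each bit-packed literal run carries a small header; the shortest
    # literal run is 8 values, so the worst case is one header per 8 values
    header_bytes = num_values // run_divisor
    return packed_bytes + header_bytes

old_estimate = max_rle_page_size(20000, 12, run_divisor=256)  # pre-fix term
new_estimate = max_rle_page_size(20000, 12, run_divisor=8)    # this PR
```

With a divisor of 256 the header allowance falls well short of the num_values/8 worst case, which is how the encoder could overrun its output buffer.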

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@etseidl etseidl requested a review from a team as a code owner May 16, 2023 20:50
@rapids-bot

rapids-bot bot commented May 16, 2023

Pull requests from external contributors require approval from a rapidsai organization member with write or admin permissions before CI can begin.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label May 16, 2023
@etseidl etseidl changed the title Page size est fix Fix page size estimation in Parquet writer May 16, 2023
@GregoryKimball GregoryKimball requested a review from vuule May 16, 2023 20:54
@GregoryKimball GregoryKimball added bug Something isn't working 3 - Ready for Review Ready for review by team cuIO cuIO issue labels May 16, 2023
@davidwendt davidwendt added the non-breaking Non-breaking change label May 16, 2023
// a "worst-case" for page size estimation in the presence of a dictionary. have confirmed
// that this fails for values over 16 in the final term of `max_RLE_page_size()`.
auto elements0 = cudf::detail::make_counting_transform_iterator(0, [](auto i) {
  if ((i / 5) % 2 == 1) {
Contributor

why is the pattern changed every five elements?

Contributor Author

It worked 🤣 This was totally empirical...I changed the divisor until I started getting overwrites in the page encoder. Printing out the run lengths, it seemed to give literal runs of 8 followed by repeated runs of 2. Pretty close to ideal for the purposes of the test.

Contributor

> it seemed to give literal runs of 8 followed by repeated runs of 2

Can you please add this info to the comment above?

Contributor Author

Thinking seriously about it, the pattern is CCCCCRRRRRCCCCCRRRRR..., where C changes, R is repeating. The encoder will take the first CCCCCRRR and emit a literal run of 8, followed by a repeated run of RR. Then this repeats with another CCCCCRRR, RR. I'd like to say this was rationally designed, but I got lucky 😄
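A minimal sketch of that data pattern (hypothetical names, not the actual gtest code):

```python
REPEAT = -1  # stand-in for the repeated value in the odd groups of five

def make_pattern(n):
    # mirrors `if ((i / 5) % 2 == 1)` in the test: values in odd groups
    # of five repeat, values in even groups keep changing
    return [REPEAT if (i // 5) % 2 == 1 else i for i in range(n)]

vals = make_pattern(20)
# [0, 1, 2, 3, 4, -1, -1, -1, -1, -1, 10, 11, 12, 13, 14, -1, -1, -1, -1, -1]
```

An RLE/bit-pack hybrid encoder that emits literal runs in multiples of 8 takes the first `[0, 1, 2, 3, 4, -1, -1, -1]` as a literal run of 8, then `[-1, -1]` as a repeated run of 2, and the cycle repeats — close to the worst case for per-run header overhead.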

@vuule
Contributor

vuule commented May 16, 2023

/ok to test

@vuule
Contributor

vuule commented May 16, 2023

/ok to test


@vuule vuule left a comment


Great detective work, thank you for the fix!

@etseidl
Contributor Author

etseidl commented May 17, 2023

I've been thinking about it some more, and for small dictionary bitwidths I think this will still have issues. The real size for this 8/2 case is going to be `(bitwidth * num_vals + 7)/8 + num_vals/8` for the 8/10s of the values that are bit-packed, plus `(1 + (bitwidth + 7)/8) * num_runs` for the 2/10s that are RLE encoded. I plugged that into a spreadsheet and varied the bitwidth, and found that for bitwidth < 8, or equal to 10 or 11, the estimated size is still going to be too low. I'll have to modify my test to see if I can demonstrate the need for a better RLE estimate.

Going to switch to draft until I can pin this down...don't want to have to hunt this down again 😅
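The spreadsheet comparison described above can be reconstructed roughly like this. Note the `estimate` formula is a guess at cudf's, so the exact crossover bitwidths found here need not match the ones quoted in the comment:

```python
# Hypothetical reconstruction of the spreadsheet check -- not cudf code.
def estimate(n, bw):
    # fixed formula: bit-packed bytes plus one header byte per 8 values
    return (bw * n + 7) // 8 + n // 8

def actual(n, bw):
    # 8 of every 10 values land in literal runs (one header per 8 values),
    # 2 of every 10 in repeated runs of length 2
    lit_vals = n * 8 // 10
    lit = (bw * lit_vals + 7) // 8 + lit_vals // 8
    rle_runs = n // 10
    rle = (1 + (bw + 7) // 8) * rle_runs
    return lit + rle

# bitwidths for which this reconstruction says the estimate is too small
under = [bw for bw in range(1, 17) if actual(20000, bw) > estimate(20000, bw)]
```

Under these assumed formulas the estimate comes up short for several small bitwidths and for a couple just above 8 (where the RLE run payload grows to two bytes), which matches the shape of the concern even if the exact set differs.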

@etseidl etseidl marked this pull request as draft May 17, 2023 01:47
@etseidl etseidl marked this pull request as ready for review May 17, 2023 04:05
@etseidl
Contributor Author

etseidl commented May 17, 2023

I did more testing, and when trying to get to bitwidths low enough to trigger problems, the encoder wound up picking different splits, so the current formula was still an overestimate. I'm going to leave max_RLE_page_size as is and add a TODO to reevaluate if the added corruption warning gets triggered.

@ttnghia
Contributor

ttnghia commented May 17, 2023

Do you have any benchmark for this fix?

@galipremsagar
Contributor

/ok to test

@nvdbaranec
Contributor

nvdbaranec commented May 17, 2023

> Do you have any benchmark for this fix?

I'm curious too what the downstream effects are of this. Is it something along the lines of:

  • we are generally overestimating the size of the header for bitpacked data in dictionaries.
  • therefore we are overestimating the size of dictionaries
  • therefore we are sometimes not creating a dictionary due to page limit restrictions

?

@etseidl
Contributor Author

etseidl commented May 17, 2023

By this point the decision to use a dictionary has been made. This is just calculating the size of buffer needed to hold the encoded data. For the given data, the size estimate is too small, so the encoder overruns the buffer and injects bad keys into a different page's buffer. This is just a more conservative estimate of the space needed to encode. I'd expect a small increase in peak memory, but will be surprised if it's major. More like a few hundred bytes per page.

@nvdbaranec
Contributor

Ah, I see. Thanks for the explanation.

@etseidl
Contributor Author

etseidl commented May 17, 2023

Here's the peak mem from nvbench where there was a difference. The overhead was a bit more than I thought, but still just under 2% overall.

23.06
| data_type | cardinality | run_length |  CPU Time  | bytes_per_second | peak_memory_usage | 
|-----------|-------------|------------|------------|------------------|-------------------|
|     FLOAT |           0 |          1 | 108.456 ms |       4950351609 |         1.100 GiB |       
|   DECIMAL |           0 |          1 | 112.537 ms |       4770831599 |         1.091 GiB |       
| TIMESTAMP |           0 |          1 | 132.757 ms |       4044169636 |         1.171 GiB |       
|    STRING |           0 |          1 | 133.119 ms |       4033187436 |         1.342 GiB |       
|    STRUCT |           0 |          1 | 146.856 ms |       3655906305 |         1.284 GiB |       
|    STRUCT |           0 |         32 | 119.807 ms |       4481333437 |         1.474 GiB |       


this PR
| data_type | cardinality | run_length |  CPU Time  | bytes_per_second | peak_memory_usage | 
|-----------|-------------|------------|------------|------------------|-------------------|
|     FLOAT |           0 |          1 | 108.067 ms |       4968196421 |         1.122 GiB |       
|   DECIMAL |           0 |          1 | 112.582 ms |       4768935960 |         1.105 GiB |
| TIMESTAMP |           0 |          1 | 132.141 ms |       4063026417 |         1.193 GiB |       
|    STRING |           0 |          1 | 131.600 ms |       4079737251 |         1.351 GiB |       
|    STRUCT |           0 |          1 | 146.687 ms |       3660118264 |         1.300 GiB |       
|    STRUCT |           0 |         32 | 120.336 ms |       4461647293 |         1.501 GiB |   


@karthikeyann karthikeyann left a comment


cpp approval.

@vuule
Contributor

vuule commented May 18, 2023

/merge

@rapids-bot rapids-bot bot merged commit 8c2f76e into rapidsai:branch-23.06 May 18, 2023
@etseidl etseidl deleted the page_size_est_fix branch May 19, 2023 00:42

Successfully merging this pull request may close these issues.

[BUG] Written Parquet File Cannot be Loaded by Other Packages (pandas & dask)
8 participants