Fix chunked reads of Parquet delta encoded pages #14921
Is this TODO for this PR or do we need a tracking issue for it?
It's an old TODO I moved from outside the function. @nvdbaranec do you think this TODO is still relevant?
I think it's fine to delete. At this point, any optimization effort needs to start at the top and look at the whole picture anyway.
Am I mistaken, or was it discussed and decided that we should find the real size?
In my mind the discussion was more about the delta encodings...this is more a note that there could be a faster way to do this (especially since the plain string size calculation is agonizingly slow). Unfortunately, at this point we don't yet know how many values are present, so we can't get an exact number of string bytes. But if an overestimate is OK, we could just use `dict_size` and save a lot of time in this step. Not sure if that's something to address in this PR.

Edit: actually, for V2 headers, we do know how many values there are. I'm going to change this and get rid of the TODO.