
Improve over-estimating for ORC coalescing reading #3275

Merged: 1 commit merged into NVIDIA:branch-21.10 on Aug 24, 2021

Conversation

@wbo4958 (Collaborator) commented on Aug 23, 2021:

This PR is for issue #2945 .

Assume we're coalescing two ORC files, each with two stripes, in a single Spark task.

[Diagram: orc-coalesce — two ORC files, two stripes each, coalesced into one buffer]

The original initial size estimate was computed as:

size_of("ORC") +
size_of stripe1 (DATA + StripeFooter) +
size_of stripe2 (DATA + StripeFooter) +
size_of stripe3 (DATA + StripeFooter) +
size_of stripe4 (DATA + StripeFooter) +
size_of FOOTER1 +
size_of FOOTER2 +
256 (max size of PostScript) +
1 (length of PostScript) +
128 * 1024 (a bit of fudging in case the whole file is being consumed and
            our codec version isn't as efficient as the original writer's codec)
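
As a rough sketch of that computation (the names here are illustrative, not the plugin's actual API: `stripeLens` holds each stripe's DATA + StripeFooter size, `footerLens` the source files' serialized FOOTER sizes):

```scala
// Hypothetical sketch of the original (pre-PR) estimate; names are
// illustrative, not the plugin's actual API.
def originalEstimate(stripeLens: Seq[Long], footerLens: Seq[Long]): Long = {
  val magic = "ORC".length   // 3-byte file header magic
  val postScript = 256 + 1   // max PostScript size plus its 1-byte length
  val fudge = 128 * 1024     // codec-mismatch fudge factor
  magic + stripeLens.sum + footerLens.sum + postScript + fudge
}
```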

We added every source ORC file's FOOTER size (size_of FOOTER1 + size_of FOOTER2) to account for the StripeInformation entries in the coalesced FOOTER. That over-estimates, since a FOOTER contains not only StripeInformation but also the Schema, ColumnStatistics, and so on.

This PR uses a fixed worst-case StripeInformation size instead of the source files' FOOTER sizes, as sketched below.
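
A sketch of the revised idea under the same assumed names: reserve a fixed worst case per StripeInformation entry, and account for the coalesced FOOTER's one-time contents (Schema, ColumnStatistics, ...) separately. Both the constant and the `footerBase` parameter are assumptions for illustration, not the values the PR actually uses:

```scala
// Hypothetical sketch of the revised estimate. WORST_CASE_STRIPE_INFO is an
// assumed upper bound per StripeInformation entry, not necessarily the
// constant the PR chose; footerBase stands in for the one-time footer
// overhead (Schema, ColumnStatistics, ...).
val WORST_CASE_STRIPE_INFO: Long = 128L

def revisedEstimate(stripeLens: Seq[Long], footerBase: Long): Long = {
  val magic = "ORC".length   // 3-byte file header magic
  val postScript = 256 + 1   // max PostScript size plus its 1-byte length
  val fudge = 128 * 1024     // same fudge factor as before
  magic + stripeLens.sum +
    stripeLens.size * WORST_CASE_STRIPE_INFO + // one entry per coalesced stripe
    footerBase + postScript + fudge
}
```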

Below is a comparison test on 5000 ORC files totaling 1.3 GB:

over-estimating = initial estimated size - final true size

| Spark task count | Upstream over-estimating size | Over-estimating size with this PR |
| --- | --- | --- |
| 12 | 14.1M | 1.6M |
| 162 | 32.7M | 20.6M |

Please note that we still add 128 * 1024 bytes per task as the fudge factor. If we subtract that out, the remaining over-estimate is only 1.6M - 12 * (128*1024) ≈ 102K and 20.6M - 162 * (128*1024) ≈ 358.4K (see the arithmetic check below).
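
A quick check of that arithmetic, assuming the M/K figures above are binary units (MiB/KiB):

```scala
// Residual over-estimate once the per-task fudge factor is subtracted.
val fudge = 128 * 1024                      // 128 KiB per task
val mib = 1024.0 * 1024
println((1.6 * mib - 12 * fudge) / 1024)    // ≈ 102.4 (KiB)
println((20.6 * mib - 162 * fudge) / 1024)  // ≈ 358.4 (KiB)
```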

Signed-off-by: Bobby Wang <wbo4958@gmail.com>
@wbo4958 wbo4958 requested a review from tgravescs August 23, 2021 10:20
@wbo4958 (Collaborator, Author) commented on Aug 23, 2021:

build

@pxLi (Collaborator) commented on Aug 24, 2021:

build

@wbo4958 (Collaborator, Author) commented on Aug 24, 2021:

@tgravescs Hi Tom, could you help review this PR?

@tgravescs (Collaborator) commented on Aug 24, 2021:

What is the logic behind the 128 * 1024 size? Have we run into cases where we actually hit this, and what was it?

@tgravescs (Collaborator) left a review comment:


Overall looks fine; just curious whether we can improve or document the 128*1024.

@wbo4958 (Collaborator, Author) commented on Aug 24, 2021:

> overall looks fine, just curious if we can improve or document on the 128*1024.

@tgravescs I don't know; it has been there from the very beginning. Judging from the comments, the codec used for the original file and the one we use now may consume different buffer sizes.

@wbo4958 wbo4958 merged commit 7f7a4da into NVIDIA:branch-21.10 Aug 24, 2021
@wbo4958 wbo4958 deleted the over-estimating branch August 24, 2021 23:04
@sameerz sameerz added the performance label (a performance related task/issue) on Aug 25, 2021
@sameerz sameerz added this to the Aug 16 - Aug 27 milestone Aug 25, 2021