Improve over-estimating for ORC coalescing reading #3275

wbo4958 · 2021-08-23T10:20:17Z

This PR is for issue #2945 .

Let's assume we're coalescing 2 orc files with 2 stripes for each file in a Spark Task.

The original initial estimated size like below,

size_of (“ORC”) + 
size_of stripe1 (DATA + StripeFooter) +
size_of stripe2 (DATA + StripeFooter) + 
size_of stripe3 (DATA + StripeFooter) + 
size_of stripe4 (DATA + StripeFooter) +
size_of FOOTER1 +
size_of FOOTER2  +
256 (Max size of PostScript) + 
1 (Length of PostScript) +
128 * 1024 (Add in a bit of fudging in case the whole file is being consumed and
                   our codec version isn't as efficient as the original writer's codec.)

We have added all ORC file's FOOTER size (size_of FOOTER1 + size_of FOOTER2) in order to estimate the StripeInformation size in FOOTER. It seems a little over-estimating since FOOTER size includes not only StripeInformation but also Schema/ColumnStatistics ...

This PR uses a fixed worst-case StripeInformation size instead of using ORC files' FOOTER size.

Below is the comparing test on 5000 orc files total 1.3G

over-estimating = initialized estimated size - the final true size

spark task number	upstream over-estimating size	with this PR over-estimating size
12	14.1M	1.6M
162	32.7M	20.6M

Please be note, we have added 128 * 1024 as the fudging. If we reduce this size, the over-estimating can only be 1.6M - 12 * (128*1024) = 102K and 20.6M - 162 * (128*1024) = 358.4K

Signed-off-by: Bobby Wang <wbo4958@gmail.com>

wbo4958 · 2021-08-23T10:20:38Z

build

pxLi · 2021-08-24T01:09:56Z

build

wbo4958 · 2021-08-24T03:04:08Z

@tgravescs Hi Tom, Could you help to review this PR?

tgravescs · 2021-08-24T13:22:12Z

what is the logic behind the 128 * 1024 size? Have we run into cases we actually hit this and what was it?

tgravescs

overall looks fine, just curious if we can improve or document on the 128*1024.

wbo4958 · 2021-08-24T23:04:19Z

overall looks fine, just curious if we can improve or document on the 128*1024.

@tgravescs I don't know, it's from the very beginning. From the comments, looks like the different codec between the original file and the current one we use may consume different buffer.

Improve over-estimating for ORC coalescing reading

8212c22

Signed-off-by: Bobby Wang <wbo4958@gmail.com>

wbo4958 requested a review from tgravescs August 23, 2021 10:20

tgravescs approved these changes Aug 24, 2021

View reviewed changes

wbo4958 merged commit 7f7a4da into NVIDIA:branch-21.10 Aug 24, 2021

wbo4958 deleted the over-estimating branch August 24, 2021 23:04

sameerz added the performance A performance related task/issue label Aug 25, 2021

sameerz added this to the Aug 16 - Aug 27 milestone Aug 25, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve over-estimating for ORC coalescing reading #3275

Improve over-estimating for ORC coalescing reading #3275

wbo4958 commented Aug 23, 2021

wbo4958 commented Aug 23, 2021

pxLi commented Aug 24, 2021

wbo4958 commented Aug 24, 2021

tgravescs commented Aug 24, 2021 •

edited

Loading

tgravescs left a comment

wbo4958 commented Aug 24, 2021

Improve over-estimating for ORC coalescing reading #3275

Improve over-estimating for ORC coalescing reading #3275

Conversation

wbo4958 commented Aug 23, 2021

wbo4958 commented Aug 23, 2021

pxLi commented Aug 24, 2021

wbo4958 commented Aug 24, 2021

tgravescs commented Aug 24, 2021 • edited Loading

tgravescs left a comment

Choose a reason for hiding this comment

wbo4958 commented Aug 24, 2021

tgravescs commented Aug 24, 2021 •

edited

Loading