Handle mmap exception more gracefully in RapidsShuffleServer #3049

abellina · 2021-07-27T22:56:40Z

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>

Improves handling of a potential IOException when attempting to mmap on a system that is short on resources, has many small shuffle blocks, or has a low vm.max_map_count setting. This is an improvement related to #3040, but it is not the full solution.

The root of the problem is that there can be many spilled blocks, and each block is likely to be mmapped when it is read or transmitted. With many small blocks, the access pattern in RapidsShuffleServer can cause the issues documented in #3040.

With this code, I can get q72 at 3TB to fail to mmap when there is a surge of requests, and then reattempt the mmap successfully. Note that this doesn't prevent other parts of the system from failing when we are close to the OS limits.
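To illustrate the general idea of reattempting an mmap after an IOException, here is a minimal Java sketch. It is not the code from this PR: the class MmapRetry, the IoAction interface, the attempt limit, and the fixed backoff are all hypothetical names and choices made for illustration. Only FileChannel.open and FileChannel.map are standard JDK calls.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration; not part of the spark-rapids code base.
final class MmapRetry {
    // An operation that may fail with IOException, such as FileChannel.map.
    interface IoAction<T> {
        T run() throws IOException;
    }

    // Runs the action up to maxAttempts times, sleeping between failed attempts.
    // If every attempt fails, the last IOException is rethrown to the caller.
    static <T> T withRetry(IoAction<T> action, int maxAttempts, long backoffMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.run();
            } catch (IOException e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Assumption: a short pause gives other mappings time to be released.
                    Thread.sleep(backoffMillis);
                }
            }
        }
        throw last;
    }

    // Maps a whole file read-only, retrying when the map call fails.
    static MappedByteBuffer mapReadOnly(Path file, int maxAttempts, long backoffMillis)
            throws IOException, InterruptedException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            return withRetry(
                () -> ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()),
                maxAttempts, backoffMillis);
        }
    }
}
```

A bounded retry like this only helps with transient pressure, such as a surge of requests. If the process stays at the vm.max_map_count limit, every attempt fails and the last IOException still reaches the caller.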

sql-plugin/src/main/scala/com/nvidia/spark/rapids/shuffle/BufferSendState.scala

sql-plugin/src/main/scala/org/apache/spark/shuffle/RapidsShuffleExceptions.scala

abellina · 2021-07-28T18:11:23Z

Thanks @jlowe. I added a couple of commits that should address the review comments.

jlowe · 2021-07-28T18:37:06Z

build

Handle mmap exception more gracefully in RapidsShuffleServer

9c9d87f

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>

jlowe added the shuffle label (things that impact the shuffle plugin) Jul 28, 2021

jlowe reviewed Jul 28, 2021

abellina added 2 commits July 28, 2021 13:02

Add IOException as a cause, rather than suppressed

e6d9dce

Check that IOException is indeed added as cause as we roll up to the server

465799d
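The two commits above depend on the difference between a Throwable's cause and its suppressed exceptions. The Java sketch below shows that difference with standard Throwable APIs. The types SendPrepareException and CauseVsSuppressed are hypothetical stand-ins for illustration, not the exception classes touched by this PR.

```java
import java.io.IOException;

// Hypothetical exception type for illustration only.
final class SendPrepareException extends RuntimeException {
    SendPrepareException(String message, Throwable cause) {
        super(message, cause);
    }
}

final class CauseVsSuppressed {
    // Wraps the IOException as the cause, so it is reachable through getCause().
    static RuntimeException wrapAsCause(IOException ioe) {
        return new SendPrepareException("failed to prepare send", ioe);
    }

    // Attaches the IOException as suppressed; getCause() stays null.
    static RuntimeException attachAsSuppressed(IOException ioe) {
        RuntimeException e = new SendPrepareException("failed to prepare send", null);
        e.addSuppressed(ioe);
        return e;
    }

    // Walks the cause chain looking for an IOException, as a caller higher up might.
    static boolean causedByIo(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof IOException) {
                return true;
            }
        }
        return false;
    }
}
```

A caller that inspects only the cause chain finds the IOException in the first case and misses it in the second, which is one plausible reason to prefer the cause when the exception is rolled up to a higher layer.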

jlowe approved these changes Jul 28, 2021

abellina merged commit a04baae into NVIDIA:branch-21.08 Jul 28, 2021

abellina deleted the shuffle/handle_mmap_failures_better branch July 28, 2021 21:06
