Handle mmap exception more gracefully in RapidsShuffleServer #3049

Merged


@abellina (Collaborator) commented Jul 27, 2021

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>

Improves handling of a potential IOException when attempting to mmap on a system that is low on resources, has many small shuffle blocks, or has a low vm.max_map_count setting. This is an improvement related to #3040, but it is not the full solution.
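For anyone trying to tell whether a node is close to the limit mentioned above, here is a minimal diagnostic sketch. It is not part of this PR. It assumes a Linux host, where `/proc/sys/vm/max_map_count` holds the per-process limit and `/proc/self/maps` has one line per current mapping; the class and method names (`MapCountCheck`, `readMaxMapCount`, `countCurrentMappings`) are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.OptionalLong;
import java.util.stream.Stream;

// Hypothetical helper, not code from the plugin. Linux-only: on other
// platforms both methods return an empty OptionalLong.
public class MapCountCheck {

    // Reads the per-process limit on memory map areas (vm.max_map_count).
    public static OptionalLong readMaxMapCount() {
        try {
            byte[] raw = Files.readAllBytes(Paths.get("/proc/sys/vm/max_map_count"));
            return OptionalLong.of(Long.parseLong(new String(raw).trim()));
        } catch (IOException | NumberFormatException e) {
            return OptionalLong.empty();
        }
    }

    // Counts the mappings currently held by this JVM: one line per
    // mapping in /proc/self/maps.
    public static OptionalLong countCurrentMappings() {
        try (Stream<String> lines = Files.lines(Paths.get("/proc/self/maps"))) {
            return OptionalLong.of(lines.count());
        } catch (IOException | UncheckedIOException e) {
            return OptionalLong.empty();
        }
    }
}
```

Comparing the two values from inside an executor gives a rough idea of how much headroom is left before `mmap` starts failing.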

The root of the problem is that there can be many spilled blocks, and each block is likely to be mmapped when it is read or transmitted. With many small blocks, the access pattern in RapidsShuffleServer can cause the issues documented in #3040.

With this code, I can get q72 at 3TB to fail to mmap during a surge of requests and then reattempt the mmap successfully. Note that this doesn't prevent other parts of the system from failing when we are close to the OS limits.
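To illustrate the "fail, then reattempt" behavior described above, here is a rough sketch of retrying a mapping after an IOException. This is not the code in this PR, and RapidsShuffleServer may reschedule the request rather than sleep; the class `MmapRetry`, the `Mapper` interface, and the attempt count and backoff values are all assumptions made for the example.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch, not the plugin's implementation.
public class MmapRetry {

    // Hypothetical hook so the mapping step can be replaced in tests.
    @FunctionalInterface
    public interface Mapper<T> {
        T map() throws IOException;
    }

    // Runs mapper up to maxAttempts times, doubling the sleep between
    // attempts. Rethrows the last IOException if every attempt fails.
    public static <T> T withRetry(Mapper<T> mapper, int maxAttempts, long initialBackoffMs)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoffMs = initialBackoffMs;
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return mapper.map();
            } catch (IOException e) {
                last = e; // e.g. the mmap failed because the OS limit was hit
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMs);
                    backoffMs *= 2;
                }
            }
        }
        throw last;
    }

    // Maps a region of a file read-only, retrying on IOException.
    // The attempt count (3) and initial backoff (10 ms) are arbitrary.
    public static MappedByteBuffer mapReadOnly(Path file, long offset, long length)
            throws IOException, InterruptedException {
        return withRetry(() -> {
            // The mapping stays valid after the channel is closed.
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                return ch.map(FileChannel.MapMode.READ_ONLY, offset, length);
            }
        }, 3, 10L);
    }
}
```

A retry like this only helps when the failure is transient, for example when other mappings are released between attempts. It does not raise the OS limit, which is why this change is not the full solution to #3040.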

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
@jlowe jlowe added the shuffle things that impact the shuffle plugin label Jul 28, 2021
@abellina (Collaborator, Author) commented

Thanks @jlowe. I added a couple of commits that should address the review comments.

@jlowe (Member) commented Jul 28, 2021

build

@abellina abellina merged commit a04baae into NVIDIA:branch-21.08 Jul 28, 2021
@abellina abellina deleted the shuffle/handle_mmap_failures_better branch July 28, 2021 21:06