Handle mmap exception more gracefully in RapidsShuffleServer #3049
Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
Improves handling of a potential `IOException` when attempting to `mmap` on a system that is low on resources, has many small shuffle blocks, or has a low setting for `vm.max_map_count`. This is an improvement related to #3040, but it is not the full solution.

The root of the problem is that there can be many spilled blocks, and that each block is likely to get `mmap`ed when it is read or transmitted. When we have many small blocks, the access pattern in `RapidsShuffleServer` can create issues, as documented in #3040.

With this code, I can get q72 at 3TB to fail to `mmap` when there is a surge of requests, and then reattempt the `mmap` successfully. Note that this doesn't prevent other parts of the system from failing when we are close to the OS limits.
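For readers unfamiliar with the failure mode, here is a minimal sketch of the general idea of reattempting a mapping after an `IOException`. It is not the code in this PR: the class `MmapRetry`, the `IoSupplier` interface, and the `maxAttempts` and `backoffMillis` parameters are all hypothetical names chosen for illustration. It relies only on the fact that `FileChannel.map` reports a failed mapping as an `IOException`.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, for illustration only; not the PR's implementation.
public class MmapRetry {

    /** Hypothetical supplier type that is allowed to throw IOException. */
    @FunctionalInterface
    public interface IoSupplier<T> {
        T get() throws IOException;
    }

    /**
     * Runs op, retrying on IOException up to maxAttempts times in total.
     * Sleeps backoffMillis * attempt between attempts (linear backoff) so a
     * surge of requests has a chance to release mappings before the retry.
     */
    public static <T> T withRetry(IoSupplier<T> op, int maxAttempts, long backoffMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (IOException e) {
                last = e; // remember the failure; rethrown if all attempts fail
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis * attempt);
                }
            }
        }
        throw last;
    }

    /** Maps a whole file read-only; the mapping outlives the closed channel. */
    public static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }
}
```

A caller would wrap the mapping step, for example `MmapRetry.withRetry(() -> MmapRetry.mapReadOnly(path), 3, 50L)`. As noted above, a retry like this only smooths over transient surges; it does not help once the process is persistently at the `vm.max_map_count` limit.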