Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent deadlock between RapidsBufferStore and RapidsBufferBase on close #4669

Merged
merged 5 commits into from
Feb 7, 2022

Conversation

abellina
Copy link
Collaborator

@abellina abellina commented Jan 31, 2022

Signed-off-by: Alessandro Bellina abellina@nvidia.com

Targeting this to 22.04 but I can target to 22.02 if desired.

Closes #4664.

Prevents a deadlock between RapidsBufferStore and RapidsBufferBase on shutdown, removing code that was added before that was causing #4664 (this patch no longer stops the context -> no longer causes shuffle id conflicts on register)

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
@abellina abellina requested a review from jlowe January 31, 2022 18:53
@sameerz sameerz added the bug Something isn't working label Jan 31, 2022
@sameerz sameerz added this to the Jan 31 - Feb 11 milestone Jan 31, 2022
@abellina abellina changed the title Remove all buffers from the RapidsBufferCatalog on shutdown Prevent deadlock between RapidsBufferStore and RapidsBufferBase on close Feb 1, 2022
@abellina
Copy link
Collaborator Author

abellina commented Feb 1, 2022

This PR: #4374, added a .close for the SparkSession in the mortgage suite, saying that there was a hang, but not explaining the root cause. I removed the .close and that uncovered a deadlock, handled here (I assume that this is the reason for the context close). I reverted my other change to remove all buffers because this keeps the SparkContext alive, and so the conflict of shuffle ids is no longer seen.

That said, I now see device memory leaks. I'll dig at that next. I am not sure if those are related to my change.

@abellina abellina marked this pull request as draft February 1, 2022 06:46
@abellina
Copy link
Collaborator Author

abellina commented Feb 1, 2022

Ok the reason for the memory leaks is because of the shutdown hook logic in the JVM. The shutdown hooks, as far as I understand, run in parallel, so a registered hook for the MemoryCleaner runs at any random time, even before Spark is able to stop it's context, unregister shuffles, etc. One approach that would remove ambiguity would be to change the MemoryCleaner to not hook directly to the JVM, but instead provide the API so that a custom shutdown order could be imposed (potentially registering the MemoryCleaner against Spark's ShutdownHookManager).

I think this is a nice to have but it would make the code easier to follow. It took me a while to realize that the cuDF MemoryCleaner is racing against spark's ShutdownHookManager.

@abellina abellina marked this pull request as ready for review February 1, 2022 18:51
@abellina
Copy link
Collaborator Author

abellina commented Feb 2, 2022

build

Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to file an issue about the underlying problem that still isn't addressed, re: closing and restarting the Spark session

@abellina
Copy link
Collaborator Author

abellina commented Feb 2, 2022

build

@abellina
Copy link
Collaborator Author

abellina commented Feb 4, 2022

build

@abellina
Copy link
Collaborator Author

abellina commented Feb 4, 2022

It is not clear to me why the previous one failed. I am looking to get some logs out of this to see if I can figure it out.

@abellina
Copy link
Collaborator Author

abellina commented Feb 5, 2022

build

@abellina
Copy link
Collaborator Author

abellina commented Feb 7, 2022

I am going to merge this. The previous time this failed, it seemed to die at GpuDeviceManagerSuite. Looking at the test, it seems to be a really simple test, that is just checking values constructed in the suite. So I am not sure why it failed, and I've had successful CI with this patch a few times.

@abellina abellina merged commit 4512075 into NVIDIA:branch-22.04 Feb 7, 2022
@abellina abellina deleted the mortage_debug branch February 7, 2022 18:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] MortgageAdaptiveSparkSuite failed with duplicate buffer exception
3 participants