Prevent deadlock between RapidsBufferStore and RapidsBufferBase on close #4669
Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsBufferCatalog.scala
This PR: #4374, added a [...]. That said, I now see device memory leaks. I'll dig into that next. I am not sure whether those are related to my change.
OK, the memory leaks are caused by the shutdown hook logic in the JVM. The shutdown hooks, as far as I understand, run in parallel, so a registered hook for the [...] I think this is a nice-to-have, but it would make the code easier to follow. It took me a while to realize that the cuDF MemoryCleaner is racing against Spark's ShutdownHookManager.
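The race described above follows from how the JVM treats hooks registered with `Runtime.addShutdownHook`: they are started in an unspecified order and run concurrently. One way to get a deterministic order is to register a single JVM hook that runs the cleanup steps sequentially by priority. The sketch below only illustrates that idea; `OrderedShutdown`, `register`, and `runAll` are hypothetical names, not part of spark-rapids, cuDF, or Spark.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper for illustration only; not an API of spark-rapids, cuDF, or Spark.
object OrderedShutdown {
  private val actions = ArrayBuffer.empty[(Int, () => Unit)]
  private var installed = false

  // Register a cleanup step. Higher priority runs first (an assumption of this sketch).
  def register(priority: Int)(action: => Unit): Unit = synchronized {
    actions += ((priority, () => action))
    if (!installed) {
      // A single JVM hook drives every step, so the steps cannot race each other.
      Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
        def run(): Unit = runAll()
      }))
      installed = true
    }
  }

  // Run all registered steps once, sequentially, highest priority first.
  def runAll(): Unit = {
    val ordered = synchronized {
      val snapshot = actions.sortBy(_._1)(Ordering[Int].reverse).toList
      actions.clear() // a later call (for example from the JVM hook) finds nothing to do
      snapshot
    }
    ordered.foreach { case (_, action) => action() }
  }
}
```

With this shape, a step that closes buffer stores could be registered at a higher priority than a step that checks for leaked device memory, so the leak check would always observe the stores already closed.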
build
It would be good to file an issue about the underlying problem that still isn't addressed: closing and restarting the Spark session.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsBufferStore.scala
build
build
It is not clear to me why the previous one failed. I am looking to get some logs out of this to see if I can figure it out.
build
I am going to merge this. The previous time this failed, it seemed to die at [...]
Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
Targeting this to 22.04 but I can target to 22.02 if desired.
Closes #4664.
Prevents a deadlock between RapidsBufferStore and RapidsBufferBase on shutdown by removing previously added code that was causing #4664. This patch no longer stops the context, so it no longer causes shuffle ID conflicts on register.
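For readers unfamiliar with this class of bug, a deadlock between a store and its buffers on close is typically a lock-order inversion: one thread holds the store lock and waits for a buffer lock, while another holds that buffer lock and waits for the store lock. The sketch below is a generic illustration under that assumption and is not the change made in this PR. `Store` and `Buffer` are hypothetical stand-ins, not the real RapidsBufferStore or RapidsBufferBase.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for illustration; not the real spark-rapids classes.
class Buffer(val id: Int, store: Store) {
  private var freed = false

  def isFreed: Boolean = synchronized(freed)

  // Lock order here: buffer lock first, then store lock (inside store.remove).
  def free(): Unit = synchronized {
    if (!freed) {
      freed = true
      store.remove(id)
    }
  }
}

class Store {
  private val buffers = mutable.Map.empty[Int, Buffer]

  def add(id: Int): Buffer = synchronized {
    val b = new Buffer(id, this)
    buffers(id) = b
    b
  }

  def remove(id: Int): Unit = synchronized { buffers.remove(id) }

  def size: Int = synchronized(buffers.size)

  // A deadlock-prone close would call free() on each buffer while still holding
  // the store lock (store -> buffer), the opposite order to Buffer.free.
  // Here the store lock is held only long enough to take a snapshot, so every
  // thread acquires locks in the same order: buffer first, then store.
  def close(): Unit = {
    val snapshot = synchronized(buffers.values.toList)
    snapshot.foreach(_.free())
  }
}
```

The design choice is to never call into a buffer while holding the store lock. A thread freeing buffers concurrently with `close()` can then interleave with it freely, and `free()` stays idempotent through the `freed` flag.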