
Fix fake memory leaks in some test cases [databricks] #5955

Merged: 5 commits, Jul 8, 2022

Conversation

res-life (Collaborator) commented on Jul 6, 2022

Contributes to #5854

Problem

An error log claiming that RapidsHostMemoryStore.pool leaked is printed when running RAPIDS Accelerator test cases, even though all tests passed:

22/06/27 17:45:57.298 Thread-7 ERROR HostMemoryBuffer: A HOST BUFFER WAS LEAKED (ID: 1 7f8557fff010)
22/06/27 17:45:57.303 Thread-7 ERROR MemoryCleaner: Leaked host buffer (ID: 1): 2022-06-27 09:45:16.0171 UTC: INC
java.lang.Thread.getStackTrace(Thread.java:1559)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:301)
ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:82)
ai.rapids.cudf.MemoryBuffer.incRefCount(MemoryBuffer.java:232)
ai.rapids.cudf.MemoryBuffer.<init>(MemoryBuffer.java:98)
ai.rapids.cudf.HostMemoryBuffer.<init>(HostMemoryBuffer.java:196)
ai.rapids.cudf.HostMemoryBuffer.<init>(HostMemoryBuffer.java:192)
ai.rapids.cudf.HostMemoryBuffer.allocate(HostMemoryBuffer.java:144)
com.nvidia.spark.rapids.RapidsHostMemoryStore.<init>(RapidsHostMemoryStore.scala:38)

Root cause

RapidsHostMemoryStore.pool is not closed before MemoryCleaner checks for leaks.
It is not actually a leak; it is caused by the execution order of the shutdown hooks.
RapidsHostMemoryStore.pool is closed in the Spark executor plugin shutdown hook:

plugins.foreach(_.shutdown()) // this line eventually closes the RapidsHostMemoryStore.pool

The close path is:

  Spark executor plugin hook ->
    RapidsExecutorPlugin.shutdown ->
      GpuDeviceManager.shutdown ->
        RapidsBufferCatalog.close() ->
          RapidsHostMemoryStore.close ->
            RapidsHostMemoryStore.pool.close

Solution

First, remove the default leak-checking hook in MemoryCleaner; then re-add it through the Hadoop ShutdownHookManager (which Spark's own shutdown handling also leverages) with a priority low enough that it runs after the Spark shutdown hooks.

See the cuDF side change: rapidsai/cudf#11161
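The fix relies on the fact that Hadoop's ShutdownHookManager runs hooks in priority order (highest first), whereas plain `Runtime.addShutdownHook` gives no ordering guarantee. The sketch below illustrates that idea only; the class and method names are invented for the illustration and are not the real Hadoop or cuDF API.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Minimal sketch of priority-ordered shutdown hooks, analogous to what
// Hadoop's ShutdownHookManager provides. Hooks with a higher priority
// run first, so a leak checker registered at a low priority (e.g. 20)
// is guaranteed to run after resource-closing hooks.
public class PriorityHooks {
    private static class Hook {
        final Runnable body;
        final int priority;
        Hook(Runnable body, int priority) { this.body = body; this.priority = priority; }
    }

    // Highest priority first, hence the reversed comparator.
    private static final PriorityQueue<Hook> hooks =
        new PriorityQueue<>(Comparator.comparingInt((Hook h) -> h.priority).reversed());

    static void addShutdownHook(Runnable body, int priority) {
        hooks.add(new Hook(body, priority));
    }

    // In a real manager this would be registered once as a single JVM
    // shutdown hook, making the relative order deterministic.
    static void runAll() {
        Hook h;
        while ((h = hooks.poll()) != null) {
            h.body.run();
        }
    }

    static String demo() {
        StringBuilder order = new StringBuilder();
        // Spark-style resource cleanup at a higher priority...
        addShutdownHook(() -> order.append("closePool;"), 30);
        // ...and the leak checker at priority 20, so it runs afterwards.
        addShutdownHook(() -> order.append("checkLeaks;"), 20);
        runAll();
        return order.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // closePool;checkLeaks;
    }
}
```

With plain `Runtime.addShutdownHook`, both hooks could run concurrently in either order, which is exactly how the false leak report arose.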

Signed-off-by: Chong Gao <res_life@163.com>
res-life (Collaborator, Author) commented on Jul 6, 2022

Depends on rapidsai/cudf#11161

Tested, and it works.

revans2 previously approved these changes on Jul 6, 2022
Comment on lines 273 to 274
val REF_COUNT_DEBUG_STR = System.getProperty(MemoryCleaner.REF_COUNT_DEBUG_KEY, "false")
if (REF_COUNT_DEBUG_STR.equalsIgnoreCase("true")) {
nit:

Suggested change
val REF_COUNT_DEBUG_STR = System.getProperty(MemoryCleaner.REF_COUNT_DEBUG_KEY, "false")
if (REF_COUNT_DEBUG_STR.equalsIgnoreCase("true")) {
if (java.lang.Boolean.getBoolean(MemoryCleaner.REF_COUNT_DEBUG_KEY)) {

res-life (Collaborator, Author) replied:

Done.
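For context on the one-line suggestion above: `java.lang.Boolean.getBoolean(name)` returns true only when the named system property exists and equals "true" ignoring case, which is exactly what the replaced two-line check did. A quick standalone demonstration (the property key below is made up for the example, not necessarily cuDF's actual `REF_COUNT_DEBUG_KEY` value):

```java
public class GetBooleanDemo {
    public static void main(String[] args) {
        String key = "demo.refcount.debug"; // illustrative key, not the real cuDF property

        // Case-insensitive match on "true" yields true.
        System.setProperty(key, "TRUE");
        System.out.println(Boolean.getBoolean(key)); // true

        // Any other value yields false.
        System.setProperty(key, "yes");
        System.out.println(Boolean.getBoolean(key)); // false

        // A missing property also yields false, matching the old
        // getProperty(key, "false") default.
        System.clearProperty(key);
        System.out.println(Boolean.getBoolean(key)); // false
    }
}
```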

Comment on lines 275 to 289
MemoryCleaner.removeDefaultShutdownHook()
// Shutdown hooks are executed concurrently in JVM, and there is no execution order guarantee.
// See the doc of `Runtime.addShutdownHook`.
// Some resources are closed in Spark hooks.
// Here we should wait Spark hooks to be done, or a false leak will be detected.
// See issue: https://github.com/NVIDIA/spark-rapids/issues/5854
//
// `Spark ShutdownHookManager` leverages `Hadoop ShutdownHookManager` to manage hooks with
// priority. The priority parameter will guarantee the execution order.
//
// Here also use `Hadoop ShutdownHookManager` to add a lower priority hook.
// 20 priority is small enough, will run after Spark hooks.
// Note: `ShutdownHookManager.get()` is a singleton
org.apache.hadoop.util.ShutdownHookManager.get().addShutdownHook(
MemoryCleaner.DEFAULT_SHUTDOWN_RUNNABLE, 20)

nit: if you follow the suggestion on the cudf PR you could do

Suggested change
MemoryCleaner.removeDefaultShutdownHook()
// Shutdown hooks are executed concurrently in JVM, and there is no execution order guarantee.
// See the doc of `Runtime.addShutdownHook`.
// Some resources are closed in Spark hooks.
// Here we should wait Spark hooks to be done, or a false leak will be detected.
// See issue: https://github.com/NVIDIA/spark-rapids/issues/5854
//
// `Spark ShutdownHookManager` leverages `Hadoop ShutdownHookManager` to manage hooks with
// priority. The priority parameter will guarantee the execution order.
//
// Here also use `Hadoop ShutdownHookManager` to add a lower priority hook.
// 20 priority is small enough, will run after Spark hooks.
// Note: `ShutdownHookManager.get()` is a singleton
org.apache.hadoop.util.ShutdownHookManager.get().addShutdownHook(
MemoryCleaner.DEFAULT_SHUTDOWN_RUNNABLE, 20)
// Shutdown hooks are executed concurrently in JVM, and there is no execution order guarantee.
// See the doc of `Runtime.addShutdownHook`.
// Some resources are closed in Spark hooks.
// Here we should wait Spark hooks to be done, or a false leak will be detected.
// See issue: https://github.com/NVIDIA/spark-rapids/issues/5854
//
// `Spark ShutdownHookManager` leverages `Hadoop ShutdownHookManager` to manage hooks with
// priority. The priority parameter will guarantee the execution order.
//
// Here also use `Hadoop ShutdownHookManager` to add a lower priority hook.
// 20 priority is small enough, will run after Spark hooks.
// Note: `ShutdownHookManager.get()` is a singleton
org.apache.hadoop.util.ShutdownHookManager.get().addShutdownHook(
MemoryCleaner.removeDefaultShutdownHook(), 20)

res-life (Collaborator, Author) replied:

Done.

/**
* Re-register leaks checking hook if configured.
*/
private def ReRegisterCheckLeakHook: Unit = {

nit: this is a method (it should start with a lowercase letter), and since it has side effects it should take an empty parameter list in parens.

Suggested change
private def ReRegisterCheckLeakHook: Unit = {
private def reRegisterCheckLeakHook(): Unit = {

res-life (Collaborator, Author) replied:

Done.

sameerz added the "task" label (Work required that improves the product but is not user facing) on Jul 6, 2022
gerashegalov previously approved these changes on Jul 7, 2022

gerashegalov (Collaborator) left a comment:

LGTM

abellina (Collaborator) commented on Jul 7, 2022

build

res-life (Collaborator, Author) commented on Jul 8, 2022

build

res-life (Collaborator, Author) commented on Jul 8, 2022

build

res-life changed the title from "Fix fake memory leaks in some test cases" to "Fix fake memory leaks in some test cases [databricks]" on Jul 8, 2022
res-life (Collaborator, Author) commented on Jul 8, 2022

build

res-life merged commit ba2682c into NVIDIA:branch-22.08 on Jul 8, 2022
res-life deleted the cleaner-check-leaks branch on Jul 8, 2022 08:36
res-life pushed a commit to res-life/spark-rapids that referenced this pull request Jul 11, 2022