
Fix fake memory leaks in some test cases [databricks] #5955

Merged: 5 commits, Jul 8, 2022

Conversation

res-life (Collaborator) commented on Jul 6, 2022

Contributes to #5854

Problem

An error log claiming that RapidsHostMemoryStore.pool leaked is printed when running RAPIDS Accelerator test cases, even though all tests passed:

22/06/27 17:45:57.298 Thread-7 ERROR HostMemoryBuffer: A HOST BUFFER WAS LEAKED (ID: 1 7f8557fff010)
22/06/27 17:45:57.303 Thread-7 ERROR MemoryCleaner: Leaked host buffer (ID: 1): 2022-06-27 09:45:16.0171 UTC: INC
java.lang.Thread.getStackTrace(Thread.java:1559)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:301)
ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:82)
ai.rapids.cudf.MemoryBuffer.incRefCount(MemoryBuffer.java:232)
ai.rapids.cudf.MemoryBuffer.<init>(MemoryBuffer.java:98)
ai.rapids.cudf.HostMemoryBuffer.<init>(HostMemoryBuffer.java:196)
ai.rapids.cudf.HostMemoryBuffer.<init>(HostMemoryBuffer.java:192)
ai.rapids.cudf.HostMemoryBuffer.allocate(HostMemoryBuffer.java:144)
com.nvidia.spark.rapids.RapidsHostMemoryStore.<init>(RapidsHostMemoryStore.scala:38)

Root cause

RapidsHostMemoryStore.pool is not closed before MemoryCleaner checks for leaks.
It is not actually a leak; it is caused by the execution order of the shutdown hooks.
RapidsHostMemoryStore.pool is closed in the Spark executor plugin shutdown hook:

plugins.foreach(_.shutdown()) // this line eventually closes the RapidsHostMemoryStore.pool

The close path is:

  Spark executor plugin hook ->
    RapidsExecutorPlugin.shutdown ->
      GpuDeviceManager.shutdown ->
        RapidsBufferCatalog.close() ->
          RapidsHostMemoryStore.close ->
            RapidsHostMemoryStore.pool.close

Solution

First, remove the default leak-checking hook in MemoryCleaner; then re-add it through the Hadoop ShutdownHookManager (which Spark's own shutdown handling also leverages) with a priority low enough that it runs after the Spark shutdown hooks.

See the cuDF side change: rapidsai/cudf#11161
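The fix relies on the fact that Hadoop's ShutdownHookManager runs hooks in priority order (highest first), whereas plain `Runtime.addShutdownHook` gives no ordering guarantee. The sketch below illustrates that idea only; the class and method names are invented for the illustration and are not the real Hadoop or cuDF API.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Minimal sketch of priority-ordered shutdown hooks, analogous to what
// Hadoop's ShutdownHookManager provides. Hooks with a higher priority
// run first, so a leak checker registered at a low priority (e.g. 20)
// is guaranteed to run after resource-closing hooks.
public class PriorityHooks {
    private static class Hook {
        final Runnable body;
        final int priority;
        Hook(Runnable body, int priority) { this.body = body; this.priority = priority; }
    }

    // Highest priority first, hence the reversed comparator.
    private static final PriorityQueue<Hook> hooks =
        new PriorityQueue<>(Comparator.comparingInt((Hook h) -> h.priority).reversed());

    static void addShutdownHook(Runnable body, int priority) {
        hooks.add(new Hook(body, priority));
    }

    // In a real manager this would be registered once as a single JVM
    // shutdown hook, making the relative order deterministic.
    static void runAll() {
        Hook h;
        while ((h = hooks.poll()) != null) {
            h.body.run();
        }
    }

    static String demo() {
        StringBuilder order = new StringBuilder();
        // Spark-style resource cleanup at a higher priority...
        addShutdownHook(() -> order.append("closePool;"), 30);
        // ...and the leak checker at priority 20, so it runs afterwards.
        addShutdownHook(() -> order.append("checkLeaks;"), 20);
        runAll();
        return order.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // closePool;checkLeaks;
    }
}
```

With plain `Runtime.addShutdownHook`, both hooks could run concurrently in either order, which is exactly how the false leak report arose.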

Signed-off-by: Chong Gao <res_life@163.com>
res-life (Collaborator, Author) commented on Jul 6, 2022

Depends on rapidsai/cudf#11161

Tested, and it works.

revans2 previously approved these changes on Jul 6, 2022
Comment on lines 273 to 274
val REF_COUNT_DEBUG_STR = System.getProperty(MemoryCleaner.REF_COUNT_DEBUG_KEY, "false")
if (REF_COUNT_DEBUG_STR.equalsIgnoreCase("true")) {
nit:

Suggested change
val REF_COUNT_DEBUG_STR = System.getProperty(MemoryCleaner.REF_COUNT_DEBUG_KEY, "false")
if (REF_COUNT_DEBUG_STR.equalsIgnoreCase("true")) {
if (java.lang.Boolean.getBoolean(MemoryCleaner.REF_COUNT_DEBUG_KEY)) {

res-life (Collaborator, Author) replied:

Done.
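For context on the one-line suggestion above: `java.lang.Boolean.getBoolean(name)` returns true only when the named system property exists and equals "true" ignoring case, which is exactly what the replaced two-line check did. A quick standalone demonstration (the property key below is made up for the example, not necessarily cuDF's actual `REF_COUNT_DEBUG_KEY` value):

```java
public class GetBooleanDemo {
    public static void main(String[] args) {
        String key = "demo.refcount.debug"; // illustrative key, not the real cuDF property

        // Case-insensitive match on "true" yields true.
        System.setProperty(key, "TRUE");
        System.out.println(Boolean.getBoolean(key)); // true

        // Any other value yields false.
        System.setProperty(key, "yes");
        System.out.println(Boolean.getBoolean(key)); // false

        // A missing property also yields false, matching the old
        // getProperty(key, "false") default.
        System.clearProperty(key);
        System.out.println(Boolean.getBoolean(key)); // false
    }
}
```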

Comment on lines 275 to 289
MemoryCleaner.removeDefaultShutdownHook()
// Shutdown hooks are executed concurrently in JVM, and there is no execution order guarantee.
// See the doc of `Runtime.addShutdownHook`.
// Some resources are closed in Spark hooks.
// Here we should wait Spark hooks to be done, or a false leak will be detected.
// See issue: https://github.com/NVIDIA/spark-rapids/issues/5854
//
// `Spark ShutdownHookManager` leverages `Hadoop ShutdownHookManager` to manage hooks with
// priority. The priority parameter will guarantee the execution order.
//
// Here also use `Hadoop ShutdownHookManager` to add a lower priority hook.
// 20 priority is small enough, will run after Spark hooks.
// Note: `ShutdownHookManager.get()` is a singleton
org.apache.hadoop.util.ShutdownHookManager.get().addShutdownHook(
MemoryCleaner.DEFAULT_SHUTDOWN_RUNNABLE, 20)

nit: if you follow the suggestion on the cudf PR you could do

Suggested change
MemoryCleaner.removeDefaultShutdownHook()
// Shutdown hooks are executed concurrently in JVM, and there is no execution order guarantee.
// See the doc of `Runtime.addShutdownHook`.
// Some resources are closed in Spark hooks.
// Here we should wait Spark hooks to be done, or a false leak will be detected.
// See issue: https://github.com/NVIDIA/spark-rapids/issues/5854
//
// `Spark ShutdownHookManager` leverages `Hadoop ShutdownHookManager` to manage hooks with
// priority. The priority parameter will guarantee the execution order.
//
// Here also use `Hadoop ShutdownHookManager` to add a lower priority hook.
// 20 priority is small enough, will run after Spark hooks.
// Note: `ShutdownHookManager.get()` is a singleton
org.apache.hadoop.util.ShutdownHookManager.get().addShutdownHook(
MemoryCleaner.DEFAULT_SHUTDOWN_RUNNABLE, 20)
// Shutdown hooks are executed concurrently in JVM, and there is no execution order guarantee.
// See the doc of `Runtime.addShutdownHook`.
// Some resources are closed in Spark hooks.
// Here we should wait Spark hooks to be done, or a false leak will be detected.
// See issue: https://github.com/NVIDIA/spark-rapids/issues/5854
//
// `Spark ShutdownHookManager` leverages `Hadoop ShutdownHookManager` to manage hooks with
// priority. The priority parameter will guarantee the execution order.
//
// Here also use `Hadoop ShutdownHookManager` to add a lower priority hook.
// 20 priority is small enough, will run after Spark hooks.
// Note: `ShutdownHookManager.get()` is a singleton
org.apache.hadoop.util.ShutdownHookManager.get().addShutdownHook(
MemoryCleaner.removeDefaultShutdownHook(), 20)

res-life (Collaborator, Author) replied:

Done.

/**
* Re-register leaks checking hook if configured.
*/
private def ReRegisterCheckLeakHook: Unit = {

nit: this is a method (it should start with a lowercase letter), and since it has side effects it should take an empty parameter list in parens.

Suggested change
private def ReRegisterCheckLeakHook: Unit = {
private def reRegisterCheckLeakHook(): Unit = {

res-life (Collaborator, Author) replied:

Done.

sameerz added the "task" label (Work required that improves the product but is not user facing) on Jul 6, 2022
gerashegalov previously approved these changes on Jul 7, 2022

gerashegalov (Collaborator) left a comment:

LGTM

abellina (Collaborator) commented on Jul 7, 2022

build

res-life (Collaborator, Author) commented on Jul 8, 2022

build

res-life (Collaborator, Author) commented on Jul 8, 2022

build

res-life changed the title from "Fix fake memory leaks in some test cases" to "Fix fake memory leaks in some test cases [databricks]" on Jul 8, 2022
res-life (Collaborator, Author) commented on Jul 8, 2022

build

res-life merged commit ba2682c into NVIDIA:branch-22.08 on Jul 8, 2022
res-life deleted the cleaner-check-leaks branch on Jul 8, 2022 08:36
res-life pushed a commit to res-life/spark-rapids that referenced this pull request Jul 11, 2022