[BUG] Memory leaked in some test cases #5854
res-life added the labels `bug` (Something isn't working) and `? - Needs Triage` (Need team to review and classify) on Jun 16, 2022.
This was referenced Jun 28, 2022
The following 3 sub-tasks share a single root cause: the leak-checking hook does not run after the Spark shutdown hooks.
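To see why the ordering matters: JVM shutdown hooks registered via `Runtime.addShutdownHook` are started concurrently and in an unspecified order, so a leak-check hook can run before the plugin's cleanup hook has finished. Below is a minimal sketch of that race; the object name and hook bodies are hypothetical stand-ins, not project code.

```scala
// Two JVM shutdown hooks with no ordering guarantee: the "leak check" hook
// may observe the pool before the "plugin shutdown" hook has closed it.
object ShutdownHookRaceSketch {
  @volatile private var poolClosed = false

  def main(args: Array[String]): Unit = {
    // Stand-in for the Spark plugin shutdown that eventually closes
    // RapidsHostMemoryStore.pool.
    Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
      override def run(): Unit = {
        Thread.sleep(100) // simulate cleanup work
        poolClosed = true
        println("plugin shutdown: pool closed")
      }
    }))
    // Stand-in for the MemoryCleaner leak check; because hook order is not
    // guaranteed, it may run first and report a false positive.
    Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
      override def run(): Unit = {
        if (!poolClosed) println("leak check: pool looks leaked (false positive)")
      }
    }))
  }
}
```

Depending on how the JVM schedules the two hook threads, either message can appear first, which mirrors the false "WAS LEAKED" reports in the test logs.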
rapids-bot pushed a commit to rapidsai/cudf that referenced this issue on Jul 7, 2022:
… hook in a custom shutdown hook manager (#11161)

Contributes to NVIDIA/spark-rapids#5854

### Problem
A `RapidsHostMemoryStore.pool` leaked error log is printed when running Rapids Accelerator test cases.

```
All tests passed.
22/06/27 17:45:57.298 Thread-7 ERROR HostMemoryBuffer: A HOST BUFFER WAS LEAKED (ID: 1 7f8557fff010)
22/06/27 17:45:57.303 Thread-7 ERROR MemoryCleaner: Leaked host buffer (ID: 1): 2022-06-27 09:45:16.0171 UTC: INC
java.lang.Thread.getStackTrace(Thread.java:1559)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:301)
ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:82)
ai.rapids.cudf.MemoryBuffer.incRefCount(MemoryBuffer.java:232)
ai.rapids.cudf.MemoryBuffer.<init>(MemoryBuffer.java:98)
ai.rapids.cudf.HostMemoryBuffer.<init>(HostMemoryBuffer.java:196)
ai.rapids.cudf.HostMemoryBuffer.<init>(HostMemoryBuffer.java:192)
ai.rapids.cudf.HostMemoryBuffer.allocate(HostMemoryBuffer.java:144)
com.nvidia.spark.rapids.RapidsHostMemoryStore.<init>(RapidsHostMemoryStore.scala:38)
```

### Root cause
`RapidsHostMemoryStore.pool` is not closed before `MemoryCleaner` checks for leaks. It is not actually a leak; it is caused by the execution order of the shutdown hooks. `RapidsHostMemoryStore.pool` is closed in the [Spark executor plugin hook](https://github.com/apache/spark/blob/v3.3.0/core/src/main/scala/org/apache/spark/executor/Executor.scala#L351toL381).

```
plugins.foreach(_.shutdown()) // this line will eventually close the RapidsHostMemoryStore.pool
```

The close path is:

```
Spark executor plugin hook ->
  RapidsExecutorPlugin.shutdown ->
  GpuDeviceManager.shutdown ->
  RapidsBufferCatalog.close() ->
  RapidsHostMemoryStore.close ->
  RapidsHostMemoryStore.pool.close
```

Rapids Accelerator JNI also checks for leaks in a shutdown hook. Shutdown hooks are executed concurrently, so there is no execution-order guarantee.

### Solution 1 - Not recommended
Just wait one second before checking for leaks in the `MemoryCleaner`. It modifies debug code and closing code, and has no impact on production code.

### Solution 2 - Not recommended
Spark has a util class `ShutdownHookManager`, which is a shutdown hook wrapper. It can [addShutdownHook with priority](https://github.com/apache/spark/blob/v3.3.0/core/src/main/scala/org/apache/spark/util/ShutdownHookManager.scala#L152) via the Hadoop `ShutdownHookManager`:

```
def addShutdownHook(priority: Int)(hook: () => Unit): AnyRef = {
```

Leveraging the Hadoop `ShutdownHookManager` as Spark does is feasible.

### Solution 3 - Recommended
Provide a method for the user to remove the hook and re-register it in a custom shutdown hook manager.

Signed-off-by: Chong Gao <res_life@163.com>

Authors:
  - Chong Gao (https://github.com/res-life)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #11161
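To make the recommended direction concrete, here is a minimal, self-contained sketch of a custom shutdown hook manager. The names, priorities, and hook bodies (`OrderedShutdownHooks`, `PluginShutdownSketch`, the leak-check `Runnable`) are hypothetical illustrations, not the actual cudf or spark-rapids APIs; it only demonstrates the idea that once the leak-check hook is removed from the JVM's own hook set and re-registered in a single ordered manager, it can no longer race with the pool close.

```scala
import scala.collection.mutable

// Hypothetical ordered shutdown hook manager (not project code).
object OrderedShutdownHooks {
  private case class Entry(priority: Int, task: Runnable)
  private val entries = mutable.ArrayBuffer[Entry]()

  // Exactly one real JVM shutdown hook; it drains the queue from highest to
  // lowest priority, so the relative order of the tasks is deterministic.
  Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
    override def run(): Unit = entries.synchronized {
      entries.sortBy(e => -e.priority).foreach { e =>
        try e.task.run() catch { case t: Throwable => t.printStackTrace() }
      }
    }
  }))

  def register(priority: Int, task: Runnable): Unit = entries.synchronized {
    entries += Entry(priority, task)
  }
}

object PluginShutdownSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical: a leak-check Runnable obtained after removing the
    // library's default shutdown hook.
    val leakCheck = new Runnable {
      override def run(): Unit = println("leak check runs after the pool is closed")
    }
    // Plugin shutdown (which closes the pools) gets the higher priority...
    OrderedShutdownHooks.register(100, new Runnable {
      override def run(): Unit = println("plugin shutdown: close RapidsHostMemoryStore.pool")
    })
    // ...and the re-registered leak check runs afterwards.
    OrderedShutdownHooks.register(10, leakCheck)
  }
}
```

Because the manager registers only one JVM-level hook, the relative order of the plugin shutdown and the leak check is fixed by the priorities rather than by how the JVM schedules concurrent hook threads.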
Describe the bug
Some test cases leak resources even though they pass; it would be better to fix the tests.
There are many `WAS LEAKED` error logs when running the test cases. These error logs may mask real leaks in the product code.
The following suites have leak issues (the list may not be complete).
The leak check happens in `MemoryCleaner`. This is actually not a leak, but the leak checking in `MemoryCleaner` runs before `RapidsHostMemoryStore.pool` is closed.
Steps/Code to reproduce bug
mvn test -Dbuildver=330 -DwildcardSuites=com.nvidia.spark.rapids.ConditionalsSuite
Expected behavior
Test against all the Spark versions to guarantee no resources are leaked.
Make sure there are no leak messages when running `mvn test`.
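One way to sanity-check this expectation outside the build (a hypothetical helper; the log path is an assumption, not an existing project file) is to scan the captured `mvn test` output and fail when any leak message is present:

```scala
import scala.io.Source

// Hypothetical post-run check: exits non-zero if the captured test log
// contains any "WAS LEAKED" messages.
object LeakLogCheck {
  def main(args: Array[String]): Unit = {
    // Assumed location of the redirected mvn test output.
    val logPath = if (args.nonEmpty) args(0) else "target/mvn-test.log"
    val src = Source.fromFile(logPath)
    try {
      val leaked = src.getLines().filter(_.contains("WAS LEAKED")).toList
      if (leaked.nonEmpty) {
        leaked.foreach(System.err.println)
        sys.exit(1) // fail the check when any leak message was logged
      }
    } finally {
      src.close()
    }
  }
}
```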