memory leak on dataset iteration #2289

Closed
enpasos opened this issue Jan 4, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@enpasos
Contributor

enpasos commented Jan 4, 2023

Description

On running the FashionMnist example from the DJL docs I see a GPU memory leak of about 503 bytes per dataset iteration.
[Image: plot of GPU memory growth per epoch.]

I see this increase even when the batch loop is reduced to bare iteration with no other work per batch (a minimal sketch of such a loop follows below).
I see the leak both without and with the suggested fix #2273 to clean up orphaned NDArrays.
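
For reference, here is a minimal sketch of the kind of bare iteration loop I mean. This is an illustration only, assuming the usual DJL FashionMnist setup from the docs; batch size and epoch count are arbitrary, and the actual reproducer is in the repository linked below.

import ai.djl.basicdataset.cv.classification.FashionMnist;
import ai.djl.ndarray.NDManager;
import ai.djl.training.dataset.Batch;
import ai.djl.training.dataset.Dataset;

public class LeakRepro {
    public static void main(String[] args) throws Exception {
        // Build the FashionMnist dataset as in the DJL docs example.
        FashionMnist dataset =
                FashionMnist.builder()
                        .optUsage(Dataset.Usage.TRAIN)
                        .setSampling(32, true) // batch size 32, shuffled
                        .build();
        dataset.prepare();

        try (NDManager manager = NDManager.newBaseManager()) {
            for (int epoch = 0; epoch < 20; epoch++) {
                // Bare iteration: fetch each batch and close it immediately,
                // doing nothing else. GPU memory still grows per iteration.
                for (Batch batch : dataset.getData(manager)) {
                    batch.close();
                }
            }
        }
    }
}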

Expected Behavior

No memory leak.

How to Reproduce?

I set up a toy app based on the DJL fashion mnist example to reproduce the problem:

git clone https://github.com/enpasos/reproducebug4.git
cd reproducebug4
gradlew build
java -jar app/build/libs/app-0.0.1-SNAPSHOT.jar

To further localize the cause:

git clone https://github.com/enpasos/reproducebug4.git
cd reproducebug4
git checkout localizing_memory_leak
gradlew build
java -jar app/build/libs/app-0.0.1-SNAPSHOT.jar

What have you tried to solve it?

I have been looking for the cause but have not found it yet.

Environment Info

  • GPU: NVIDIA Quadro RTX 5000
  • CPU: Intel Xeon E-2286M
  • RAM: 64 GB
  • OS: Windows 11
  • GPU Driver: 527.27
  • CUDA SDK: 11.7.1
  • CUDNN: cudnn-windows-x86_64-8.7.0.84_cuda11
  • Java: Corretto-17.0.3
  • DJL: 0.21.0-SNAPSHOT (04.01.2023)
  • PYTORCH: 1.13.1
@enpasos enpasos added the bug Something isn't working label Jan 4, 2023
@KexinFeng
Contributor

Is it the same issue as #2210? Is it solved after applying the patch #2232?

@enpasos
Contributor Author

enpasos commented Jan 6, 2023

Is it solved after applying the patch #2232?

To check, I took your latest #2232

git clone https://github.com/KexinFeng/djl.git
cd djl
gradlew build -x test
gradlew publishToMavenLocal

and ran the code to reproduce the bug against

git clone https://github.com/enpasos/reproducebug4.git
cd reproducebug4
git checkout localizing_memory_leak
gradlew build
java -jar app/build/libs/app-0.0.1-SNAPSHOT.jar

but I still see the same behaviour:

[main] INFO com.enpasos.bugs.Main - ###################################################
[main] INFO com.enpasos.bugs.Main - memory leak of about 503 Bytes/epoch/batch
[main] INFO com.enpasos.bugs.Main - ###################################################

@enpasos
Contributor Author

enpasos commented Jan 6, 2023

Is it the same issue as #2210?

It is the same problem area. The impact of the behaviour from #2210, reproduced by

git clone https://github.com/enpasos/reproducebug2.git
cd reproducebug2
gradlew build
java -jar app/build/libs/app-0.0.1-SNAPSHOT.jar

is eliminated by the proposed solution #2273. (I am not using the word solved here, because the suggested solution cleans up the garbage; in an ideal solution there would be no garbage to clean up.)

However, the behaviour reported here shows up even after applying the patch #2273.
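
As background, the cleanup in #2273 concerns NDArrays that outlive their intended scope. The following is only a minimal sketch of the general DJL pattern for avoiding such orphans (it is not the content of the patch; shapes and values are arbitrary):

import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDManager;
import ai.djl.ndarray.types.Shape;

public class ScopedResources {
    public static void main(String[] args) {
        try (NDManager manager = NDManager.newBaseManager()) {
            // Attach temporaries to a short-lived sub-manager so their native
            // (GPU) memory is released deterministically when it closes.
            try (NDManager sub = manager.newSubManager()) {
                NDArray tmp = sub.ones(new Shape(2, 2));
                NDArray result = tmp.mul(2);
                // Re-attach result to the parent manager so it survives this
                // scope; tmp is freed when sub closes.
                result.attach(manager);
            }
        }
    }
}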

@KexinFeng
Contributor

KexinFeng commented Jan 6, 2023

@enpasos I think I have found the likely root cause. FashionMnist extends ArrayDataset, and iteration over that dataset uses the new advanced indexing feature, introduced in #1869, as an efficiency optimization.

Advanced indexing had a memory leak, which is now fixed in #2300, so that is the likely root cause. If you apply that patch, the memory leak should be gone.
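
For illustration, advanced indexing here means gathering a batch of rows with an index NDArray, roughly as in the sketch below (shapes and values are arbitrary; this is not the ArrayDataset internals):

import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDManager;
import ai.djl.ndarray.index.NDIndex;
import ai.djl.ndarray.types.Shape;

public class AdvancedIndexingSketch {
    public static void main(String[] args) {
        try (NDManager manager = NDManager.newBaseManager()) {
            // Pretend "data" holds all samples; pick one batch by row indices.
            NDArray data = manager.randomUniform(0f, 1f, new Shape(60000, 28, 28));
            NDArray indices = manager.create(new long[] {3, 17, 42});
            // Advanced indexing via an index NDArray ("{}" placeholder);
            // this is the code path where the leak was fixed in #2300.
            NDArray batch = data.get(new NDIndex("{}", indices));
            System.out.println(batch.getShape()); // (3, 28, 28)
        }
    }
}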

@enpasos
Contributor Author

enpasos commented Jan 7, 2023

Congrats on eliminating the root cause of this memory leak! Very nice :-)
I ran the test case and there is no more memory leak here.

@enpasos enpasos closed this as completed Jan 7, 2023