optionally close orphaned NDArrays using Java garbage collection #2273
# Conflicts: # engines/pytorch/pytorch-engine/src/main/java/ai/djl/pytorch/engine/PtNDArray.java
Codecov Report
Base: 72.08% // Head: 74.14% // Increases project coverage by
Additional details and impacted files
@@ Coverage Diff @@
## master #2273 +/- ##
============================================
+ Coverage 72.08% 74.14% +2.05%
- Complexity 5126 6738 +1612
============================================
Files 473 665 +192
Lines 21970 29333 +7363
Branches 2351 3033 +682
============================================
+ Hits 15838 21750 +5912
- Misses 4925 6109 +1184
- Partials 1207 1474 +267
This approach is very similar to reference counting, which is widely used in C++. Here is a reference implementation: https://github.com/almson/almson-refcount I will take a look after the holiday. Thanks for your contribution! It is a great step in exploring an alternative. I would also benchmark it in concurrent mode to see what happens in a parallel block. |
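For readers unfamiliar with the reference-counting idea mentioned above, here is a minimal Java sketch. It is not taken from the linked repository or from DJL; the class `RefCounted` and its `retain`/`count` methods are hypothetical names chosen for illustration. The native release step is modeled as a plain `Runnable`.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical reference-counted handle; illustrative only, not a DJL class. */
final class RefCounted implements AutoCloseable {
    // Starts at 1: the creator is the first owner.
    private final AtomicInteger refCount = new AtomicInteger(1);
    private final Runnable releaser;

    RefCounted(Runnable releaser) {
        this.releaser = releaser;
    }

    /** Adds one owner; fails if the resource was already released. */
    RefCounted retain() {
        while (true) {
            int current = refCount.get();
            if (current <= 0) {
                throw new IllegalStateException("already released");
            }
            if (refCount.compareAndSet(current, current + 1)) {
                return this;
            }
        }
    }

    /** Drops one owner; runs the releaser when the last owner is gone. */
    @Override
    public void close() {
        int remaining = refCount.decrementAndGet();
        if (remaining == 0) {
            releaser.run();
        } else if (remaining < 0) {
            throw new IllegalStateException("closed more often than retained");
        }
    }

    int count() {
        return refCount.get();
    }
}
```

Unlike the GC-based approach in this PR, the release point here is deterministic, but every additional owner has to call `retain()` and `close()` explicitly.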
Another concern is the reference queue implementation: it seems every operation involves a synchronous checkQueue call. What is the impact if we add and remove resources frequently? How likely is it that we get blocked by that? |
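To make the blocking question concrete, here is a small sketch of how a queue check of this kind can be written with `java.lang.ref.ReferenceQueue`. It is an assumption about the general technique, not the PR's actual code: `OrphanTracker` and `track` are hypothetical names, and only the method name `checkQueue` is borrowed from the discussion. `ReferenceQueue.poll()` returns immediately when nothing is queued, so a drain loop like this does not wait for the garbage collector.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical tracker; illustrative only, not the PR's implementation. */
final class OrphanTracker {
    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    // Keeps the reference objects themselves reachable and maps them to cleanups.
    private final Map<Reference<?>, Runnable> cleanups = new ConcurrentHashMap<>();

    /**
     * Registers a cleanup to run once owner is no longer strongly reachable.
     * The cleanup must not capture owner, or owner would never be collected.
     */
    WeakReference<Object> track(Object owner, Runnable cleanup) {
        WeakReference<Object> ref = new WeakReference<>(owner, queue);
        cleanups.put(ref, cleanup);
        return ref;
    }

    /** Drains the queue without blocking; returns the number of cleanups run. */
    int checkQueue() {
        int closed = 0;
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            Runnable cleanup = cleanups.remove(ref);
            if (cleanup != null) {
                cleanup.run();
                closed++;
            }
        }
        return closed;
    }
}
```

In this sketch the cost per call is one `poll()` when the queue is empty; whether that is acceptable under frequent add/remove traffic is exactly what a concurrent benchmark would need to show.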
In the proxyMaker implementation you generate a random UUID (a synchronized call) instead of using the uid. The uid is also unique across resources. |
On the |
I'm not sure if I got your concern right. It is not necessary to use a UUID here; an Integer, used correctly, would also do. This key just needs to be an object that is unique within the Maybe, if your intention is to use the UID edited 04.01.2023: I replaced the UUID by the uid+"-"+counter. Only the uid from |
@frankfliu Thx a lot for looking into the suggested solution. I will try my best to solve the issues you mentioned tomorrow. One comment on the threading issue: If I understand you right it is necessary to have the closing of an NDArray done by the same thread as the creation. If so I would replace the global Queue by a ThreadLocal Queue. One more point for now: I must confess that in the heat of the moment during the last merges with the master branch I didn't pay attention to my memory leak test ... I'll try to fix that tomorrow as well. |
ThreadLocal has its own issues. It's hard to clean up ThreadLocal memory. We do see some customers create and destroy threads a lot. |
What would you suggest? Give it a try? ... I'll give it a try and keep your comment in mind. |
I repaired the merge (everything is merged except "A temporary solution to issue 2210 (#2304)"). The memory leak test works now. I will now try to solve the issues mentioned in the comments. |
I changed the code to use threadLocal queues. |
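As an illustration of what per-thread queues could look like, here is a minimal sketch. It does not reproduce the PR's code; `ThreadLocalQueues`, `current` and `release` are hypothetical names. Each thread lazily gets its own `ReferenceQueue`, so references registered on a thread are also drained on that thread.

```java
import java.lang.ref.ReferenceQueue;

/** Hypothetical holder for per-thread reference queues; illustrative only. */
final class ThreadLocalQueues {
    // One queue per thread, created on first access from that thread.
    private static final ThreadLocal<ReferenceQueue<Object>> QUEUE =
            ThreadLocal.withInitial(ReferenceQueue::new);

    private ThreadLocalQueues() {}

    /** Returns the calling thread's queue. */
    static ReferenceQueue<Object> current() {
        return QUEUE.get();
    }

    /** Drops the calling thread's queue; intended to be called before the thread exits. */
    static void release() {
        QUEUE.remove();
    }
}
```

The `release()` method addresses the cleanup concern raised above only if callers remember to invoke it; applications that create and destroy many threads would still need a reliable place to call it.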
I looked a bit more into this PR, I don't think this solution can fly:
One workaround could be adding a
|
Looks like you looked at However, the
This test starts creating three NDArrays a,b,c. Because When a,b,c are no longer referenced the dynamic proxies and the uid key are ready to be garbage collected. However, we need to wait for the next minor GC. Therefore still -> To see a GC we do something we would not do in an application, we encourage GC and wait for 1 s. And in this case checkQueue got three messages from GC and closes the corresponding NDArrays. -> If I find time today I will set up an example for the testsuite. For me, |
If Do you think that the NDScope approach works together with the GC? |
… WeakHashMapWrapper
I added a method gc() to NDManager that explicitly calls checkQueue on WeakHashMapWrapper. Just to show this function, I added a small Main2 corresponding to Main. Both classes should be deleted. A test suite #2290 should provide realistic test cases. |
As a starting point for a test suite #2290 I pasted some examples from the DJL Docs into
and tested for duration and memory: |
I thought about your
I think you could have this approach in principle without GC and without NDManager hierarchy. |
|
@frankfliu Thx a lot for looking into this option. I totally agree that deterministic behavior is very desirable. |
I am closing this PR in favour of the deterministic solution #2321. |
Description
This PR is a solution suggestion to #2210.
The following figure shows a striking example of the impact (to reproduce, see #2210):
Design considerations: DJLDiscussionInputVersion4.pdf
Opt-in:
SwitchGarbageCollection.on();
The implementation here is done for PyTorch, but the solution approach is general.
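An opt-in switch of the kind shown above can be as small as a process-wide flag. The following is a sketch under that assumption, not the PR's `SwitchGarbageCollection` class; `GcSwitch`, `off` and `isOn` are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical process-wide opt-in flag; illustrative only. */
final class GcSwitch {
    // Off by default, so existing applications keep their current behavior.
    private static final AtomicBoolean ENABLED = new AtomicBoolean(false);

    private GcSwitch() {}

    static void on() {
        ENABLED.set(true);
    }

    static void off() {
        ENABLED.set(false);
    }

    static boolean isOn() {
        return ENABLED.get();
    }
}
```

Code that creates arrays would consult `GcSwitch.isOn()` once per creation and only register the array for GC-based closing when the flag is set.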