
optionally close orphaned NDArrays using Java garbage collection #2273

Closed
wants to merge 37 commits

Conversation

enpasos
Contributor

@enpasos enpasos commented Dec 31, 2022

Description

This PR is a solution suggestion to #2210.

The following figure shows a striking example of the impact (to reproduce it, see #2210):

[figure]

Design considerations: DJLDiscussionInputVersion4.pdf

Opt-in: SwitchGarbageCollection.on();

The implementation here is done for PyTorch, but the solution approach is general.
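
A minimal usage sketch of the opt-in (assumptions: the SwitchGarbageCollection class added by this PR, whose package import is omitted here, plus the standard DJL NDManager/NDArray API):

import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDManager;

public class GcOptInExample {
    public static void main(String[] args) {
        // Opt in once, e.g. at application startup (class added by this PR).
        SwitchGarbageCollection.on();

        try (NDManager manager = NDManager.newBaseManager()) {
            NDArray a = manager.create(new float[] {1f, 2f, 3f});
            NDArray b = a.mul(2);
            // When a and b become unreachable, the registered proxies allow their
            // native memory to be released after a later GC, even while the
            // manager is still open.
        }
    }
}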

@codecov-commenter

codecov-commenter commented Dec 31, 2022

Codecov Report

Base: 72.08% // Head: 74.14% // Increases project coverage by +2.05% 🎉

Coverage data is based on head (6cf606e) compared to base (bb5073f).
Patch coverage: 72.26% of modified lines in pull request are covered.

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #2273      +/-   ##
============================================
+ Coverage     72.08%   74.14%   +2.05%     
- Complexity     5126     6738    +1612     
============================================
  Files           473      665     +192     
  Lines         21970    29333    +7363     
  Branches       2351     3033     +682     
============================================
+ Hits          15838    21750    +5912     
- Misses         4925     6109    +1184     
- Partials       1207     1474     +267     
Impacted Files Coverage Δ
api/src/main/java/ai/djl/modality/cv/Image.java 69.23% <ø> (-4.11%) ⬇️
...rc/main/java/ai/djl/modality/cv/MultiBoxPrior.java 76.00% <ø> (ø)
...rc/main/java/ai/djl/modality/cv/output/Joints.java 71.42% <ø> (ø)
.../main/java/ai/djl/modality/cv/output/Landmark.java 100.00% <ø> (ø)
...main/java/ai/djl/modality/cv/output/Rectangle.java 72.41% <0.00%> (ø)
...i/djl/modality/cv/translator/BigGANTranslator.java 21.42% <0.00%> (-5.24%) ⬇️
.../modality/cv/translator/ImageFeatureExtractor.java 0.00% <0.00%> (ø)
.../ai/djl/modality/cv/translator/YoloTranslator.java 27.77% <0.00%> (+18.95%) ⬆️
...ain/java/ai/djl/modality/cv/util/NDImageUtils.java 67.10% <0.00%> (+7.89%) ⬆️
api/src/main/java/ai/djl/modality/nlp/Decoder.java 63.63% <ø> (ø)
... and 607 more


@lanking520
Contributor

This approach is very similar to reference counting, which is widely used in C++. Here I'll just leave a reference implementation: https://github.com/almson/almson-refcount. I will take a look after the holiday.

Thanks for your contribution! It's a great effort to explore an alternative. But I would try benchmarking in concurrent mode to see what happens in a parallel block.

@lanking520
Contributor

Another concern is the implementation of the reference queue: it seems every operation involves a synchronous checkQueue operation. What is the impact if we frequently add and remove resources? What are the chances we will be blocked by that?

@lanking520
Contributor

In the proxyMaker implementation you used random UUID generation (a synchronized call) instead of picking up the uid. The UID is also unique across resources.

@enpasos
Contributor Author

enpasos commented Dec 31, 2022

Another concern is the implementation of the reference queue: it seems every operation involves a synchronous checkQueue operation. What is the impact if we frequently add and remove resources? What are the chances we will be blocked by that?

On the ReferenceQueue there are two fast methods, enqueue and poll. enqueue is called once per reference; poll is called on every method call. poll should normally be ultra fast: if there is no head on the queue, it returns immediately. Blocking could in principle occur in a multithreaded environment for the time it takes to handle all the references that are on the queue at a particular method call. But are the NDArrays for PyTorch thread-safe? I have the following comment on #2210 in mind:
[screenshot of the referenced #2210 comment]
To be on the safe side, shouldn't there in any case be a synchronizing facade in front of the PyTorch calls? And if so, there would be no extra blocking on the reference queue.
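
To make the enqueue/poll argument concrete, here is a minimal, self-contained sketch of the pattern (an illustration, not the PR code): weak references to the user-facing proxies are registered with a ReferenceQueue, and checkQueue drains it with the non-blocking poll:

import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class ReferenceQueueSketch {
    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    private final Map<Reference<?>, Runnable> cleanups = new ConcurrentHashMap<>();

    // Called once per proxy: the GC enqueues the weak reference when the proxy dies.
    void register(Object proxy, Runnable closeAction) {
        cleanups.put(new WeakReference<>(proxy, queue), closeAction);
    }

    // Called opportunistically; poll() returns null immediately when nothing was collected.
    void checkQueue() {
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            Runnable closeAction = cleanups.remove(ref);
            if (closeAction != null) {
                closeAction.run(); // e.g. close the orphaned native NDArray
            }
        }
    }
}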

@enpasos
Contributor Author

enpasos commented Dec 31, 2022

In the proxyMaker implementation you used random UUID generation (a synchronized call) instead of picking up the uid. The UID is also unique across resources.

I'm not sure I understood your concern correctly. It is not necessary to use a UUID here; an Integer, used correctly, would also do. This key just needs to be an object that is unique within the PtNDArrayProxyMaker. It gets garbage collected together with the proxy once the user holds no more references to the proxy, while the NDManager still holds a reference to the corresponding value PtNDArrayImpl. If your concern is about the particular UUID implementation, I could replace it with an Integer counter.

Maybe your intention is to use the uid = handle.toString() that is used inside PtNDArrayImpl: one can certainly use a copy of that String, but not the original String object, because it is referenced.

Edited 04.01.2023: I replaced the UUID with uid + "-" + counter. The uid from PtNDArrayImpl alone won't do, because the lifespans of the handle and the PtNDArrayImpl are not exactly the same.
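
A minimal sketch of such a composite key (hypothetical helper, not the PR code):

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: the concatenation always yields a fresh String, distinct
// from the uid String referenced by PtNDArrayImpl, and the counter keeps the key
// unique even if a native handle (and hence its uid) is reused later on.
final class ProxyKeySketch {
    private static final AtomicLong COUNTER = new AtomicLong();

    static String nextKey(String uid) {
        return uid + "-" + COUNTER.incrementAndGet();
    }
}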

@enpasos
Contributor Author

enpasos commented Jan 8, 2023

@frankfliu Thx a lot for looking into the suggested solution. I will try my best to solve the issues you mentioned tomorrow.

One comment on the threading issue: If I understand you right, it is necessary for the closing of an NDArray to be done by the same thread that created it. If so, I would replace the global queue with a ThreadLocal queue.

One more point for now: I must confess that in the heat of the moment during the last merges with the master branch I didn't pay attention to my memory leak test ... I'll try to fix that tomorrow as well.

@frankfliu
Contributor

ThreadLocal has its own issues. It's hard to clean up ThreadLocal memory, and we do see some customers create and destroy threads a lot.

@enpasos
Contributor Author

enpasos commented Jan 8, 2023

ThreadLocal has its own issues. It's hard to clean up ThreadLocal memory, and we do see some customers create and destroy threads a lot.

What would you suggest? Give it a try? ... I'll give it a try and keep your comment in mind.

@enpasos
Contributor Author

enpasos commented Jan 9, 2023

One more point for now: I must confess that in the heat of the moment during the last merges with the master branch I didn't pay attention to my memory leak test ... I'll try to fix that tomorrow as well.

I repaired the merge (everything merged except "A temporary solution to issue 2210 (#2304)"). The memory leak test works now. I will now try to solve the issues mentioned in the comments.

@enpasos
Contributor Author

enpasos commented Jan 9, 2023

ThreadLocal has its own issues. It's hard to clean up ThreadLocal memory, and we do see some customers create and destroy threads a lot.

What would you suggest? Give it a try? ... I'll give it a try and keep your comment in mind.

I changed the code to use ThreadLocal queues.
In PtNDArrayProxyMaker, a ThreadLocal holds the WeakHashMapWrappers for all threads.
A smoke test with 1000 threads did not reveal any problems.
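
A minimal sketch of the per-thread wiring (illustrative only; the PR's WeakHashMapWrapper is stood in for by a plain WeakHashMap to keep the example self-contained):

import java.util.Map;
import java.util.WeakHashMap;

// Each thread gets its own map, so orphaned NDArrays can be tracked and closed
// on the thread that created them, matching the threading constraint above.
final class PerThreadMapsSketch {
    private static final ThreadLocal<Map<Object, Object>> MAPS =
            ThreadLocal.withInitial(WeakHashMap::new);

    static Map<Object, Object> currentThreadMap() {
        return MAPS.get();
    }
}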

@frankfliu
Contributor

frankfliu commented Jan 11, 2023

I looked a bit more into this PR, and I don't think this solution can fly:

  1. Cleaning up an NDArray relies on calling other NDArrays' methods. If the code only creates NDArrays but doesn't call any operators, the NDArrays won't be closed.
  2. The test code doesn't really work:
    1. mapSize() doesn't work as expected: when GC happens, the size isn't updated.
    2. checkQueue() is not actually called; it is only triggered because the code manually invokes mapSize(), which won't happen in the real world.

One workaround could be to add an NDManager.gc() function and let the user call it manually when needed; it would scan the current thread's orphaned NDArrays and close them. But I personally don't like the NDManager.gc() solution. A better alternative would be to implement an NDScope class:

try (NDScope scope = new NDScope()) {
     // Create NDArrays
} // NDArrays created in this scope will be closed automatically
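
A minimal sketch of what such an NDScope could look like (an illustration of the suggestion only, not an existing DJL class at the time of this comment; the NDArray creation path would still need to call NDScope.register):

import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.Set;

import ai.djl.ndarray.NDArray;

// Illustration: a thread-local stack of scopes; every NDArray registered while a
// scope is open is closed when that scope closes.
public final class NDScope implements AutoCloseable {
    private static final ThreadLocal<Deque<NDScope>> SCOPES =
            ThreadLocal.withInitial(ArrayDeque::new);

    private final Set<NDArray> resources = Collections.newSetFromMap(new IdentityHashMap<>());

    public NDScope() {
        SCOPES.get().addFirst(this);
    }

    // Would be called from the NDArray creation code to register into the innermost scope.
    public static void register(NDArray array) {
        NDScope scope = SCOPES.get().peekFirst();
        if (scope != null) {
            scope.resources.add(array);
        }
    }

    @Override
    public void close() {
        SCOPES.get().remove(this);
        resources.forEach(NDArray::close);
    }
}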

@enpasos
Contributor Author

enpasos commented Jan 11, 2023

The test code doesn't really work:
1. mapSize() doesn't work as expected: when GC happens, the size isn't updated.
2. checkQueue() is not actually called; it is only triggered because the code manually invokes mapSize(), which won't happen in the real world.

Looks like you looked at ai.djl.pytorch.integration.gc.Main. This class was actually just a very basic smoke test for me when setting up the solution initially. I would rather delete it from this PR and set up a test suite (#2290) following the tests in #2210.

However, the ai.djl.pytorch.integration.gc.Main still works as expected for me:
When I run it I see

> Task :engines:pytorch:pytorch-engine:Main.main()
[main] INFO ai.djl.pytorch.engine.PtEngine - Number of inter-op threads is 8
[main] INFO ai.djl.pytorch.engine.PtEngine - Number of intra-op threads is 8
[main] INFO ai.djl.pytorch.integration.gc.Main - number of NDArrays in NDManager hierarchy 3
[main] INFO ai.djl.pytorch.integration.gc.Main - number of NDArrays in map used of gc triggered NDArray closing 3
[main] INFO ai.djl.pytorch.integration.gc.Main - reference exists ...
[main] INFO ai.djl.pytorch.integration.gc.Main - weakHashMap size: 3
[main] INFO ai.djl.pytorch.integration.gc.Main - no reference exists, but likely not yet garbage collected ...
[main] INFO ai.djl.pytorch.integration.gc.Main - weakHashMap size: 3
\--- NDManager(e8ac5d45a276, gpu(0)) resource count: 1
    \--- NDManager(e847df9e9799, gpu(0)) resource count: 1
        \--- NDManager(db5ef3d00e38, gpu(0)) resource count: 3
            \--- NDArray(2057224517712, Shape(1))
            \--- NDArray(2057224517184, Shape(1))
            \--- NDArray(2057224517632, Shape(1))
[main] INFO ai.djl.pytorch.integration.gc.Main - no reference exists, and likely garbage collected ...
[main] INFO ai.djl.pytorch.integration.gc.Main - weakHashMap size: 0
\--- NDManager(e8ac5d45a276, gpu(0)) resource count: 1
    \--- NDManager(e847df9e9799, gpu(0)) resource count: 1
        \--- NDManager(db5ef3d00e38, gpu(0)) resource count: 0
\--- NDManager(e8ac5d45a276, gpu(0)) resource count: 1
    \--- NDManager(e847df9e9799, gpu(0)) resource count: 0
\--- NDManager(e8ac5d45a276, gpu(0)) resource count: 0

This test starts by creating three NDArrays a, b, c. Because of SwitchGarbageCollection.on(), the PtNDArrayImpl instances are wrapped with dynamic proxies. The user's references actually point to the dynamic proxies, while the NDManager holds the PtNDArrayImpl. -> weakHashMap size: 3

When a, b, c are no longer referenced, the dynamic proxies and the uid keys are ready to be garbage collected. However, we need to wait for the next minor GC. Therefore still -> weakHashMap size: 3

To see a GC, we do something we would not do in an application: we encourage GC and wait for 1 s. In this case checkQueue received three messages from GC and closed the corresponding NDArrays. -> weakHashMap size: 0
I completely agree that calling checkQueue from size() in this example is something that would not happen in real life.
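
For readers unfamiliar with dynamic proxies, here is a minimal sketch of the wrapping described above (illustrative only, not the PR's PtNDArrayProxyMaker): the user receives a proxy implementing the NDArray interface, while the real implementation remains reachable through the NDManager:

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

import ai.djl.ndarray.NDArray;

final class ProxySketch {
    // Returns a proxy that forwards every NDArray call to the real implementation.
    // Because user code only holds the proxy, its collection can be observed with a
    // weak reference while the implementation stays alive inside the manager.
    static NDArray wrap(NDArray impl) {
        InvocationHandler handler = (proxy, method, args) -> method.invoke(impl, args);
        return (NDArray) Proxy.newProxyInstance(
                NDArray.class.getClassLoader(), new Class<?>[] {NDArray.class}, handler);
    }
}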

If I find time today I will set up an example for the test suite.

For me, SwitchGarbageCollection.on(); makes a huge, positive difference in the application I run.

@enpasos
Contributor Author

enpasos commented Jan 11, 2023

...

1. Cleaning up an NDArray relies on calling other NDArrays' methods. If the code only creates NDArrays but doesn't call any operators, the NDArrays won't be closed.

One workaround could be to add an NDManager.gc() function and let the user call it manually when needed; it would scan the current thread's orphaned NDArrays and close them. But I personally don't like the NDManager.gc() solution. A better alternative would be to implement an NDScope class:

try (NDScope scope = new NDScope()) {
     // Create NDArrays
} // NDArrays created in this scope will be closed automatically

If SwitchGarbageCollection.on() is not called, there could be NDArrays that are not closed: a leak. If you suffer from such a situation, SwitchGarbageCollection.on() could help by getting rid of the leaked NDArrays. I agree that the suggested solution relies on a method call to an NDArray (any one, but on the same thread) to do any cleanup. As you suggested, one could complement this 'automatic process' with a manual trigger that the user may call in case the 'automatic trigger' does not apply. I could add a suggestion for your NDManager.gc().

Do you think the NDScope approach works together with the GC?
To me it looks like a different approach ... would it fit together with the existing NDManager hierarchy?

@enpasos
Contributor Author

enpasos commented Jan 11, 2023

I added a method gc() to NDManager that explicitly calls checkQueue on the WeakHashMapWrapper. Just to show this function, I added a small Main2 corresponding to Main. Both classes should be deleted. A test suite (#2290) should provide realistic test cases.

@enpasos
Contributor Author

enpasos commented Jan 11, 2023

As a starting point for a test suite (#2290), I pasted some examples from the DJL docs into

git clone https://github.com/enpasos/djlperformancetesting.git

and tested for duration and memory:

[figure: duration and memory measurements]

@enpasos
Contributor Author

enpasos commented Jan 12, 2023

I thought about your

A better alternative would be to implement an NDScope class:

try (NDScope scope = new NDScope()) {
     // Create NDArrays
} // NDArrays created in this scope will be closed automatically

I think you could have this approach in principle without GC and without the NDManager hierarchy.
You could add a mechanism to move an object from one NDScope to another enclosing or nested NDScope.
You could have full control over the lifecycle of the part of the NDArray that lives on the device.
One price you pay for this is the clash with variable scope in Java. Suppose you have a long-lived variable, and inside an NDScope you assign a new NDArray to it; the array is automatically attached to the scope while you are doing a calculation in which you think of the variable's value rather than the object. When you exit the NDScope, this new NDArray is closed; the Java object NDArray still exists, and the variable points to it. The next time you use the variable, expecting to work with its value, you get the equivalent of a NullPointerException. The compiler would not help you find such errors. Maybe it is still better than getting memory leaks in running applications.
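
A minimal illustration of this pitfall (assuming an NDScope that automatically registers newly created NDArrays, as in the sketch further up this thread):

import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDManager;

public class ScopePitfallSketch {
    public static void main(String[] args) {
        try (NDManager manager = NDManager.newBaseManager()) {
            NDArray longLived;
            try (NDScope scope = new NDScope()) {
                // Assigned to a variable that outlives the scope.
                longLived = manager.create(new float[] {1f, 2f, 3f}).mul(2);
            } // scope closes: the native memory behind longLived is released here

            // The Java object still exists and the variable still points to it,
            // but its native resource is gone; using the value now fails at
            // runtime, much like a NullPointerException.
            longLived.sum();
        }
    }
}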

@frankfliu
Contributor

@enpasos

  1. I don't like nondeterministic behavior; it gives developers the false impression that it works like GC.
  2. NDManager.gc() only looks better to me, but it is also a bit misleading: it only works for the current thread, and that behavior needs to be documented.
  3. NDScope can have nested NDScopes; this behavior has been implemented in JavaCPP's PointerScope and it works pretty nicely.
  4. NDScope can be combined with the NDManager hierarchy.
  5. However, NDScope is somewhat redundant with the NDManager.gc() solution; they can co-exist, but that will cause confusion.

@enpasos
Contributor Author

enpasos commented Jan 12, 2023

@frankfliu Thx a lot for looking into this option. I totally agree that deterministic behavior is very desirable.

@enpasos
Contributor Author

enpasos commented Jan 14, 2023

I am closing this PR in favour of the deterministic solution #2321.
