Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implements NDScope to automatically close NDArray in the scope #2321

Merged
merged 10 commits into from
Jan 26, 2023

Conversation

enpasos
Copy link
Contributor

@enpasos enpasos commented Jan 14, 2023

This PR introduces - as suggest by @frankfliu - reference counting to prevent or get rid of memory leaks on the native side of NDArrays. As suggested by @frankfliu the implementation is derived from the well established reference counting code of JavaCPP. It is a deterministic approach and does not rely on Java garbage collection as the approach #2273.

The user interface - in this first attempt - is already easy to use thanks to the design by JavaCPP's author Samuel Audet:

  • the feature is opt-in by design
  • it has a low-level UI - here in RCObject (= reference counted object).
  • and a high-level UI - here in RCScope (= reference counter scope).

The approach works loosely together with the NDManager hierarchy - with potential for closer integration.

At the moment the feature is implemented for the PyTorch NativeResource PtNDArray. The extension to other NativeResources should be simple.

@enpasos
Copy link
Contributor Author

enpasos commented Jan 14, 2023

This PR can be used to solve the problem #2210. As an illustration:

grafik
Example App1 in implementing introductory DJL Doc Example

git clone https://github.com/enpasos/djlperformancetesting.git
cd djlperformancetesting
git checkout refcount

To repair the memory leak (red situation in the picture) the only code that had to be changed is introducing a RCScope (green situation in the picture) :

try (RCScope scope = new RCScope()) {
  // the training code running inside a loop
}

@KexinFeng The temporary solution #2304 that fixed the performance impact of this memory leak could potentially be removed after this PR.

@enpasos
Copy link
Contributor Author

enpasos commented Jan 14, 2023

Garbage collection and reference queues do not provide efficient real-time guarantee of unreachability, while reference-counting provides an alternative mechanism at the cost of slight inconvenience. [source ... from netty]

One cost of slight inconvenience: With the high level interface RCScope you may be lucky that you just have to set the RCScopes right and that's it. But if you are not so lucky a NDArray that you closed implicitly when leaving a RCScope might be needed later on. Fortunately you very likely recognize this at programming time when you run the code and get an exception from NativeResource when the handle is no more there: "Native resource has been released already". At this point the user needs some help to identify the particular NDArray that is needed later on. One option would be to allow him to start the application again with getting some more info, such as the stacktrace of the NDArray at its creation time ..(edited 17.01.2023) and the closing time. ... (edited 15.01.2023 after implementing a first solution that I practically used so solve such a problem in an application) Setting the following switch at the start of your app you get the needed info.

RCConfig.setVerboseIfResourceAlreadyClosed(true);

@codecov-commenter
Copy link

codecov-commenter commented Jan 15, 2023

Codecov Report

Base: 72.08% // Head: 74.36% // Increases project coverage by +2.27% 🎉

Coverage data is based on head (d0530d0) compared to base (bb5073f).
Patch coverage: 75.21% of modified lines in pull request are covered.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #2321      +/-   ##
============================================
+ Coverage     72.08%   74.36%   +2.27%     
- Complexity     5126     6773    +1647     
============================================
  Files           473      664     +191     
  Lines         21970    29403    +7433     
  Branches       2351     3053     +702     
============================================
+ Hits          15838    21866    +6028     
- Misses         4925     6048    +1123     
- Partials       1207     1489     +282     
Impacted Files Coverage Δ
api/src/main/java/ai/djl/modality/cv/Image.java 69.23% <ø> (-4.11%) ⬇️
...rc/main/java/ai/djl/modality/cv/MultiBoxPrior.java 76.00% <ø> (ø)
...rc/main/java/ai/djl/modality/cv/output/Joints.java 71.42% <ø> (ø)
.../main/java/ai/djl/modality/cv/output/Landmark.java 100.00% <ø> (ø)
...main/java/ai/djl/modality/cv/output/Rectangle.java 72.41% <0.00%> (ø)
...i/djl/modality/cv/translator/BigGANTranslator.java 21.42% <0.00%> (-5.24%) ⬇️
.../modality/cv/translator/ImageFeatureExtractor.java 0.00% <0.00%> (ø)
.../ai/djl/modality/cv/translator/YoloTranslator.java 27.77% <0.00%> (+18.95%) ⬆️
...ain/java/ai/djl/modality/cv/util/NDImageUtils.java 67.10% <0.00%> (+7.89%) ⬆️
api/src/main/java/ai/djl/modality/nlp/Decoder.java 63.63% <ø> (ø)
... and 608 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@enpasos
Copy link
Contributor Author

enpasos commented Jan 15, 2023

Performance looks good now. I have looked for hot spots using JProfiler. The following screenshot from an example application that heavily uses the reference counting code (by using the high-level UI).

grafik

grafik

grafik

None of the methods that consume more than 1% of the time belong to the reference counting code.
And no more memory leak.

@enpasos enpasos marked this pull request as ready for review January 15, 2023 15:56
@enpasos enpasos marked this pull request as draft January 15, 2023 15:58
@enpasos enpasos marked this pull request as ready for review January 15, 2023 16:53
@KexinFeng
Copy link
Contributor

KexinFeng commented Jan 16, 2023

@KexinFeng The temporary solution #2304 that fixed the performance impact of this memory leak could potentially be removed after this PR.

Thanks for this contribution. Very cool idea. Also thanks for mentioning this. But the issue PR #2304 originates from the issue #2024 regarding the setting of gradient accumulation: add or overwrite. Regarding the subsequent issue #2210 you opened, the inefficiency there is due to that the searching scope is too large. So probably this PR still wouldn't heal it.

@enpasos
Copy link
Contributor Author

enpasos commented Jan 16, 2023

@KexinFeng Thx for looking into it. I really love how this reference counting deterministically works and solves the memory leak problem ... but the idea comes from your team: @lanking520 referenced almson-refcount and @frankfliu referenced the JavaCPP with the easy to use Scope approach. The approach can work completely without relying on GC and therefore likely avoids problems @zachgk mentioned.

Concerning PR #2304 ... you are the expert. I just wanted to mention it because I had the impression that the too large searching scope did not hurt until the memory-leaking NDArrays came into play.

@KexinFeng
Copy link
Contributor

@enpasos A quick question about

This PR can be used to solve the problem #2210. As an illustration: [picture]

does this PR also solves the efficiency problem that was found in that same issue #2210?

If searching scope being too large were really the root cause, then probably the efficiency issue would still be there. But if not, there might be a different root cause.

@enpasos
Copy link
Contributor Author

enpasos commented Jan 17, 2023

@enpasos A quick question about

This PR can be used to solve the problem #2210. As an illustration: [picture]

does this PR also solves the efficiency problem that was found in that same issue #2210?

If searching scope being too large were really the root cause, then probably the efficiency issue would still be there. But if not, there might be a different root cause.

@KexinFeng Thx for asking. I didn't test it before ... and with the test I ran into a bug in the PtNDArrayDeallocator. After fixing this bug I see

grafik
On the left picture: red is covered by blue. On the right picture green almost covers blue

My interpretation: With the "temporary solution" the performance impact is fixed but the causing memory leak is still there. With "reference counting but not temporary solution" the memory leak is fixed and the caused performance impact is fixed, too. The performance numbers of "temporary solution" are slightly better then "reference counting but not temporary solution", so "reference counting with temporary solution" should give the best in terms of memory leak and performance impact. But if there are other reasons you would like to remove the "temporary solution" again, it is not strictly needed to solve issue #2210.

@KexinFeng
Copy link
Contributor

KexinFeng commented Jan 17, 2023

Thanks so much for these test results! It looks like reference counting without temporary solution indeeds solves the issue, the reason of which probably is not known until the true root cause of the previous issue is found.

From these test results, there are also other bizzare issues. The first is that, temporary the solution #2304 is nothing but the revertion of the features PR 2111 and PR2101. So the results indicate that DJL has memory leak issues already before PR 2111 and PR2101. One check is whether it is due to the PtNDArray indexing memory leak issue, which was solved later than this in #2300. On the other hand, if this was indeed the root cause, or if anyother root cause were indentified, then how the reference counting featue solves these issues would be an interesting example to demonstrate its advantage.

The second is that from the efficiency plot, it looks like the global searching doesn't necessarily increase the duration. Probably it is relavant to memory leak, so when reference counting feature is applied, it is solved at the same time.

@enpasos
Copy link
Contributor Author

enpasos commented Jan 17, 2023

@KexinFeng The introductory DJL Doc Example as I put it together as an example is for my understanding a from scratch and not a very sophisticated approach - but the memory leaking mistakes made there are on the level I make them - as an example for a DJL user. One "memory leak" there: the params belong to the NDManager manager just below the SystemManager. When the stochastic gradient descent function inside the training loop is called there are intermediate result NDArrays created - such as param.getGradient().mul(lr) that also belong to manager. They are no more used but they are not closed. And it happens again and again on each training loop. With the reference counting scope approach this is easy to solve.

I think this memory leak is not related to the indexing memory leak issue, solved in #2300

@KexinFeng
Copy link
Contributor

Hi @enpasos, thanks again for your very nice contribution. We have had several team meetings to discuss your PRs, including this one and the preceding one #2273, and we were inspired by your memory management idea and the relevant Java language knowledge. More specifically, we do agree that your implementation of using RCScope will work. And, most importantly, this feature indeed frees the users from worrying about which NDManager to attach when doing NDArray computations, which is a great improvement. On the other hand, to keep DJL elegant, we have also simplified your implementation. It turns out that the essential implementation is shown in the following two files: https://gist.github.com/KexinFeng/2ad44daf14f77bc45c7b4c7fd8cf8f33. And since we have NDManager already, the reference count won't be essential, thus won't be kept. So this means that a big chunk of your implementation will not be included in this simplified version.

So can we propose the following? We can force-push our simplified version to this PR, so that the feature is credited to you. But before this, you can first backup your current implementation, and if you still want, you can create a new PR and put them there. What do you think?

@enpasos
Copy link
Contributor Author

enpasos commented Jan 26, 2023

Perfect! Many thanks to @KexinFeng, @frankfliu and the DJL team! The solution you found is just right to completely solve my memory leak problems. It is elegant per se and in its integration with DJL.

@enpasos
Copy link
Contributor Author

enpasos commented Jan 26, 2023

Garbage collection and reference queues do not provide efficient real-time guarantee of unreachability, while reference-counting provides an alternative mechanism at the cost of slight inconvenience. [source ... from netty]

One cost of slight inconvenience: With the high level interface RCScope you may be lucky that you just have to set the RCScopes right and that's it. But if you are not so lucky a NDArray that you closed implicitly when leaving a RCScope might be needed later on. Fortunately you very likely recognize this at programming time when you run the code and get an exception from NativeResource when the handle is no more there: "Native resource has been released already". At this point the user needs some help to identify the particular NDArray that is needed later on. One option would be to allow him to start the application again with getting some more info, such as the stacktrace of the NDArray at its creation time ..(edited 17.01.2023) and the closing time. ... (edited 15.01.2023 after implementing a first solution that I practically used so solve such a problem in an application) Setting the following switch at the start of your app you get the needed info.

RCConfig.setVerboseIfResourceAlreadyClosed(true);

I am missing the feature to find a specific NDArray when I get the error "Native resource has been released already". As this is a feature, I would follow @KexinFeng's suggestion to add a small PR when this PR is on the master.

@enpasos
Copy link
Contributor Author

enpasos commented Jan 26, 2023

When I first tested your solution I took the two files from KexinFeng's link and hooked them in by NDScope.register from the constructors of PtNDArray. Then I run my application and :-) ... no exceptions and the memory leak was away. Then I recognized there already was the integrated solution pushed. I checked it out, ran with my application into minor bugs which I fixed. But then I recognized that the memory leak was back again.

The leak was in the detach Method of PtNDArray - somehow the NDScope.unregister in this method was causing the leak. Removing the NDScope.unregister the leak is away. What happened? ... maybe I was attaching the NDArray to a different NDManager, the detach on the old NDManager was called internally and the NDArray was unregistered from the NDScope and on closing the NDScope the NDArray was not closed.

To really get rid of memory leaks where the user has no clue where they come from, it is in my opinion crucial that NDScope closes the NDArrays created inside it, no matter what is done inside NDScope, unless NDScope.unregister is explicitly called.

So I think the NDScope.unregister must not be in the detach method. I hope it is ok that I have removed it from there.

}

detached = manager.create(new int[] {1});
detached.detach(); // detached from NDManager
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. In most cases, when NDArray.detach() is called, user want to return the NDArray from the function.
  2. And usually, NDManager is at the out side if NDScope

Which means user has to call both function in most of time. And we documented that once NDArray is detached, user must manually close it.

So I feel we should unregister from the NDScope when NDArray.detach() is called.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we are abusing NDArray.detach() in many internal place (attache(), they should only detach from NDManager only. Introducing two detach(boolean) function seems too much. Requires user to call both should be fine for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is your intention to introduce NDScope?

Copy link
Contributor Author

@enpasos enpasos Jan 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention would be: Implement it because you need it to close a memory leak that you cannot solve otherwise. So I would not use NDScope if I did not have to. But if I have a leak - which I unfortunately do - I would put the NDScope where I think no NDArray should leak through - no matter if inside or outside of NDManager and no matter how they are used. And in that case I would not want NDArrays to leak through for whatever reason.

Copy link
Contributor Author

@enpasos enpasos Jan 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we are abusing NDArray.detach() in many internal place (attache(), they should only detach from NDManager only. Introducing two detach(boolean) function seems too much. Requires user to call both should be fine for now.

I agree that if there was a convenient method where the user knew exactly that he/she was also unregistering from NDScope, there should be no problem.

Copy link
Contributor Author

@enpasos enpasos Jan 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we leave it as it is for this PR?

if (((NDArray) obj).isReleased()) {
return this == obj;
}
if (((NDArray) obj).getManager() != manager) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

equals() compares the content of two NDArray, if their NDManager are different, their content still can be equals

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NDScope.unregister calls resources.remove which uses equals to identify the object.
I got an exception from unregister in two cases: one object (not the one to be unregistered) in NDScope was on a different Device and one was already closed. The lines I added - maybe not perfect - were intended to solve both cases.

Copy link
Contributor

@frankfliu frankfliu Jan 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, we should remove the NDArray by it's uid

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, we should remove the NDArray by it's uid

I think this is not save because the uid from the handle is only unique as long as the handle lives. If you create a uuid by uid plus handleCreationTimestamp it would do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, uid meant to be unique, but it's possible have a clash if the NDArray is closed and the same memory get reused. We should add a timestamp to it.

Copy link
Contributor Author

@enpasos enpasos Jan 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problems I experience where Exceptions coming from the equals-Method. That is not part of the equals method contract - it should answer yes or no.
I agree that if you want to have the contract on NDScope on the same instance of the NDArray - a clear contract - you need to solve it in NDScope.
A simple wrapper of NDArray would do, which is only used inside NDScope to put into the List. It should just inherit the equals implementation from Object to detect the instance on removing from the List.

Copy link
Contributor

@frankfliu frankfliu Jan 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's separate in different issues:

  1. NDScope should track the object itself, the ArrayList might not be the right collection, we can use IdentityHashMap to resolve this problem
  2. uid is not unique. we can solve this problem in a different PR
  3. equals() function: isReleased() is a bug, we can fixed it here or in a different PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I comitted a suggestion, but I think it is not correct ... NDArrayWrapper and NDArrayList it a problem ... I will fix it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx. IdentityHashMap works nicely. Better solution than my wrapper suggestion ... and using NDArrayList with the wrapper is simply wrong.

Copy link
Contributor Author

@enpasos enpasos Jan 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Point 1. I think done.
Points 2. and 3. in a separate PR sounds good.

enpasos and others added 2 commits January 26, 2023 18:44
@frankfliu frankfliu changed the title WIP: reference counting Implements NDScope to automatically close NDArray in the scope Jan 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants