
Specify the block that PtGradientCollector belongs to #2232

Closed · wants to merge 1 commit

Conversation

@KexinFeng (Contributor) commented on Dec 15, 2022

The zeroGradients() call added in #2101 (to solve #2024) when constructing PtGradientCollector() hurts efficiency and memory usage, as shown in #2210. The block therefore has to be specified so that zeroGradients() is called only on that block's parameters, narrowing the search scope. Another relevant PR is #2111.

The original PtGradientCollector() is left unchanged for backward compatibility. A new PtGradientCollector(Block) is added so that the gradient collector is restricted to operating on a specific block; multiple such collectors can then coexist. @zachgk I added this point to the docs, please take a look. The relevant PR is #2111.

In essence, this PR disables zeroGradients() in the default case, unless a block is specified when creating a GradientCollector, because of the issues caused by the global search inside zeroGradients().
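A minimal usage sketch of the block-scoped collector (it assumes the newGradientCollector(Block) overload proposed in this PR; the Linear block and shapes are placeholders, not from the PR diff):

    // Sketch only: newGradientCollector(Block) is the overload proposed in this PR.
    try (NDManager manager = NDManager.newBaseManager()) {
        Block block = Linear.builder().setUnits(1).build();
        block.initialize(manager, DataType.FLOAT32, new Shape(1, 3));

        NDArray x = manager.create(new float[] {1f, 2f, 3f}, new Shape(1, 3));
        x.setRequiresGradient(true);

        // The collector is tied to `block`, so zeroGradients() only touches its parameters.
        try (GradientCollector gc = Engine.getInstance().newGradientCollector(block)) {
            NDArray y = x.mul(2);
            gc.backward(y);
        }
    }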


To fully solve issue #2024 in the Dive into Deep Learning book, how the gradient is accumulated should be specified in the future, either "write" or "add". "retain_graph" should also be fixed. Under certain conditions an error should be thrown, for example if retain_graph=false and backward is called repeatedly. This way the ambiguity in the definition of repeated backward calls, as shown in #2024, will be resolved.

In PyTorch, by default gradients are accumulated (added) across repeated backward calls when retain_graph=True:

    import torch as pt

    x1 = pt.tensor([1., 2., 3., 4.], requires_grad=True)
    x2 = pt.tensor([5., 6., 7., 8.], requires_grad=True)
    u = x1 * x2
    z = u * x1
    z.sum().backward(retain_graph=True)
    print("\n")
    print(x1.grad, x2.grad)

    z.sum().backward(retain_graph=True)
    print(x1.grad, x2.grad)

    z.sum().backward()
    print(x1.grad, x2.grad)

Output:

tensor([10., 24., 42., 64.]) tensor([ 1.,  4.,  9., 16.])
tensor([ 20.,  48.,  84., 128.]) tensor([ 2.,  8., 18., 32.])
tensor([ 30.,  72., 126., 192.]) tensor([ 3., 12., 27., 48.])
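For reference, these numbers match the analytic gradients: with z = x1² · x2, a single backward pass of z.sum() gives ∂/∂x1 = 2 · x1 · x2 = [10, 24, 42, 64] and ∂/∂x2 = x1² = [1, 4, 9, 16]; the second and third rows are exactly 2× and 3× these values, confirming that the gradients are added across the repeated backward calls.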

Note:

  • If backward needs to be called repeatedly, retain_graph=True should be set. This is not a commonly seen use case in practice. See also Test accumulating gradient collector #2111.
  • Block currently does not have a direct call feature like the following PyTorch snippet, but requires a ParameterStore (see the DJL sketch after the snippet).
    import torch
    import torch.nn as nn

    m = nn.Linear(20, 30)
    input = torch.randn(128, 20)
    output = m(input)
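For comparison, a rough DJL sketch of the same forward pass; the extra ParameterStore argument is what the note above refers to (the shapes mirror the PyTorch snippet):

    try (NDManager manager = NDManager.newBaseManager()) {
        Block m = Linear.builder().setUnits(30).build();
        m.initialize(manager, DataType.FLOAT32, new Shape(128, 20));

        NDArray input = manager.randomNormal(new Shape(128, 20));
        // No direct call on the block: forward goes through a ParameterStore.
        NDList output = m.forward(new ParameterStore(manager, false), new NDList(input), false);
    }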

Relevant Stack Overflow question:
https://stackoverflow.com/questions/70998652/djl-gradientcollector-try-with-resources-initialiser-error

@KexinFeng KexinFeng changed the title Temp patch that somewhat optimizes the original implementation in ter… Temp patch that somewhat optimizes the memory usage Dec 15, 2022
@KexinFeng KexinFeng changed the title Temp patch that somewhat optimizes the memory usage Temp patch that optimizes the memory usage Dec 15, 2022
@KexinFeng KexinFeng changed the title Temp patch that optimizes the memory usage Implementation that optimizes the memory usage Dec 15, 2022
@KexinFeng KexinFeng marked this pull request as ready for review December 15, 2022 18:50
@codecov-commenter commented on Dec 15, 2022

Codecov Report

Base: 72.08% // Head: 71.62% // Decreases project coverage by -0.46% ⚠️

Coverage data is based on head (b71f2ae) compared to base (bb5073f).
Patch coverage: 73.80% of modified lines in pull request are covered.

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #2232      +/-   ##
============================================
- Coverage     72.08%   71.62%   -0.47%     
- Complexity     5126     6373    +1247     
============================================
  Files           473      631     +158     
  Lines         21970    28172    +6202     
  Branches       2351     2998     +647     
============================================
+ Hits          15838    20179    +4341     
- Misses         4925     6523    +1598     
- Partials       1207     1470     +263     
Impacted Files Coverage Δ
api/src/main/java/ai/djl/modality/cv/Image.java 69.23% <ø> (-4.11%) ⬇️
...rc/main/java/ai/djl/modality/cv/MultiBoxPrior.java 76.00% <ø> (ø)
...rc/main/java/ai/djl/modality/cv/output/Joints.java 71.42% <ø> (ø)
.../main/java/ai/djl/modality/cv/output/Landmark.java 100.00% <ø> (ø)
...main/java/ai/djl/modality/cv/output/Rectangle.java 72.41% <0.00%> (ø)
...i/djl/modality/cv/translator/BigGANTranslator.java 21.42% <0.00%> (-5.24%) ⬇️
.../modality/cv/translator/ImageFeatureExtractor.java 0.00% <0.00%> (ø)
.../ai/djl/modality/cv/translator/YoloTranslator.java 27.77% <0.00%> (+18.95%) ⬆️
...ain/java/ai/djl/modality/cv/util/NDImageUtils.java 59.21% <0.00%> (ø)
api/src/main/java/ai/djl/modality/nlp/Decoder.java 63.63% <ø> (ø)
... and 564 more


@KexinFeng KexinFeng changed the title Implementation that optimizes the memory usage Specify the block that PtGradientCollector belongs to Dec 16, 2022
    try (GradientCollector gc = Engine.getInstance().newGradientCollector()) {
        NDArray b = a.mul(2);
        gc.backward(b);
    try (GradientCollector gc = Engine.getInstance().newGradientCollector(block)) {
Contributor commented:
If we now support both newGradientCollector() and newGradientCollector(block), we should probably have tests for both of them

@KexinFeng (Author) replied:
The test of newGradientCollector(block) is already covered in testClearGradients. Do we need to add a separate test, or should we rename that test to, e.g., testNewGradientCollectorWithClearGradients?
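For reference, a minimal sketch of what a dedicated test for the block-scoped overload could look like (the test name, shapes, and TestNG assertion are hypothetical additions; newGradientCollector(Block) is the overload from this PR):

    public void testNewGradientCollectorWithBlock() {
        try (NDManager manager = NDManager.newBaseManager()) {
            Block block = Linear.builder().setUnits(1).build();
            block.initialize(manager, DataType.FLOAT32, new Shape(1, 1));

            NDArray a = manager.create(new float[] {2f}, new Shape(1, 1));
            a.setRequiresGradient(true);

            // Block-scoped collector: only this block's parameter gradients get zeroed.
            try (GradientCollector gc = Engine.getInstance().newGradientCollector(block)) {
                NDArray b = a.mul(2);
                gc.backward(b);
            }
            // d(2a)/da == 2
            Assert.assertEquals(a.getGradient().toFloatArray()[0], 2f, 1e-6f);
        }
    }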


zeroGradients();
Contributor commented:
I think we still want the zeroGradients here. Otherwise, it won't work correctly for cases without a block

@KexinFeng (Author) replied on Dec 19, 2022:
The reason for removing it was mainly the efficiency and memory issue #2210.
Should we add this back and then do

    public GradientCollector newGradientCollector() {
        return manager.getEngine().newGradientCollector(getModel().getBlock());
    }

Does this avoid the global search scope issue?

Update: this code does not seem to work inside PtEngine.java or the other Engine files.

api/src/main/java/ai/djl/training/GradientCollector.java (outdated, resolved)
* @return a new instance of {@link GradientCollector}
*/
public abstract GradientCollector newGradientCollector(Block block);

/**
* Returns a new instance of {@link ParameterServer}.
Contributor commented:
Can you also update trainer.newGradientCollector() to use this? You can change its definition to:

    public GradientCollector newGradientCollector() {
        return manager.getEngine().newGradientCollector(getModel().getBlock());
    }

@KexinFeng (Author) replied on Dec 19, 2022:
The tricky part is that inside newGradientCollector() zeroGradients() is not called, while inside newGradientCollector(Block block) it is. This is to avoid the efficiency and memory-leak issue with the global search scope discussed last week.

@KexinFeng (Author) added on Dec 19, 2022:
Another consideration against using

    public GradientCollector newGradientCollector() {
        return manager.getEngine().newGradientCollector(getModel().getBlock());
    }

is that newGradientCollector() is defined to create a global GradientCollector, so the block is not specified.


Update:

    public GradientCollector newGradientCollector() {
        return manager.getEngine().newGradientCollector(getModel().getBlock());
    }

This code does not seem to work inside PtEngine.java or the other Engine files.
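A possible explanation and split (a sketch, not verified against the PR branch): Engine has no reference to a Model, so getModel() is only available on Trainer; the Engine overload would stay block-agnostic while the Trainer narrows the scope:

    // In PtEngine (sketch; the Block overload is what this PR proposes):
    @Override
    public GradientCollector newGradientCollector(Block block) {
        return new PtGradientCollector(block);
    }

    // In Trainer (sketch): the Trainer knows its Model, so it can pass the block.
    public GradientCollector newGradientCollector() {
        return manager.getEngine().newGradientCollector(getModel().getBlock());
    }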

@enpasos (Contributor) commented on Dec 20, 2022:
Some thoughts after looking into the code and testing:

  • I follow @zachgk's argument that we need the zeroGradients() functionality whenever we get a new GradientCollector. I don't see how we can remove it.
  • The time spent in zeroGradients() goes into getting all the NDArrays (see the JProfiler snapshot attached to the original comment). Therefore, if the number of existing NDArrays is not reduced before collecting, knowing the relevant NDArrays for zeroGradients() would be an option to mitigate or remove the negative performance effect. However, it does not remove the root cause.
  • As @zachgk mentioned, in case you go with this approach it must also work when using a Trainer.
  • Minor point: since there is the native JniUtils.zeroGrad((PtNDArray) array), couldn't it replace array.getGradient().subi(array.getGradient())? (See the sketch after this list.)
  • Are you sure that you get all relevant NDArrays with gradients by getting the parameter NDArrays from the block?
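A small sketch of the two alternatives mentioned in the last two points (both calls are quoted from the comment above; JniUtils/PtNDArray tie the code to the PyTorch engine, while subi stays engine-agnostic):

    // Engine-agnostic: zero the gradient in place with NDArray arithmetic.
    NDArray grad = array.getGradient();
    grad.subi(grad);

    // PyTorch-engine-specific: go through the native helper in ai.djl.pytorch.jni.
    JniUtils.zeroGrad((PtNDArray) array);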

@KexinFeng (Author) replied on Dec 20, 2022:
As @zachgk mentioned, in case you go with this approach it must also work when using a Trainer.

That would be ideal, but it has the efficiency and memory issue, so we should disable it for now.

Are you sure that you get all relevant NDArrays with gradients by getting the parameter NDArrays from the block?

The GradientCollector is tied to a block; by definition it searches only within that block.
