
Specify the block that PtGradientCollector belongs to #2232

Closed · wants to merge 1 commit

Conversation

@KexinFeng (Contributor) commented on Dec 15, 2022

The zeroGradients() call added in #2101 (to solve #2024) when constructing PtGradientCollector() hurts efficiency and memory usage, as shown in #2210. The block therefore has to be specified so that zeroGradients() is called only on that block's parameters, narrowing the search scope. Another relevant PR is #2111.

The original PtGradientCollector() is left unchanged for backward compatibility. A new PtGradientCollector(Block) is added so that the gradient collector is restricted to operating on a specific block; multiple such collectors can then coexist. @zachgk I added this point to the docs, please take a look. The relevant PR is #2111.

In essence, this PR disables zeroGradients() in the default case, unless a block is specified when creating a GradientCollector, because of the issues caused by the global search inside zeroGradients().
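A minimal usage sketch of the block-scoped collector (it assumes the newGradientCollector(Block) overload proposed in this PR; the Linear block and shapes are placeholders, not from the PR diff):

    // Sketch only: newGradientCollector(Block) is the overload proposed in this PR.
    try (NDManager manager = NDManager.newBaseManager()) {
        Block block = Linear.builder().setUnits(1).build();
        block.initialize(manager, DataType.FLOAT32, new Shape(1, 3));

        NDArray x = manager.create(new float[] {1f, 2f, 3f}, new Shape(1, 3));
        x.setRequiresGradient(true);

        // The collector is tied to `block`, so zeroGradients() only touches its parameters.
        try (GradientCollector gc = Engine.getInstance().newGradientCollector(block)) {
            NDArray y = x.mul(2);
            gc.backward(y);
        }
    }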


To fully solve issue #2024 in the Dive into Deep Learning book, how the gradient is accumulated should be specified in the future, either "write" or "add". "retain_graph" should also be fixed. Under certain conditions an error should be thrown, for example if retain_graph=false and backward is called repeatedly. This way the ambiguity in the definition of repeated backward calls, as shown in #2024, will be resolved.

In PyTorch, by default gradients are accumulated (added) across repeated backward calls when retain_graph=True:

    import torch as pt

    x1 = pt.tensor([1., 2., 3., 4.], requires_grad=True)
    x2 = pt.tensor([5., 6., 7., 8.], requires_grad=True)
    u = x1 * x2
    z = u * x1
    z.sum().backward(retain_graph=True)
    print("\n")
    print(x1.grad, x2.grad)

    z.sum().backward(retain_graph=True)
    print(x1.grad, x2.grad)

    z.sum().backward()
    print(x1.grad, x2.grad)

Output:

tensor([10., 24., 42., 64.]) tensor([ 1.,  4.,  9., 16.])
tensor([ 20.,  48.,  84., 128.]) tensor([ 2.,  8., 18., 32.])
tensor([ 30.,  72., 126., 192.]) tensor([ 3., 12., 27., 48.])
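For reference, these numbers match the analytic gradients: with z = x1² · x2, a single backward pass of z.sum() gives ∂/∂x1 = 2 · x1 · x2 = [10, 24, 42, 64] and ∂/∂x2 = x1² = [1, 4, 9, 16]; the second and third rows are exactly 2× and 3× these values, confirming that the gradients are added across the repeated backward calls.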

Note:

  • If backward needs to be called repeatedly, retain_graph=True should be set. This is not a commonly seen use case in practice. See also Test accumulating gradient collector #2111.
  • Block currently does not have a direct call feature like the following PyTorch snippet, but requires a ParameterStore (see the DJL sketch after the snippet).
    import torch
    import torch.nn as nn

    m = nn.Linear(20, 30)
    input = torch.randn(128, 20)
    output = m(input)
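For comparison, a rough DJL sketch of the same forward pass; the extra ParameterStore argument is what the note above refers to (the shapes mirror the PyTorch snippet):

    try (NDManager manager = NDManager.newBaseManager()) {
        Block m = Linear.builder().setUnits(30).build();
        m.initialize(manager, DataType.FLOAT32, new Shape(128, 20));

        NDArray input = manager.randomNormal(new Shape(128, 20));
        // No direct call on the block: forward goes through a ParameterStore.
        NDList output = m.forward(new ParameterStore(manager, false), new NDList(input), false);
    }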

Relevant Stack Overflow question:
https://stackoverflow.com/questions/70998652/djl-gradientcollector-try-with-resources-initialiser-error

@KexinFeng KexinFeng changed the title Temp patch that somewhat optimizes the original implementation in ter… Temp patch that somewhat optimizes the memory usage Dec 15, 2022
@KexinFeng KexinFeng changed the title Temp patch that somewhat optimizes the memory usage Temp patch that optimizes the memory usage Dec 15, 2022
@KexinFeng KexinFeng changed the title Temp patch that optimizes the memory usage Implementation that optimizes the memory usage Dec 15, 2022
@KexinFeng KexinFeng marked this pull request as ready for review December 15, 2022 18:50
@codecov-commenter commented on Dec 15, 2022

Codecov Report

Base: 72.08% // Head: 71.62% // Decreases project coverage by -0.46% ⚠️

Coverage data is based on head (b71f2ae) compared to base (bb5073f).
Patch coverage: 73.80% of modified lines in pull request are covered.

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #2232      +/-   ##
============================================
- Coverage     72.08%   71.62%   -0.47%     
- Complexity     5126     6373    +1247     
============================================
  Files           473      631     +158     
  Lines         21970    28172    +6202     
  Branches       2351     2998     +647     
============================================
+ Hits          15838    20179    +4341     
- Misses         4925     6523    +1598     
- Partials       1207     1470     +263     
Impacted Files Coverage Δ
api/src/main/java/ai/djl/modality/cv/Image.java 69.23% <ø> (-4.11%) ⬇️
...rc/main/java/ai/djl/modality/cv/MultiBoxPrior.java 76.00% <ø> (ø)
...rc/main/java/ai/djl/modality/cv/output/Joints.java 71.42% <ø> (ø)
.../main/java/ai/djl/modality/cv/output/Landmark.java 100.00% <ø> (ø)
...main/java/ai/djl/modality/cv/output/Rectangle.java 72.41% <0.00%> (ø)
...i/djl/modality/cv/translator/BigGANTranslator.java 21.42% <0.00%> (-5.24%) ⬇️
.../modality/cv/translator/ImageFeatureExtractor.java 0.00% <0.00%> (ø)
.../ai/djl/modality/cv/translator/YoloTranslator.java 27.77% <0.00%> (+18.95%) ⬆️
...ain/java/ai/djl/modality/cv/util/NDImageUtils.java 59.21% <0.00%> (ø)
api/src/main/java/ai/djl/modality/nlp/Decoder.java 63.63% <ø> (ø)
... and 564 more


@KexinFeng KexinFeng changed the title Implementation that optimizes the memory usage Specify the block that PtGradientCollector belongs to Dec 16, 2022
    try (GradientCollector gc = Engine.getInstance().newGradientCollector()) {
        NDArray b = a.mul(2);
        gc.backward(b);
    try (GradientCollector gc = Engine.getInstance().newGradientCollector(block)) {
Contributor commented:
If we now support both newGradientCollector() and newGradientCollector(block), we should probably have tests for both of them

@KexinFeng (Author) replied:
The test of newGradientCollector(block) is already covered in testClearGradients. Do we need to add a separate test, or should we rename that test to, e.g., testNewGradientCollectorWithClearGradients?
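For reference, a minimal sketch of what a dedicated test for the block-scoped overload could look like (the test name, shapes, and TestNG assertion are hypothetical additions; newGradientCollector(Block) is the overload from this PR):

    public void testNewGradientCollectorWithBlock() {
        try (NDManager manager = NDManager.newBaseManager()) {
            Block block = Linear.builder().setUnits(1).build();
            block.initialize(manager, DataType.FLOAT32, new Shape(1, 1));

            NDArray a = manager.create(new float[] {2f}, new Shape(1, 1));
            a.setRequiresGradient(true);

            // Block-scoped collector: only this block's parameter gradients get zeroed.
            try (GradientCollector gc = Engine.getInstance().newGradientCollector(block)) {
                NDArray b = a.mul(2);
                gc.backward(b);
            }
            // d(2a)/da == 2
            Assert.assertEquals(a.getGradient().toFloatArray()[0], 2f, 1e-6f);
        }
    }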


zeroGradients();
Contributor commented:
I think we still want the zeroGradients here. Otherwise, it won't work correctly for cases without a block

@KexinFeng (Author) replied on Dec 19, 2022:
The reason for removing it was mainly the efficiency and memory issue #2210.
Should we add this back and then do

    public GradientCollector newGradientCollector() {
        return manager.getEngine().newGradientCollector(getModel().getBlock());
    }

Does this avoid the global search scope issue?

Update: this code does not seem to work inside PtEngine.java or the other Engine files.

api/src/main/java/ai/djl/training/GradientCollector.java (outdated, resolved)
* @return a new instance of {@link GradientCollector}
*/
public abstract GradientCollector newGradientCollector(Block block);

/**
* Returns a new instance of {@link ParameterServer}.
Contributor commented:
Can you also update trainer.newGradientCollector() to use this? You can change its definition to:

    public GradientCollector newGradientCollector() {
        return manager.getEngine().newGradientCollector(getModel().getBlock());
    }

@KexinFeng (Author) replied on Dec 19, 2022:
The tricky part is that inside newGradientCollector() zeroGradients() is not called, while inside newGradientCollector(Block block) it is. This is to avoid the efficiency and memory-leak issue with the global search scope discussed last week.

@KexinFeng (Author) added on Dec 19, 2022:
Another consideration against using

    public GradientCollector newGradientCollector() {
        return manager.getEngine().newGradientCollector(getModel().getBlock());
    }

is that newGradientCollector() is defined to create a global GradientCollector, so the block is not specified.


Update:

    public GradientCollector newGradientCollector() {
        return manager.getEngine().newGradientCollector(getModel().getBlock());
    }

This code does not seem to work inside PtEngine.java or the other Engine files.
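A possible explanation and split (a sketch, not verified against the PR branch): Engine has no reference to a Model, so getModel() is only available on Trainer; the Engine overload would stay block-agnostic while the Trainer narrows the scope:

    // In PtEngine (sketch; the Block overload is what this PR proposes):
    @Override
    public GradientCollector newGradientCollector(Block block) {
        return new PtGradientCollector(block);
    }

    // In Trainer (sketch): the Trainer knows its Model, so it can pass the block.
    public GradientCollector newGradientCollector() {
        return manager.getEngine().newGradientCollector(getModel().getBlock());
    }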

@enpasos (Contributor) commented on Dec 20, 2022:
Some thoughts after looking into the code and testing:

  • I follow @zachgk's argument that we need the zeroGradients() functionality whenever we get a new GradientCollector. I don't see how we can remove it.
  • The time spent in zeroGradients() goes into getting all the NDArrays (see the JProfiler snapshot attached to the original comment). Therefore, if the number of existing NDArrays is not reduced before collecting, knowing the relevant NDArrays for zeroGradients() would be an option to mitigate or remove the negative performance effect. However, it does not remove the root cause.
  • As @zachgk mentioned, in case you go with this approach it must also work when using a Trainer.
  • Minor point: since there is the native JniUtils.zeroGrad((PtNDArray) array), couldn't it replace array.getGradient().subi(array.getGradient())? (See the sketch after this list.)
  • Are you sure that you get all relevant NDArrays with gradients by getting the parameter NDArrays from the block?
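A small sketch of the two alternatives mentioned in the last two points (both calls are quoted from the comment above; JniUtils/PtNDArray tie the code to the PyTorch engine, while subi stays engine-agnostic):

    // Engine-agnostic: zero the gradient in place with NDArray arithmetic.
    NDArray grad = array.getGradient();
    grad.subi(grad);

    // PyTorch-engine-specific: go through the native helper in ai.djl.pytorch.jni.
    JniUtils.zeroGrad((PtNDArray) array);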

@KexinFeng (Author) replied on Dec 20, 2022:
As @zachgk mentioned, in case you go with this approach it must also work when using a Trainer.

That would be ideal, but it has the efficiency and memory issue, so we should disable it for now.

Are you sure that you get all relevant NDArrays with gradients by getting the parameter NDArrays from the block?

The GradientCollector is tied to a block; by definition it searches only within that block.
