[feat] DeepNorm/DeepNet support (#227) #230
Conversation
cc @jramapuram, I think that at a minimum the default weight init was probably not the best, but since it was overridden in your example it probably does not explain #219
following up on that with #230
f977704 to 33a5131
Codecov Report

@@            Coverage Diff             @@
##             main     #230      +/-   ##
==========================================
+ Coverage   92.17%   92.24%    +0.06%
==========================================
  Files          60       60
  Lines        3247     3313       +66
==========================================
+ Hits         2993     3056       +63
- Misses        254      257        +3

Flags with carried forward coverage won't be shown. Continue to review the full report at Codecov.
extended the unit test since then, should be better now
examples/microViT.py
Outdated
@@ -248,7 +248,7 @@ def test_step(self, batch, _):

    # compute total number of steps
    batch_size = BATCH * GPUS
    steps = dm.num_samples // batch_size * MAX_EPOCHS
this should have been fixed earlier, I had a PR stack and somehow lost it with rebases..
@@ -172,7 +172,7 @@ def test_pytorch_tranformer_parity(device=torch.device("cuda")):
        dim_feedforward=4 * EMB,
        dropout=DROP,
        activation=ACTIVATION,
        layer_norm_eps=1e-05,
should have been part of #221 (1e-6 is the default everywhere now); unrelated to this PR, but I spotted it while writing this one
xformers/components/residual.py
Outdated
# CREDITS: the following is inspired by FastAI's Transformer implementation
class Residual(nn.Module):
    """Object-oriented handling of the residual path"""
    """Object-oriented handling of the residual path.
PR change 1: support scaling the residual path
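For readers following the thread, here is a minimal sketch of what a scaled residual wrapper can look like (the names below are illustrative, not the exact xformers API):

```python
import torch.nn as nn


class ScaledResidual(nn.Module):
    """Residual path whose identity branch can be scaled: scale * x + f(x)."""

    def __init__(self, layer: nn.Module, scale: float = 1.0):
        super().__init__()
        self.layer = layer
        self.scale = scale  # scale == 1.0 recovers the classic residual

    def forward(self, x, **kwargs):
        return x * self.scale + self.layer(x, **kwargs)
```

DeepNorm sets this scale to the alpha coefficient computed below; with scale=1.0 it falls back to the usual pre/post-norm residual.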
DeepNormCoefficients = namedtuple("DeepNormCoefficients", ["alpha", "beta"])


def get_deepnorm_coefficients(
PR change 2: get the residual scaling and init scaling given the whole model, following the paper
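As a rough sketch of the paper's formulas for the encoder-only and decoder-only cases (the actual helper in this PR may differ in signature and also has to cover the encoder-decoder case):

```python
from collections import namedtuple

DeepNormCoefficients = namedtuple("DeepNormCoefficients", ["alpha", "beta"])


def deepnorm_coefficients_sketch(n_encoder_layers: int = 0, n_decoder_layers: int = 0):
    """Return (encoder_coefficients, decoder_coefficients) for the simple cases."""
    assert n_encoder_layers > 0 or n_decoder_layers > 0

    if n_decoder_layers == 0:
        # Encoder-only (BERT-like): alpha = (2N)^(1/4), beta = (8N)^(-1/4)
        n = n_encoder_layers
        return DeepNormCoefficients(alpha=(2 * n) ** 0.25, beta=(8 * n) ** -0.25), None

    if n_encoder_layers == 0:
        # Decoder-only (GPT-like): same form, using the number of decoder layers
        m = n_decoder_layers
        return None, DeepNormCoefficients(alpha=(2 * m) ** 0.25, beta=(8 * m) ** -0.25)

    raise NotImplementedError("encoder-decoder coefficients are left out of this sketch")
```

alpha scales the residual branch, and beta is the gain used for the weight init change further down.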
        if layer_norm_style == LayerNormStyle.Pre
        else PostNorm(d_model, Residual(sublayer), use_triton)
    )
    if layer_norm_style == LayerNormStyle.Pre:
PR change 3: handle 3 layernorm options
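Schematically, the three options look like this (illustrative strings instead of the LayerNormStyle enum, and plain callables instead of the PreNorm/PostNorm wrappers used in the PR):

```python
import torch.nn as nn


def wrap_sublayer(style: str, d_model: int, sublayer: nn.Module, alpha: float = 1.0):
    """Return a callable applying the sublayer with the requested norm placement."""
    norm = nn.LayerNorm(d_model, eps=1e-6)
    if style == "pre":
        return lambda x: x + sublayer(norm(x))          # Pre-norm: x + f(LN(x))
    if style == "post":
        return lambda x: norm(x + sublayer(x))          # Post-norm: LN(x + f(x))
    if style == "deepnorm":
        return lambda x: norm(alpha * x + sublayer(x))  # DeepNorm: LN(alpha * x + f(x))
    raise ValueError(f"unknown layer norm style: {style}")
```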
    @classmethod
    def from_config(cls, config: xFormerConfig):
        return cls(config.stack_configs, config.tie_embedding_weights)

    def _deepnorm_weight_init(self):
PR change 4: handle the required init change, as per the paper
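Roughly, the init described in the paper scales Xavier init by beta on the value/output projections of attention and on the feedforward weights. A hedged sketch, with illustrative parameter names that will not match the xformers modules exactly:

```python
import torch.nn as nn


def deepnorm_weight_init_sketch(model: nn.Module, beta: float) -> None:
    """Xavier init everywhere, with gain=beta on the DeepNorm-scaled projections."""
    scaled_keys = ("v_proj", "out_proj", "mlp", "feedforward")  # illustrative names
    for name, module in model.named_modules():
        if not isinstance(module, nn.Linear):
            continue
        gain = beta if any(key in name for key in scaled_keys) else 1.0
        nn.init.xavier_normal_(module.weight, gain=gain)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
```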
cc @stephenroller @suchenzang, just in case you're interested. Small-scale tests, but they seem to confirm the paper.
rebased on top of #233, I'll redo some curves for a better comparison
it does not really change the accuracy difference: the NaN case for pre-norm seems gone, but there's still an accuracy gap in that case in favor of DeepNorm
@@ -33,7 +33,7 @@ class VisionTransformer(pl.LightningModule):
    def __init__(
        self,
        steps,
        learning_rate=1e-3,
this should have been part of #234, lost in a rebase I guess. It's on purpose and seems to work well
Great to have this! :) 🚀
What does this PR do?
Open question: should we support something other than Xavier init for the weights in the full-model case?
TODO
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.