Add Optimized GPT #210

Closed
wants to merge 86 commits into from

Conversation

nik-mosaic
Contributor

@nik-mosaic nik-mosaic commented Mar 2, 2023

Adds code for an optimized version of MosaicGPT. The user can specify cfg.gpt_block = 'optimized' to run this code, and cfg.gpt_block = 'standard' (or omit it) to run the standard code.
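
The sketch below shows the kind of dispatch this option implies. It is illustrative only: GPTBlock and OptimizedGPTBlock are the classes touched by this PR, but the helper function, its signature, and keys such as cfg.n_layers are assumptions rather than the actual code in src/mosaic_gpt.py.

    # Hypothetical sketch of the cfg.gpt_block dispatch (not the repo's actual code).
    from omegaconf import DictConfig
    import torch.nn as nn

    def build_blocks(cfg: DictConfig, device: str,
                     standard_cls: type, optimized_cls: type) -> nn.ModuleList:
        block_name = cfg.get('gpt_block', 'standard')  # omitting the key means 'standard'
        if block_name == 'optimized':
            block_cls = optimized_cls
        elif block_name == 'standard':
            block_cls = standard_cls
        else:
            raise ValueError(f'Unknown cfg.gpt_block: {block_name}')
        # Assumes blocks are constructed as Block(cfg, device), mirroring the diff below.
        return nn.ModuleList([block_cls(cfg, device) for _ in range(cfg.n_layers)])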

This PR contains:

  • In src/mosaic_gpt.py and src/models/layers/gpt_blocks.py: Modeling changes to enable the optimized GPT: moving the first layernorm of GPT blocks 2 through n to the end of blocks 1 through n-1, and replacing the final layernorm ln_f with ln_i (see the sketch after this list). These should be math-equivalent, so if you choose not to use cfg.gpt_block = 'optimized', your model will be unchanged.
  • In src/models/layers/gpt_blocks.py: The new OptimizedGPTBlock itself.
  • In src/mosaic_gpt.py: An option to use Fused Cross Entropy, which is enabled by default and installed via the standard pip install .[llm]. You can set cfg.loss_fn=torch_crossentropy to disable this and use the standard torch.nn.CrossEntropyLoss() instead.
  • In the csrc/ folder: C++/CUDA code for the one custom fusion we include. Since the HazyResearch DropoutAddLayerNorm does not support the 30B and 70B models, we add code to support these models and require the user to install it.
  • In llm/README.md and llm/csrc/README.md: Installation instructions for dependencies for optimized MosaicGPT.
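
As referenced above, here is a minimal, unfused sketch of the layernorm regrouping. It uses stand-in attn/mlp modules and ignores dropout; it is meant to show why the two orderings compute the same function, not to mirror the repo's GPTBlock/OptimizedGPTBlock classes.

    # Standard pre-LN GPT: every LN sits at the start of a sub-layer, plus ln_f at the end.
    def standard_gpt(x, blocks, ln_f):
        for b in blocks:
            x = x + b.attn(b.ln_1(x))
            x = x + b.mlp(b.ln_2(x))
        return ln_f(x)

    # Regrouped: ln_i is block 1's old ln_1; each block now *ends* with the LN that used
    # to start the next block (the last block ends with what used to be ln_f). Every LN
    # now directly follows a residual add, which is the pattern DropoutAddLayerNorm can fuse.
    def rearranged_gpt(x, blocks, ln_i):
        y = ln_i(x)                    # normalized stream
        for b in blocks:
            x = x + b.attn(y)          # residual stream
            x = x + b.mlp(b.ln_2(x))
            y = b.end_ln(x)            # old "next block's ln_1", or old ln_f for the last block
        return y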

Along for the ride:

  • Changes all vocab_sizes in YAMLs to 50304 (icl_evals/yamls, yamls/mosaic_gpt/, mcloud/)

Future Work:

  • FusedMLP has been removed since it doesn't help performance on bf16 for CUDA < 11.8. In the future, when we support PyTorch 2.0 + CUDA 11.8, I will retest this.
  • After this PR, I will make a pull request in the HazyResearch repo to add DropoutAddLayerNorm support for the 30B and 70B model sizes. Then we can remove the csrc/ folder and include DropoutAddLayerNorm as a single line in a requirements file.
  • Bake DropoutAddLayerNorm into our Docker images, as we have done with FlashAttention, so that it does not need to build every time (~20 min build time).

@nik-mosaic nik-mosaic requested a review from dskhudia March 3, 2023 14:59
@nik-mosaic nik-mosaic requested a review from dakinggg March 3, 2023 18:05
Collaborator

@dakinggg dakinggg left a comment

This looks good to me as best I can tell. I did not review the "adapted/inspired" files. Let me know if there is anything specific in there you would like me to review. Would like to get another set of eyes for approval.

Resolved review threads on:

  • examples/llm/README.md
  • examples/llm/csrc/README.md
  • examples/llm/csrc/fused_dense_lib/README.md
  • examples/llm/csrc/fused_dense_lib/fused_dense.cpp
  • examples/llm/yamls/mosaic_gpt/1b.yaml
  • examples/llm/src/models/ops/layer_norm.py
  • examples/llm/src/models/ops/fused_dense.py
  • examples/llm/src/models/mosaic_gpt.py
  • examples/llm/src/models/layers/gpt_blocks.py (2 threads)
@dakinggg
Collaborator

dakinggg commented Mar 3, 2023

For testing, could you please show plots of:

  1. gpt-125 without this PR
  2. gpt-125 standard with this PR
  3. gpt-125 optimized with this PR

@nik-mosaic
Contributor Author

Thanks for the review. Generating the plots was my plan, but with 1B-sized models. I am currently resolving some exceptions that occur with the optimized blocks. All your comments are correct; I will make those readability and style fixes shortly.

@dakinggg
Collaborator

dakinggg commented Mar 3, 2023

Sounds good, 1B model is fine too

Resolved review threads on:

  • examples/llm/README.md (2 threads)
  • examples/llm/src/models/layers/gpt_blocks.py (2 threads)
  • examples/llm/src/models/mosaic_gpt.py (2 threads)
@nik-mosaic
Contributor Author

nik-mosaic commented Mar 20, 2023

[Plots: training loss curves comparing the main-branch, standard, and optimized 125M runs]

Zooming in, we see that the loss curves for the standard and optimized blocks are within the margin of run-to-run non-determinism. Shown in light and dark blue are two trials of the 125M model training run from the main branch.
[Plot: zoomed-in view of the loss curves]


From the diff under discussion:

    self.attn = MultiheadAttention(cfg, device)
    self.dropout_add_ln_1 = DropoutAddLayerNorm(cfg.d_model,
                                                prenorm=True,
Contributor

@vchiley vchiley Mar 21, 2023

What does prenorm=True mean?
Does this mean it makes the block a pre-(layer)norm block?

Are GPTBlock and OptimizedGPTBlock exactly the same?
If yes, can we add tests to verify this (along with bwd-pass tests)?
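
A sketch of the kind of test being asked for, written against toy stand-in blocks (Linear layers in place of attention/MLP, no dropout) rather than the repo's actual GPTBlock/OptimizedGPTBlock, whose constructors and forward signatures this sketch does not assume. It checks that the standard grouping and the regrouped version agree in both outputs and gradients.

    import torch
    import torch.nn as nn

    class ToyBlock(nn.Module):
        """Stand-in pre-LN block: Linear layers replace attention and the MLP."""
        def __init__(self, d: int):
            super().__init__()
            self.ln_1 = nn.LayerNorm(d)
            self.attn = nn.Linear(d, d)
            self.ln_2 = nn.LayerNorm(d)
            self.mlp = nn.Linear(d, d)

    def standard(x, blocks, ln_f):
        for b in blocks:
            x = x + b.attn(b.ln_1(x))
            x = x + b.mlp(b.ln_2(x))
        return ln_f(x)

    def rearranged(x, blocks, ln_f):
        # blocks[0].ln_1 plays the role of ln_i; each block ends with the next
        # block's ln_1, and the last block ends with ln_f.
        y = blocks[0].ln_1(x)
        for i, b in enumerate(blocks):
            x = x + b.attn(y)
            x = x + b.mlp(b.ln_2(x))
            y = blocks[i + 1].ln_1(x) if i + 1 < len(blocks) else ln_f(x)
        return y

    def test_fwd_bwd_equivalence():
        torch.manual_seed(0)
        d, n_blocks = 16, 3
        blocks = nn.ModuleList([ToyBlock(d) for _ in range(n_blocks)])
        ln_f = nn.LayerNorm(d)
        x = torch.randn(2, 5, d, requires_grad=True)

        out_std = standard(x, blocks, ln_f)
        (grad_std,) = torch.autograd.grad(out_std.sum(), x, retain_graph=True)
        out_opt = rearranged(x, blocks, ln_f)
        (grad_opt,) = torch.autograd.grad(out_opt.sum(), x)

        assert torch.allclose(out_std, out_opt, atol=1e-6)
        assert torch.allclose(grad_std, grad_opt, atol=1e-6)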

@vchiley
Copy link
Contributor

vchiley commented Mar 21, 2023

So it looks like DropoutAddLayerNorm was designed to work with PostLN, but GPT is a PreLN-style network.
To shoehorn the op into working with PreLN networks, we have to pass around a as well as x:

        a: torch.Tensor,
        x: torch.Tensor,

right?

I can't immediately see it, but it makes me wonder if there's a better design pattern for this...

Maybe a model without the concept of a gpt_block, so we don't need to pass a back and forth; it's all just in the model.
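
For reference, an unfused PyTorch sketch of the semantics described here (illustrative only; it does not reproduce the HazyResearch kernel or its actual signature): with prenorm=True the op hands back both the normalized tensor fed to the next sub-layer and the raw residual sum carried forward, which is why a travels alongside x.

    import torch
    import torch.nn as nn

    def dropout_add_ln_reference(x0: torch.Tensor,
                                 residual: torch.Tensor,
                                 ln: nn.LayerNorm,
                                 dropout: nn.Dropout,
                                 prenorm: bool = True):
        a = residual + dropout(x0)       # residual add + dropout (fused in the real kernel)
        y = ln(a)                        # normalized output for the next sub-layer
        # PostLN blocks only need y; PreLN blocks must also keep a (the un-normalized
        # residual stream), hence the extra tensor being passed around.
        return (y, a) if prenorm else y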

@nik-mosaic
Contributor Author

Closing this PR for now, since #251, containing Fused Cross Entropy and vocab size updates, has been merged. Adding DropoutAddLayerNorm has been paused since it will break prior MosaicGPT checkpoints. This PR may be re-opened if we want to add DropoutAddLayerNorm in the future.

@nik-mosaic nik-mosaic closed this Mar 31, 2023