Add Optimized GPT #210

Closed
wants to merge 86 commits into from

Conversation

nik-mosaic
Contributor

@nik-mosaic nik-mosaic commented Mar 2, 2023

Adds code for an optimized version of MosaicGPT. The user can specify cfg.gpt_block = 'optimized' to run this code, and cfg.gpt_block = 'standard' (or omit it) to run the standard code.
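
The sketch below shows the kind of dispatch this option implies. It is illustrative only: GPTBlock and OptimizedGPTBlock are the classes touched by this PR, but the helper function, its signature, and keys such as cfg.n_layers are assumptions rather than the actual code in src/mosaic_gpt.py.

    # Hypothetical sketch of the cfg.gpt_block dispatch (not the repo's actual code).
    from omegaconf import DictConfig
    import torch.nn as nn

    def build_blocks(cfg: DictConfig, device: str,
                     standard_cls: type, optimized_cls: type) -> nn.ModuleList:
        block_name = cfg.get('gpt_block', 'standard')  # omitting the key means 'standard'
        if block_name == 'optimized':
            block_cls = optimized_cls
        elif block_name == 'standard':
            block_cls = standard_cls
        else:
            raise ValueError(f'Unknown cfg.gpt_block: {block_name}')
        # Assumes blocks are constructed as Block(cfg, device), mirroring the diff below.
        return nn.ModuleList([block_cls(cfg, device) for _ in range(cfg.n_layers)])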

This PR contains:

  • In src/mosaic_gpt.py and src/models/layers/gpt_blocks.py: Modeling changes to enable the optimized GPT: moving the first layernorm of GPT blocks 2 through n to the end of blocks 1 through n-1, and replacing the final layernorm ln_f with ln_i (see the sketch after this list). These should be math-equivalent, so if you choose not to use cfg.gpt_block = 'optimized', your model will be unchanged.
  • In src/models/layers/gpt_blocks.py: The new OptimizedGPTBlock itself.
  • In src/mosaic_gpt.py: An option to use Fused Cross Entropy, which is enabled by default and installed via the standard pip install .[llm]. You can set cfg.loss_fn=torch_crossentropy to disable this and use the standard torch.nn.CrossEntropyLoss() instead.
  • In the csrc/ folder: C++/CUDA code for the one custom fusion we include. Since the HazyResearch DropoutAddLayerNorm does not support the 30B and 70B models, we add code to support these models and require the user to install it.
  • In llm/README.md and llm/csrc/README.md: Installation instructions for dependencies for optimized MosaicGPT.
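
As referenced above, here is a minimal, unfused sketch of the layernorm regrouping. It uses stand-in attn/mlp modules and ignores dropout; it is meant to show why the two orderings compute the same function, not to mirror the repo's GPTBlock/OptimizedGPTBlock classes.

    # Standard pre-LN GPT: every LN sits at the start of a sub-layer, plus ln_f at the end.
    def standard_gpt(x, blocks, ln_f):
        for b in blocks:
            x = x + b.attn(b.ln_1(x))
            x = x + b.mlp(b.ln_2(x))
        return ln_f(x)

    # Regrouped: ln_i is block 1's old ln_1; each block now *ends* with the LN that used
    # to start the next block (the last block ends with what used to be ln_f). Every LN
    # now directly follows a residual add, which is the pattern DropoutAddLayerNorm can fuse.
    def rearranged_gpt(x, blocks, ln_i):
        y = ln_i(x)                    # normalized stream
        for b in blocks:
            x = x + b.attn(y)          # residual stream
            x = x + b.mlp(b.ln_2(x))
            y = b.end_ln(x)            # old "next block's ln_1", or old ln_f for the last block
        return y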

Along for the ride:

  • Changes all vocab_sizes in YAMLs to 50304 (icl_evals/yamls, yamls/mosaic_gpt/, mcloud/)

Future Work:

  • FusedMLP has been removed since it doesn't help performance on bf16 for CUDA < 11.8. In the future, when we support PyTorch 2.0 + CUDA 11.8, I will retest this.
  • After this PR, I will make a pull request in the HazyResearch repo to add DropoutAddLayerNorm support for the 30B and 70B model sizes. Then we can remove the csrc/ folder and include DropoutAddLayerNorm as a single line in a requirements file.
  • Bake DropoutAddLayerNorm into our Docker images, as we have done with FlashAttention, so that it does not need to build every time (~20 min build time).

@nik-mosaic nik-mosaic requested a review from dskhudia March 3, 2023 14:59
@nik-mosaic nik-mosaic requested a review from dakinggg March 3, 2023 18:05
Collaborator

@dakinggg dakinggg left a comment

This looks good to me as best I can tell. I did not review the "adapted/inspired" files. Let me know if there is anything specific in there you would like me to review. Would like to get another set of eyes for approval.

Resolved review threads on:

  • examples/llm/README.md
  • examples/llm/csrc/README.md
  • examples/llm/csrc/fused_dense_lib/README.md
  • examples/llm/csrc/fused_dense_lib/fused_dense.cpp
  • examples/llm/yamls/mosaic_gpt/1b.yaml
  • examples/llm/src/models/ops/layer_norm.py
  • examples/llm/src/models/ops/fused_dense.py
  • examples/llm/src/models/mosaic_gpt.py
  • examples/llm/src/models/layers/gpt_blocks.py (2 threads)
@dakinggg
Collaborator

dakinggg commented Mar 3, 2023

For testing, could you please show plots of:

  1. gpt-125 without this PR
  2. gpt-125 standard with this PR
  3. gpt-125 optimized with this PR

@nik-mosaic
Contributor Author

Thanks for the review. Generating the plots was my plan, but with 1B-sized models. I am currently resolving some exceptions that occur with the optimized blocks. All your comments are correct; I will make those readability and style fixes shortly.

@dakinggg
Collaborator

dakinggg commented Mar 3, 2023

Sounds good, 1B model is fine too

Resolved review threads on:

  • examples/llm/README.md (2 threads)
  • examples/llm/src/models/layers/gpt_blocks.py (2 threads)
  • examples/llm/src/models/mosaic_gpt.py (2 threads)
@nik-mosaic
Contributor Author

nik-mosaic commented Mar 20, 2023

[Plots: training loss curves comparing the main-branch, standard, and optimized 125M runs]

Zooming in, we see that the loss curves for the standard and optimized blocks are within the margin of run-to-run non-determinism. Shown in light and dark blue are two trials of the 125M model training run from the main branch.
[Plot: zoomed-in view of the loss curves]


From the diff under discussion:

    self.attn = MultiheadAttention(cfg, device)
    self.dropout_add_ln_1 = DropoutAddLayerNorm(cfg.d_model,
                                                prenorm=True,
Contributor

@vchiley vchiley Mar 21, 2023

What does prenorm=True mean?
Does this mean it makes the block a pre-(layer)norm block?

Are GPTBlock and OptimizedGPTBlock exactly the same?
If yes, can we add tests to verify this (along with bwd-pass tests)?
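
A sketch of the kind of test being asked for, written against toy stand-in blocks (Linear layers in place of attention/MLP, no dropout) rather than the repo's actual GPTBlock/OptimizedGPTBlock, whose constructors and forward signatures this sketch does not assume. It checks that the standard grouping and the regrouped version agree in both outputs and gradients.

    import torch
    import torch.nn as nn

    class ToyBlock(nn.Module):
        """Stand-in pre-LN block: Linear layers replace attention and the MLP."""
        def __init__(self, d: int):
            super().__init__()
            self.ln_1 = nn.LayerNorm(d)
            self.attn = nn.Linear(d, d)
            self.ln_2 = nn.LayerNorm(d)
            self.mlp = nn.Linear(d, d)

    def standard(x, blocks, ln_f):
        for b in blocks:
            x = x + b.attn(b.ln_1(x))
            x = x + b.mlp(b.ln_2(x))
        return ln_f(x)

    def rearranged(x, blocks, ln_f):
        # blocks[0].ln_1 plays the role of ln_i; each block ends with the next
        # block's ln_1, and the last block ends with ln_f.
        y = blocks[0].ln_1(x)
        for i, b in enumerate(blocks):
            x = x + b.attn(y)
            x = x + b.mlp(b.ln_2(x))
            y = blocks[i + 1].ln_1(x) if i + 1 < len(blocks) else ln_f(x)
        return y

    def test_fwd_bwd_equivalence():
        torch.manual_seed(0)
        d, n_blocks = 16, 3
        blocks = nn.ModuleList([ToyBlock(d) for _ in range(n_blocks)])
        ln_f = nn.LayerNorm(d)
        x = torch.randn(2, 5, d, requires_grad=True)

        out_std = standard(x, blocks, ln_f)
        (grad_std,) = torch.autograd.grad(out_std.sum(), x, retain_graph=True)
        out_opt = rearranged(x, blocks, ln_f)
        (grad_opt,) = torch.autograd.grad(out_opt.sum(), x)

        assert torch.allclose(out_std, out_opt, atol=1e-6)
        assert torch.allclose(grad_std, grad_opt, atol=1e-6)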

@vchiley
Copy link
Contributor

vchiley commented Mar 21, 2023

So it looks like DropoutAddLayerNorm was designed to work with PostLN, but GPT is a PreLN-style network.
To shoehorn the op into working with PreLN networks, we have to pass around a as well as x:

        a: torch.Tensor,
        x: torch.Tensor,

right?

I can't immediately see it, but it makes me wonder if there's a better design pattern for this...

Maybe a model without the concept of a gpt_block, so we don't need to pass a back and forth; it's all just in the model.
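
For reference, an unfused PyTorch sketch of the semantics described here (illustrative only; it does not reproduce the HazyResearch kernel or its actual signature): with prenorm=True the op hands back both the normalized tensor fed to the next sub-layer and the raw residual sum carried forward, which is why a travels alongside x.

    import torch
    import torch.nn as nn

    def dropout_add_ln_reference(x0: torch.Tensor,
                                 residual: torch.Tensor,
                                 ln: nn.LayerNorm,
                                 dropout: nn.Dropout,
                                 prenorm: bool = True):
        a = residual + dropout(x0)       # residual add + dropout (fused in the real kernel)
        y = ln(a)                        # normalized output for the next sub-layer
        # PostLN blocks only need y; PreLN blocks must also keep a (the un-normalized
        # residual stream), hence the extra tensor being passed around.
        return (y, a) if prenorm else y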

@nik-mosaic
Contributor Author

Closing this PR for now, since #251, containing Fused Cross Entropy and vocab size updates, has been merged. Adding DropoutAddLayerNorm has been paused since it will break prior MosaicGPT checkpoints. This PR may be re-opened if we want to add DropoutAddLayerNorm in the future.

@nik-mosaic nik-mosaic closed this Mar 31, 2023