Add Optimized GPT #210
Conversation
This looks good to me as best I can tell. I did not review the "adapted/inspired" files. Let me know if there is anything specific in there you would like me to review. Would like to get another set of eyes for approval.
For testing, could you please show plots of …
Thanks for the review. Generating the plots was my plan, but with 1B-sized models. I am currently resolving some exceptions that occur with the optimized blocks. All your comments are correct; I will make those readability and style fixes shortly.
Sounds good, 1B model is fine too.
```python
self.attn = MultiheadAttention(cfg, device)
self.dropout_add_ln_1 = DropoutAddLayerNorm(cfg.d_model,
                                            prenorm=True,
```
What does `prenorm=True` mean? Does it make this a pre-(layer)norm block?
Are GPTBlock and OptimizedGPTBlock exactly the same? If so, can we add tests to verify this (along with backward-pass tests)?
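For reference, here is a rough unfused sketch of what the fused op appears to compute. The function name and signature below are illustrative, not the actual HazyResearch API, and the `prenorm=True` behavior (returning both the normalized output and the raw residual sum) is my reading of that op rather than something stated in this PR:

```python
import torch.nn as nn
import torch.nn.functional as F

def dropout_add_layer_norm_ref(x0, residual, ln: nn.LayerNorm,
                               dropout_p: float = 0.0, prenorm: bool = True,
                               training: bool = False):
    # Unfused reference: dropout the sub-block output, add it to the residual
    # stream, then layernorm the sum.
    residual_out = F.dropout(x0, p=dropout_p, training=training) + residual
    normed = ln(residual_out)
    # With prenorm=True, return both tensors: `normed` feeds the next
    # attention/MLP sub-block, while the un-normalized `residual_out` keeps
    # flowing down the residual stream, so the block stays pre-LN overall.
    return (normed, residual_out) if prenorm else normed
```

As for GPTBlock vs. OptimizedGPTBlock, an equivalence test could build a small model with each block type (dropout disabled), copy the weights across, and compare outputs and parameter gradients with `torch.allclose` after a forward and backward pass.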
So it looks like … right? I can't immediately see it, but it makes me wonder if there's a better design pattern for this... maybe a model without the concept of a gpt_block, so we don't need to pass …
Closing this PR for now, since #251, containing Fused Cross Entropy and vocab size updates, has been merged. Adding DropoutAddLayerNorm has been paused since it will break prior MosaicGPT checkpoints. This PR may be re-opened if we want to add DropoutAddLayerNorm in the future.
Adds code for an optimized version of MosaicGPT. The user can specify `cfg.gpt_block = 'optimized'` to run this code, and `cfg.gpt_block = 'standard'` (or omit it) to run the standard code.
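For example, a hypothetical sketch of the relevant config keys (the repo's configs are OmegaConf-based, but the surrounding keys and exact nesting here are illustrative; `loss_fn` is described further below):

```python
from omegaconf import OmegaConf

# Hypothetical model-config excerpt: only gpt_block and loss_fn relate to this
# PR; the other keys and the exact nesting are illustrative.
cfg = OmegaConf.create({
    "name": "mosaic_gpt",
    "d_model": 2048,                  # illustrative
    "gpt_block": "optimized",         # or "standard" / omit the key for the unchanged path
    "loss_fn": "torch_crossentropy",  # optional opt-out of the Fused Cross Entropy default
})
```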
This PR contains:

- `src/mosaic_gpt.py` and `src/models/layers/gpt_blocks.py`: Modeling changes to enable the optimized GPT: moving the first layernorm of GPT blocks 2 to n to the end of blocks 1 to n-1, and replacing the final layernorm `ln_f` with `ln_i`. These should be math-equivalent (see the sketch at the end of this description), so if you choose not to use `cfg.gpt_block = 'optimized'`, your model will be unchanged.
- `src/models/layers/gpt_blocks.py`: The new OptimizedGPTBlock itself.
- `src/mosaic_gpt.py`: An option to use Fused Cross Entropy, which is enabled by default and installed via the standard `pip install .[llm]`. Otherwise, you can set `cfg.loss_fn=torch_crossentropy` to disable this and use the standard `torch.nn.CrossEntropyLoss()`.
- `csrc/` folder: C++/CUDA code for the one custom fusion we include. Since the HazyResearch DropoutAddLayerNorm does not support 30B and 70B models, we add code to support these models and require the user to install it.
- `llm/README.md` and `llm/csrc/README.md`: Installation instructions for the dependencies of optimized MosaicGPT.

Along for the ride:

- Changes to `icl_evals/yamls`, `yamls/mosaic_gpt/`, and `mcloud/`

Future Work:

- Remove the `csrc/` folder and include DropoutAddLayerNorm as a single line in a requirements file.
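To illustrate the claimed math-equivalence of the layernorm move in the first bullet above, here is a toy pre-LN stack in plain PyTorch. The names (`mixers`, `standard`, `optimized`) are made up for this sketch; real MosaicGPT blocks contain attention, an MLP, and dropout, and the optimized path fuses the trailing add + layernorm into DropoutAddLayerNorm rather than running them separately:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_blocks = 8, 3
x = torch.randn(2, 4, d_model)

# One toy "mixing" layer per block stands in for attention/MLP; one layernorm
# per block, plus the model-level final layernorm.
mixers = [nn.Linear(d_model, d_model) for _ in range(n_blocks)]
lns = [nn.LayerNorm(d_model) for _ in range(n_blocks)]  # leading LN of each block
ln_f = nn.LayerNorm(d_model)                            # final layernorm

def standard(h):
    # Standard layout: every block starts with its own layernorm; ln_f at the end.
    for mix, ln in zip(mixers, lns):
        h = h + mix(ln(h))
    return ln_f(h)

def optimized(h):
    # Rearranged layout: block i consumes an already-normalized input and ends
    # by normalizing its output for block i+1 (block n's trailing LN plays the
    # role of ln_f). In the real code, that trailing normalize is fused with
    # the residual add (and dropout) by DropoutAddLayerNorm.
    trailing_lns = lns[1:] + [ln_f]
    normed = lns[0](h)          # the initial layernorm ("ln_i")
    for mix, ln_end in zip(mixers, trailing_lns):
        h = h + mix(normed)
        normed = ln_end(h)
    return normed

print(torch.allclose(standard(x), optimized(x)))  # True: same math, different grouping
```

The two functions apply exactly the same operations in the same order; the only change is whether a given layernorm is counted as the start of block i+1 or the end of block i, which is what lets the optimized block fuse it with the preceding dropout and residual add.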