
Releases: mosaicml/examples

v0.0.4

11 Apr 01:44
52cd4fe

🚀 Examples v0.0.4

Examples v0.0.4 is released! We've been hard at work adding features, fixing bugs, and improving our starter code for training models using MosaicML's stack!

To get started, either clone or fork this repo and install whichever example(s) you're interested in. E.g., to get started training GPT-style Large Language Models, just:

git clone https://github.com/mosaicml/examples.git
cd examples # cd into the repo
pip install -e ".[llm]"  # or pip install -e ".[llm-cpu]" if no NVIDIA GPU
cd examples/llm # cd into the specific example's folder

Available examples include llm, stable-diffusion, bert, resnet-cifar, resnet-imagenet, deeplab, nemo, and gpt-neox.

New Features

  1. Lots of improvements to our MosaicGPT example code, resulting in improved throughput and ease of use!

    • Updated throughput and MFU numbers (#271)
    • Various model architecture configuration options, including layer norm on keys and queries (#174), clipping of QKV (#197), omitting biases (#201), scaling the softmax (#209), more advanced weight initialization functions (#204, #220, #226), logit scaling (#221), better defaults (#270)
    • MosaicGPT is now a HuggingFace PreTrainedModel (#243, #252, #256); see the usage sketch after this list
    • Support for PrefixLM and UL2 style training (#179, #189, #235, #248)
    • Refactor the different attention implementations to all have compatible state dicts (#240)
    • Add support for KV caching (#244)
    • Fused Cross Entropy loss function (#251)
    • Full support for ALiBi with triton and torch implementations of attention
    • Support for "within sequence" attention when packing sequences together (#266)
    • Useful callbacks and optimizers for resuming runs that encountered a loss spike (#246)
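    Because MosaicGPT now subclasses transformers.PreTrainedModel, the standard HuggingFace save/load/generate workflow applies, and generation can take advantage of the new KV cache. Below is a minimal sketch of that pattern; it uses gpt2 purely as a stand-in checkpoint, since the actual MosaicGPT checkpoint name or path depends on your own training run.

    # Minimal sketch of the PreTrainedModel workflow. "gpt2" is a stand-in
    # checkpoint; substitute your exported MosaicGPT weights.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Any PreTrainedModel supports the standard save/load round trip.
    model.save_pretrained("./my-checkpoint")
    model = AutoModelForCausalLM.from_pretrained("./my-checkpoint")

    # use_cache=True exercises KV caching during autoregressive decoding.
    inputs = tokenizer("MosaicML examples make it easy to", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=32, use_cache=True)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))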
  2. A new stable diffusion finetuning example! (#85)

    We've added an example of how to finetune stable diffusion using Composer and the MosaicML platform. Check out the README for more information.

  3. Updated ONNX export (#283) and text generation (#277) example scripts
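    As a rough illustration of the idea behind the ONNX export path (not the repo's script itself), a HuggingFace causal LM can be exported with torch.onnx.export once caching and dict-style outputs are disabled; the checkpoint below is a stand-in and the opset and axis names are illustrative.

    # Rough ONNX export sketch; see the repo's export script (#283) for the
    # supported path. "gpt2" is only a stand-in checkpoint.
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.config.use_cache = False    # keep past_key_values out of the graph
    model.config.return_dict = False  # return plain tuples for the tracer
    model.eval()

    dummy_input = torch.randint(0, model.config.vocab_size, (1, 8))
    torch.onnx.export(
        model,
        (dummy_input,),
        "model.onnx",
        input_names=["input_ids"],
        output_names=["logits"],
        dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                      "logits": {0: "batch", 1: "seq"}},
        opset_version=14,
    )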

  4. Version upgrades (#175, #242, #273, #275)

    Updated versions of PyTorch, Composer, and Streaming.

  5. Added an example of running GPT-NeoX on the MosaicML platform (#195)

Deprecations and API changes

  1. convert_c4.py renamed to convert_dataset.py (#162)

    We renamed the dataset conversion script, and generalized it to work more easily with different input datasets.

  2. Renamed cifar and resnet to resnet-cifar and resnet-imagenet, respectively (#173)

v0.0.3

15 Feb 02:45
f46baab

🚀 Examples v0.0.3

Examples v0.0.3 is released! We've been hard at work adding features, fixing bugs, and improving our starter code for training models using MosaicML's stack!

To get started, either clone or fork this repo and install whichever example(s) you're interested in. E.g., to get started training GPT-style Large Language Models, just:

git clone https://github.com/mosaicml/examples.git
cd examples # cd into the repo
pip install -e ".[llm]"  # or pip install -e ".[llm-cpu]" if no NVIDIA GPU
cd examples/llm # cd into the specific example's folder

Available examples include bert, cifar, llm, resnet, and deeplab.

New Features

  1. Tooling for computing throughput and Model FLOPs Utilization (MFU) using MosaicML Cloud (#53, #56, #71, #117, #152)

    We've made it easier to benchmark throughput and MFU on our Large Language Model (LLM) stack. The SpeedMonitor has been extended to report MFU. It is on by default for our MosaicGPT examples, and can be easily added to your own code by defining num_fwd_flops for your model and adding the SpeedMonitorMFU callback to the Trainer. See the callback for details!

    We've also used our MCLI SDK to easily measure throughput and MFU of our LLMs across a range of parameters. The tools and results are in our throughput folder. Stay tuned for an update with the latest numbers!
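    For intuition, MFU is simply the model FLOPs per second you actually achieve divided by the hardware's peak FLOPs per second. The sketch below uses the common ~6 * n_parameters FLOPs-per-token approximation (attention FLOPs ignored) and an assumed A100 bf16 peak of 312 TFLOP/s; the SpeedMonitorMFU callback computes this for you from your model's num_fwd_flops.

    # Back-of-the-envelope MFU, assuming ~6 * N FLOPs per token (attention
    # FLOPs ignored) and A100 bf16 peak throughput of 312 TFLOP/s per GPU.
    def approx_mfu(n_params: float, tokens_per_sec_per_gpu: float,
                   peak_flops_per_sec: float = 312e12) -> float:
        """Fraction of peak FLOPs actually used during training."""
        achieved_flops_per_sec = 6 * n_params * tokens_per_sec_per_gpu
        return achieved_flops_per_sec / peak_flops_per_sec

    # e.g. a 1.3B-parameter model training at 18k tokens/sec per GPU:
    print(f"MFU ~= {approx_mfu(1.3e9, 18_000):.2%}")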

  2. Upgrade to the latest versions of Composer, Streaming, and Flash Attention (#54, #61, #118, #124)

    We've upgraded all our examples to use the latest versions of Composer, Streaming, and Flash Attention. This means speed improvements, new features, and deterministic, elastic, mid-epoch resumption, thanks to our Streaming library!

  3. The repo is now pip installable from source (#76, #90)

    The repo can now be easily installed for whichever example you are interested in using. For example, to install components for the llm example, navigate to the root and run pip install -e ".[llm]". We will be putting the package on PyPI soon!

  4. Support for FSDP wrapping more HuggingFace models (#83, #106)

    We've added support for using FSDP to wrap more types of HuggingFace models like BLOOM and OPT.
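    As a minimal sketch of the pattern (the config values and synthetic data below are illustrative, not the repo's tuned settings): wrap the HuggingFace model with Composer's HuggingFaceModel and hand the Trainer an fsdp_config, then launch with the composer launcher, e.g. composer -n 8 train_fsdp.py.

    # Minimal FSDP sketch: wrap an OPT model with Composer and pass an
    # fsdp_config to the Trainer. Values are illustrative, not tuned.
    import torch
    from torch.utils.data import DataLoader, Dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from composer import Trainer
    from composer.models import HuggingFaceModel

    class TinyLMDataset(Dataset):
        """Synthetic stand-in data so the sketch is self-contained."""
        def __init__(self, vocab_size: int, n: int = 64, seq_len: int = 128):
            ids = torch.randint(0, vocab_size, (n, seq_len))
            self.examples = [{"input_ids": x, "attention_mask": torch.ones_like(x),
                              "labels": x} for x in ids]
        def __len__(self):
            return len(self.examples)
        def __getitem__(self, i):
            return self.examples[i]

    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
    hf_model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
    model = HuggingFaceModel(hf_model, tokenizer=tokenizer)

    trainer = Trainer(
        model=model,
        train_dataloader=DataLoader(TinyLMDataset(tokenizer.vocab_size), batch_size=4),
        max_duration="10ba",
        fsdp_config={"sharding_strategy": "FULL_SHARD", "mixed_precision": "DEFAULT"},
    )
    trainer.fit()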

  5. In-Context Learning (ICL) evaluation metrics (#116)

    The ICL evaluation tools from Composer 0.12.1 are now available for evaluating Causal LMs on tasks like LAMBADA, HellaSwag, PIQA, etc. See the llm/icl_eval/ folder for templates. These ICL metrics can also be measured live during training with minimal overhead. Please see our blog post for more details.

  6. Simple BERT finetuning example (#141)

    In addition to our example of finetuning BERT on the full suite of GLUE tasks, we've added an example of finetuning on a single sequence classification dataset. This should be a simpler entry point to finetuning BERT than our GLUE example, without all the bells and whistles.

  7. NeMo Megatron example (#84, #138)

    We've added a simple example of how to get started running NeMo Megatron on MCloud!

Deprecations

  1. 🚨 group_method argument for StreamingTextDataset replaced 🚨 (#128)

    To support deterministic shuffling with elastic resumption, we can no longer concatenate text examples on the fly in the dataloader, so we have deprecated the group_method argument of StreamingTextDataset. To use concatenated text (standard practice for pretraining LLMs), run the convert_c4.py script with the --concat_tokens option. This will pretokenize your dataset and pack sequences together up to the maximum sequence length so that your pretraining examples have no padding. To get the equivalent of the old truncate behavior, run convert_c4.py without the --concat_tokens option, and the dataloader will truncate or pad sequences to the maximum sequence length on the fly.
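    For intuition, the --concat_tokens path boils down to tokenizing each document, concatenating the token streams (with an EOS token between documents), and slicing the result into fixed-length blocks so no padding is needed. The toy sketch below illustrates only that packing step; it is not the convert_c4.py implementation and skips details like shard writing and how the EOS separator is configured.

    # Toy illustration of token packing (the effect of --concat_tokens);
    # this is NOT the actual convert_c4.py implementation.
    from typing import Iterable, Iterator, List

    def pack_tokens(docs: Iterable[List[int]], max_seq_len: int,
                    eos_token_id: int) -> Iterator[List[int]]:
        """Concatenate tokenized docs (EOS-separated) and yield fixed-length,
        padding-free training examples."""
        buffer: List[int] = []
        for doc in docs:
            buffer.extend(doc + [eos_token_id])
            while len(buffer) >= max_seq_len:
                yield buffer[:max_seq_len]
                buffer = buffer[max_seq_len:]
        # Leftover tokens shorter than max_seq_len are simply dropped here.

    # Example: three tiny "documents", max_seq_len=8, EOS id 0.
    docs = [[5, 6, 7], [8, 9, 10, 11, 12], [13, 14, 15, 16]]
    for example in pack_tokens(docs, max_seq_len=8, eos_token_id=0):
        print(example)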