Releases: mosaicml/composer
v0.13.2
🚀 Composer v0.13.2
Introducing the `composer` PyPI package!
Composer v0.13.2 is released!
Composer can also now be installed using the new `composer` PyPI package via `pip`:
pip install composer==0.13.2
The legacy package name still works via `pip`:
pip install mosaicml==0.13.2
Bug Fixes
- test and fix composer package name usage in composer_collect_env (#2049)
- Backward Compat with Torchmetrics by @mvpatel2000 (#2046)
- Fix OCIObjectStore save_overwrite=False bug (#2053)
- busy wait for the rank 0 download (#2071)
- Skip extra downloads when not using a format string (#2073)
What's Changed
- Pin transformers package to <4.27 by @dakinggg in #2076
- Bump version to v0.13.2 (#2068) by @bandish-shah
- Skip extra downloads when not using a format string by @dakinggg in #2073
- add support for autoresume + FSDP + sharding by @dakinggg in #2072
- busy wait for the rank 0 download by @dakinggg in #2071
- Revert "Checkpoints Simplified (#2059)" by @dakinggg in #2070
- Add `device` and `dtype` back to `LPLayerNorm` (#2067) by @abhi-mosaic
- Checkpoints Simplified by @mvpatel2000 in #2059
- Allow `LPLayerNorm` and `LPGroupNorm` to support `self.bias` or `self.weight` = None (#2044) by @abhi-mosaic
- Add `NO_REENTRANT` activation checkpointing (#2042) by @bmosaicml
- pin torchmetrics by @mvpatel2000 in #2065
- Update docs with non-rank zero logs instructions by @hanlint in #2058
- Fix OCIObjectStore save_overwrite=False bug by @eracah in #2053
- Busy wait for local rank 0 download to avoid timeout on large file download by @dakinggg in #2054
- Raise error if attempting to export FSDP model by @hanlint in #2051
- Revert "Checkpoints Simplified (#2041)" by @dakinggg in #2056
- Delete composer package GPU workflow by @dakinggg in #2055
- Add composer PyPI package tests to daily workflow (#2052) by @bandish-shah
- Checkpoints Simplified by @mvpatel2000 in #2041
- update fsdp mixed precision by @vchiley in #2047
- Backward Compat with Torchmetrics by @mvpatel2000 in #2046
- Update FSDP meta weight tying tests to include precision testing by @bcui19 in #2050
- Log nodename information in composer by @eracah in #2043
- test and fix composer package name usage in composer_collect_env by @dakinggg in #2049
- Adjust how HuggingFaceModel handles embedding resizing by @dakinggg in #2027
- Adds a PR guidelines section to contributing.md by @dakinggg in #1993
- Bump pypandoc from 1.10 to 1.11 (#2038) by @dependabot[bot]
- Bump pytest from 7.2.1 to 7.2.2 (#2039) by @dependabot[bot]
- Use follow in mcp script by @mvpatel2000 in #2022
Full Changelog: v0.13.1...v0.13.2
v0.13.1
🚀 Composer v0.13.1
Introducing the `composer` PyPI package!
Composer v0.13.1 is released!
Composer can also now be installed using the new `composer` PyPI package via `pip`:
pip install composer==0.13.1
The legacy package name still works via `pip`:
pip install mosaicml==0.13.1
Note: The `mosaicml==0.13.0` PyPI package was yanked due to some minor packaging issues discovered after release. The package was re-released as Composer v0.13.1, thus these release notes contain details for both v0.13.0 and v0.13.1.
New Features
- 🤙 New and Updated Callbacks
  - New `HealthChecker` Callback (#2002)
    The callback will log a warning if the GPUs on a given node appear to be in poor health (low utilization). The callback can also be configured to send a Slack message!

    ```python
    from composer import Trainer
    from composer.callbacks import HealthChecker

    # Warn if GPU utilization difference drops below 10%
    health_checker = HealthChecker(
        threshold=10,
    )

    # Construct Trainer
    trainer = Trainer(
        ...,
        callbacks=health_checker,
    )

    # Train!
    trainer.fit()
    ```
  - Updated `MemoryMonitor` to use GigaBytes (GB) units (#1940)
  - New `RuntimeEstimator` Callback (#1991)
    Estimate the remaining runtime of your job! Approximates the time remaining by observing the throughput and comparing to the number of batches remaining.

    ```python
    from composer import Trainer
    from composer.callbacks import RuntimeEstimator

    # Construct trainer with RuntimeEstimator callback
    trainer = Trainer(
        ...,
        callbacks=RuntimeEstimator(),
    )

    # Train!
    trainer.fit()
    ```
  - Updated `SpeedMonitor` throughput metrics (#1987)
    Expands throughput metrics to track relative to several different time units and per device:
    - `throughput/batches_per_sec` and `throughput/device/batches_per_sec`
    - `throughput/tokens_per_sec` and `throughput/device/tokens_per_sec`
    - `throughput/flops_per_sec` and `throughput/device/flops_per_sec`
    - `throughput/device/samples_per_sec`

    Also adds a `throughput/device/mfu` metric to compute per-device MFU. Simply enable the `SpeedMonitor` callback per usual to log these new metrics (see the sketch below)! Please see the SpeedMonitor documentation for more information.
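    A minimal sketch of enabling the callback (the `window_size` value is an assumption; the defaults also work):

    ```python
    from composer import Trainer
    from composer.callbacks import SpeedMonitor

    # Enable SpeedMonitor to log the new throughput and MFU metrics
    trainer = Trainer(
        ...,
        callbacks=SpeedMonitor(window_size=100),  # window_size assumed; optional
    )

    # Train!
    trainer.fit()
    ```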
- ⣿ FSDP Sharded Checkpoints (#1902)
  Users can now specify the `state_dict_type` in the `fsdp_config` dictionary to enable sharded checkpoints. For example:

  ```python
  from composer import Trainer

  fsdp_config = {
      'sharding_strategy': 'FULL_SHARD',
      'state_dict_type': 'local',
  }

  trainer = Trainer(
      ...,
      fsdp_config=fsdp_config,
      save_folder='checkpoints',
      save_filename='ba{batch}_rank{rank}.pt',
      save_interval='10ba',
  )
  ```
Please see the PyTorch FSDP docs and Composer's Distributed Training notes for more information.
- 🤗 HuggingFace Improvements
  - Update `HuggingFaceModel` class to support encoder-decoder batches without `decoder_input_ids` (#1950)
  - Allow evaluation metrics to be passed to `HuggingFaceModel` directly (#1971)
  - Add a utility function to load a Composer checkpoint of a `HuggingFaceModel` and write out the expected `config.json` and `pytorch_model.bin` in the HuggingFace pretrained folder (#1974) (see the sketch below)
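  A minimal sketch of the utility from #1974 (the import path and argument order below are assumptions based on the PR description, not confirmed here):

  ```python
  from composer.models import write_huggingface_pretrained_from_composer_checkpoint  # assumed import path

  # Write config.json and pytorch_model.bin for a HuggingFace pretrained folder
  # from an existing Composer checkpoint (paths are illustrative).
  write_huggingface_pretrained_from_composer_checkpoint(
      'checkpoints/ep1.pt',     # Composer checkpoint to convert (illustrative path)
      'hf_pretrained_output/',  # output folder for the HuggingFace files (illustrative path)
  )
  ```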
- 🛟 Nvidia H100 Alpha Support - Added `amp_fp8` data type
  In preparation for H100's arrival, we've added the `amp_fp8` precision type. Currently, setting `amp_fp8` specifies a new precision context using `transformer_engine.pytorch.fp8_autocast`. For more details, please see Nvidia's new Transformer Engine and the specific fp8 recipe we utilize.

  ```python
  from composer import Trainer

  trainer = Trainer(
      ...,
      precision='amp_fp8',
  )
  ```
API changes
- The `torchmetrics` package has been upgraded to 0.11.x.
  The `torchmetrics.Accuracy` metric now requires a `task` argument which can take on a value of `binary`, `multiclass`, or `multilabel` (see the sketch after this list). Please see the Torchmetrics Accuracy docs for details.
  Additionally, since specifying `task='multiclass'` requires an additional `num_classes` field to be specified, we've had to update `ComposerClassifier` to accept the additional `num_classes` argument. Please see PRs #2017 and #2025 for additional details.
- Surgery algorithms used in functional form return a value of `None` (#1543)
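A minimal sketch of the upgraded torchmetrics API referenced above (the `num_classes=10` value is only an illustration):

```python
from torchmetrics import Accuracy

# torchmetrics >= 0.11 requires an explicit task; 'multiclass' also needs num_classes
accuracy = Accuracy(task='multiclass', num_classes=10)
```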
Deprecations
- Deprecate HFCrossEntropy and Perplexity (#1857)
- Remove Jenkins CI (#1943, #1954)
- Change Deprecation Warnings to Warnings for specifying `ProgressBarLogger` and `ConsoleLogger` to loggers (#1846)
Bug Fixes
- Fixed an issue introduced in 0.12.1 where `HuggingFaceModel` crashes if `config.return_dict = False` (#1948)
- Refactor EMA to improve memory efficiency (#1941)
- Make wandb checkpoint logging compatible with wandb model registry (#1973)
- Fix ICL race conditions (#1978)
- Update `epoch` metric name to `trainer/epoch` (#1986)
- reset scaler (#1999)
- Bug/sync optimization logger across ranks (#1970)
- Update Docker images to resolve vulnerability scan issues (#2007)
- Fix eval duplicate logging issue (#2018)
- extend test and patch bug (#2028)
- Protect for missing slack_sdk import (#2031)
Known Issues
- Docker Image Security Vulnerability
  - CVE-2022-45907: The `mosaicml/pytorch:1.12.1*`, `mosaicml/pytorch:1.11.0*`, `mosaicml/pytorch_vision:1.12.1*`, and `mosaicml/pytorch_vision:1.11.0*` images are impacted and currently supported for legacy use cases. We recommend users upgrade to images with PyTorch >1.13. The affected images will be removed in the next Composer release.
What's Changed
- Raise error if max duration is in epochs and dataloader is infinite by @dakinggg in #1942
- Bump traitlets from 5.8.0 to 5.9.0 by @dependabot in #1946
- Deprecate HFCrossEntropy and Perplexity by @dakinggg in #1857
- Change functional surgery method return values to None by @nik-mosaic in #1543
- Retire Jenkins by @bandish-shah in #1943
- Update MCP GHA Name by @mvpatel2000 in #1951
- update memory monitor by @mvpatel2000 in #1940
- Move ffcv up in test order by @dskhudia in #1953
- Fix memory monitor test by @mvpatel2000 in #1957
- Fix model surgery failure due to functional API change by @nik-mosaic in #1949
- Change how we check for forwards args in models for HF models by @bcui19 in #1955
- add return dict false test and bug fix by @dakinggg in #1948
- remove jenkins ci by @mvpatel2000 in #1954
- add support for enc-dec batches without decoder_input_ids by @dakinggg in #1950
- Refactor EMA to improve memory efficiency by @coryMosaicML in #1941
- Add warning for untrusted checkpoints by @mvpatel2000 in #1959
- permit opt tokenizer by @bmosaicml in #1958
- GHA Docker build flow for PR's by @bandish-shah in #1883
- Update download badge link to pepy by @karan6181 in #1966
- Update python version in setup.py and fixed pypi download badge by @karan6181 in #1969
- allow eval metrics to be passed in to HuggingFaceModel directly by @dakinggg in #1971
- Make wandb checkpoint logging compatible with wandb model registry by @growlix in #1973
- Add support for FP8 on H100 using NVidia's TransformerEngine by @dskhudia in #1965
- Util for writing HuggingFace save_pretrained from a composer checkpoint by @dakinggg in #1974
- Enable sharded checkpoint save and load (support local, sharded, and full state dicts for FSDP) by @eracah in #1902
- Bump custom-inherit from 2.4.0 to 2.4.1 by @dependabot in #1981
- Bump gitpython from 3.1.30 to 3.1.31 by @dependabot in #1982
- Fix ICL race conditions by @dakinggg in #1978
- add...
v0.13.0
This release has been yanked due to a minor packaging issue, please skip directly to Composer v0.13.1
What's Changed
- Raise error if max duration is in epochs and dataloader is infinite by @dakinggg in #1942
- Bump traitlets from 5.8.0 to 5.9.0 by @dependabot in #1946
- Deprecate HFCrossEntropy and Perplexity by @dakinggg in #1857
- Change functional surgery method return values to None by @nik-mosaic in #1543
- Retire Jenkins by @bandish-shah in #1943
- Update MCP GHA Name by @mvpatel2000 in #1951
- update memory monitor by @mvpatel2000 in #1940
- Move ffcv up in test order by @dskhudia in #1953
- Fix memory monitor test by @mvpatel2000 in #1957
- Fix model surgery failure due to functional API change by @nik-mosaic in #1949
- Change how we check for forwards args in models for HF models by @bcui19 in #1955
- add return dict false test and bug fix by @dakinggg in #1948
- remove jenkins ci by @mvpatel2000 in #1954
- add support for enc-dec batches without decoder_input_ids by @dakinggg in #1950
- Refactor EMA to improve memory efficiency by @coryMosaicML in #1941
- Add warning for untrusted checkpoints by @mvpatel2000 in #1959
- permit opt tokenizer by @bmosaicml in #1958
- GHA Docker build flow for PR's by @bandish-shah in #1883
- Update download badge link to pepy by @karan6181 in #1966
- Update python version in setup.py and fixed pypi download badge by @karan6181 in #1969
- allow eval metrics to be passed in to HuggingFaceModel directly by @dakinggg in #1971
- Make wandb checkpoint logging compatible with wandb model registry by @growlix in #1973
- Add support for FP8 on H100 using NVidia's TransformerEngine by @dskhudia in #1965
- Util for writing HuggingFace save_pretrained from a composer checkpoint by @dakinggg in #1974
- Enable sharded checkpoint save and load (support local, sharded, and full state dicts for FSDP) by @eracah in #1902
- Bump custom-inherit from 2.4.0 to 2.4.1 by @dependabot in #1981
- Bump gitpython from 3.1.30 to 3.1.31 by @dependabot in #1982
- Fix ICL race conditions by @dakinggg in #1978
- add map location to huggingface utils by @dakinggg in #1980
- fix log epoch by @mvpatel2000 in #1986
- GHA release workflow, refactor PR and Daily workflows by @bandish-shah in #1968
- Remove python-version input from Daily CPU tests by @bandish-shah in #1989
- Add some logic to pass the correct github ref to mcp script by @bandish-shah in #1990
- Fix typo in docstring for eval with missing space by @mvpatel2000 in #1992
- Fix failing sharded_checkpoint tests that fail when pytorch 1.13 is not installed by @eracah in #1988
- Add merge_group event trigger to GHA daily workflow by @bandish-shah in #1996
- Runtime estimator by @mvpatel2000 in #1991
- Reset scaler state by @mvpatel2000 in #1999
- Speed monitor refactor by @mvpatel2000 in #1987
- Test hf fsdp by @dakinggg in #1972
- Bug/sync optimization logger across ranks by @bmosaicml in #1970
- Fix optimizer monitor test gating with FSDP by @mvpatel2000 in #2000
- Low precision groupnorm by @mvpatel2000 in #1976
- Bump coverage[toml] from 7.1.0 to 7.2.1 by @dependabot in #2008
- Update docs to include runtime estimator by @mvpatel2000 in #2009
- Tag surgery algorithms LPLN and LPGN by @mvpatel2000 in #2011
- Update SpeedMonitor short-description for docs table by @mvpatel2000 in #2010
- Update Low Precision LayerNorm arguments by @nik-mosaic in #1994
- Medical Segmentation Example Typo by @mvpatel2000 in #2014
- Update wallclock logging to default hours by @mvpatel2000 in #2005
- Add HealthChecker Callback by @hanlint in #2002
- Allow FX graph mode post-training dynamic quantisation of BlurConv2d operations. by @BrettRyland in #1995
- Add multi-gpu testing to test_algorithm_resumption by @eracah in #2016
- Add backwards compatible checkpoint loading for EMA by @coryMosaicML in #2012
- fsdp with custom process groups by @vchiley in #2006
- Patch Speed Monitor MFU by @mvpatel2000 in #2013
- Remove runtime estimator state dict by @mvpatel2000 in #2015
- Update Docker images to fix resolve vulnerability scan issues by @bandish-shah in #2007
- Change Deprecation Warnings to Warnings for specifying ProgressBarLogger and ConsoleLogger to loggers by @eracah in #1846
- Fix eval duplicate logging issue by @mvpatel2000 in #2018
- Add workflow_dispatch trigger to pr-docker workflow by @bandish-shah in #2019
- Bump streaming version to less than 0.4.0 by @karan6181 in #2020
- Upgrade ipython installed in Docker images by @bandish-shah in #2021
- Upgrade torchmetrics by @nik-mosaic in #2017
- Complete upgrade of torchmetrics accuracy by @nik-mosaic in #2025
- Bump version to v0.13.0 by @bandish-shah in #2024
New Contributors
- @BrettRyland made their first contribution in #1995
Full Changelog: v0.12.1...v0.13.0
v0.12.1
🚀 Composer v0.12.1
Composer v0.12.1 is released! Install via `pip`:
pip install --upgrade mosaicml==0.12.1
New Features
- 📚 In-Context Learning (#1876)
  With Composer and MosaicML Cloud you can now evaluate LLMs on in-context learning tasks (LAMBADA, HellaSwag, PIQA, and more) hundreds of times faster than other evaluation harnesses. Please see our "Blazingly Fast LLM Evaluation for In-Context Learning" blog post for more details!
- 💾 Added support for Coreweave Object Storage (#1915)
  Coreweave object store is compatible with `boto3`. Uploading objects to Coreweave object store works almost exactly like writing to S3, except an `endpoint_url` must be set via the `S3_ENDPOINT_URL` environment variable. For example:

  ```python
  import os
  os.environ['S3_ENDPOINT_URL'] = 'https://object.las1.coreweave.com'

  from composer.trainer import Trainer

  # Save checkpoints every epoch to s3://my_bucket/checkpoints
  trainer = Trainer(
      model=model,
      train_dataloader=train_dataloader,
      max_duration='10ep',
      save_folder='s3://my_bucket/checkpoints',
      save_interval='1ep',
      save_overwrite=True,
      save_filename='ep{epoch}.pt',
      save_num_checkpoints_to_keep=0,  # delete all checkpoints locally
  )

  trainer.fit()
  ```
Please see our checkpointing documentation for more details.
- 🪵 Automatic logging of Trainer hparams (#1855)
  Hyperparameter arguments passed to the `Trainer` are now automatically logged. Simply set the `Trainer` argument `auto_log_hparams=True`.
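  A minimal sketch, assuming only the `auto_log_hparams` argument named above:

  ```python
  from composer import Trainer

  # Hyperparameters passed to the Trainer are logged automatically
  trainer = Trainer(
      ...,
      auto_log_hparams=True,
  )
  ```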
Bug Fixes
- Update Docker images to use ‘posix_prefix’ paths (#1854)
- Disable new notebook in CI (#1875)
- [Fix] Enable logging of metrics from Callbacks to ConsoleLogging (#1884)
- Ensure loggers run init event before callbacks in Engine (#1890)
- Raise an error in FSDP meta tensor initialization if there's no initialization functions, fix associated flaky FSDP test (#1905)
- Add primitive list support (#1906)
- Add logic for shifting labels before computing metrics (#1913)
- Fixes mis specified dependency (#1919)
- pin setuptools in build requirements (#1926)
- Pin pip<23 in Docker images (#1936)
- Fix bug in trainer.eval and add test cases for test_console_logger (#1937)
What's Changed
- Rename GradMonitor -> OptimizerMonitor; add functionality to log optimizer-specific metrics to assist loss spike investigation by @bmosaicml in #1743
- Add GCS uri support for loading and saving checkpoints by @eracah in #1833
- HF factory function tests by @dakinggg in #1832
- Fix doc issue, Trainer hparam log_to_console defaults to False by @eracah in #1840
- Removed YAHP references from Docs by @bandish-shah in #1841
- Typo by @nguyenhoan1988 in #1843
- Fix source code links in docs by @bandish-shah in #1844
- add importorskip by @dakinggg in #1847
- Update Docker images to use ‘posix_prefix’ paths by @mvpatel2000 in #1854
- Fix typo by @StandardAI in #1849
- ConsoleLogger: log first batch and first epoch when using console_log_interval by @eracah in #1860
- Simpler auto log hparams by @eracah in #1855
- Fix typos by @cclauss in #1850
- Bump sphinxext-opengraph from 0.7.3 to 0.7.4 by @dependabot in #1851
- Bump coverage[toml] from 6.5.0 to 7.0.1 by @dependabot in #1853
- Bump traitlets from 5.7.0 to 5.8.0 by @dependabot in #1852
- Bump ipython from 7.32.0 to 8.8.0 by @dependabot in #1865
- Update monai requirement from <0.10,>=0.9.1 to >=0.9.1,<1.2 by @dependabot in #1869
- Bump sphinxcontrib-katex from 0.9.3 to 0.9.4 by @dependabot in #1868
- Bump coverage[toml] from 7.0.1 to 7.0.4 by @dependabot in #1867
- Upgrade docker images to `torch==1.13.1` by @abhi-mosaic in #1863
- add more useful info to state by @dakinggg in #1848
- Feature/lambada evaluator by @bmosaicml in #1845
- multi-node distributed training, submitit & composer integration demo by @YilunKuang in #1753
- Daily tests by @mvpatel2000 in #1870
- Disable new notebook in CI by @mvpatel2000 in #1875
- Update deepspeed by @mvpatel2000 in #1864
- fix fail fast in daily by @mvpatel2000 in #1880
- Fix getting started docs by @mvpatel2000 in #1878
- Speed up test_lm_task_evaluation by @mvpatel2000 in #1879
- Fix unprotected import by @mvpatel2000 in #1874
- add ignore_modules to fsdp by @vchiley in #1877
- Change vision image by @mvpatel2000 in #1881
- Fix eval_forward in the ComposerModel ABC by @eracah in #1871
- Fix fsdp weight tying by @bcui19 in #1856
- Bump pytest from 7.2.0 to 7.2.1 by @dependabot in #1886
- Bump ipykernel from 6.19.2 to 6.20.1 by @dependabot in #1887
- Bump gitpython from 3.1.28 to 3.1.30 by @dependabot in #1888
- Update Vision Image in Pytest by @mvpatel2000 in #1882
- Streaming data tests by @dakinggg in #1842
- Add NLP Algorithms Tests by @nik-mosaic in #1839
- rename HF notebook by @dakinggg in #1873
- Ensure loggers run init event before callbacks in Engine by @eracah in #1890
- [Fix] Enable logging of metrics from Callbacks to ConsoleLogging by @eracah in #1884
- Updating how we load metrics in a state_dict so we don't add extra memory overhead by @bcui19 in #1892
- Getting daily tests passing by @dakinggg in #1893
- Bump nbsphinx from 0.8.10 to 0.8.12 by @dependabot in #1897
- Fix docker image by @mvpatel2000 in #1894
- Add primitive list support by @mvpatel2000 in #1906
- Raise an error in FSDP `meta` tensor initialization if there's no initialization functions, fix associated flaky FSDP test by @bcui19 in #1905
- Gpu Test by @mvpatel2000 in #1907
- Update docker with FFCV fix by @mvpatel2000 in #1908
- Restore GPU tests by @mvpatel2000 in #1909
- Update workflow names by @mvpatel2000 in #1910
- Enable daily gpu tests by @mvpatel2000 in #1911
- Tweak daily GPU tests by @mvpatel2000 in #1912
- Daily GPU Tests -- Change to Git Commit by @mvpatel2000 in #1914
- Add logic for shifting labels before computing metrics by @alextrott16 in #1913
- Add coreweave object store support. by @eracah in #1915
- Fixes mis specified dependency by @dakinggg in #1919
- Bump coverage[toml] from 7.0.4 to 7.1.0 by @dependabot in #1923
- Update importlib-metadata requirement from <6,>=5.0.0 to >=5.0.0,<7 by @dependabot in #1921
- pin setuptools in build requirements by @dakinggg in #1926
- Remove synthetic testing infrastructure for HF/NLP by @dakinggg in #1895
- Add upgrade flags to pip installs by @dakinggg in #1916
- Temporarily pin pip to <23 by @dakinggg in #1930
- add link protection by @mvpatel2000 in #1927
- Cleaning up error checking for FSDP sharding strategies with fp32 precision by @bcui19 in #1925
- Fix mcp script to avoid follow by @mvpatel2000 in #1932
- Emit Eval progress in console logging by @eracah in #1917
- Remove Fused LayerNorm deprecation by @nik-mosaic in https://github.com/mosaicml/comp...
v0.12.0
🚀 Composer v0.12.0
Composer v0.12.0 is released! Install via `pip`:
pip install mosaicml==0.12.0
New Features
- 🪵 Logging and ObjectStore Enhancements
  There are multiple improvements to our logging and object store support in this release.
  - Image visualization using our `CometMLLogger` (#1710)
    We've added support for using our `ImageVisualizer` callback with CometML to log images and segmentation masks to CometML.

    ```python
    from composer.callbacks import ImageVisualizer
    from composer.loggers import CometMLLogger
    from composer.trainer import Trainer

    trainer = Trainer(...,
        callbacks=[ImageVisualizer()],
        loggers=[CometMLLogger()],
    )
    ```
  - Added direct support for Oracle Cloud Infrastructure (OCI) as an `ObjectStore` (#1774) and support for Google Cloud Storage (GCS) via URI (#1833)
    To use, you can simply set your `save_folder` or `load_path` to a URI beginning with `oci://` or `gs://`, to save and load with OCI and GCS respectively.

    ```python
    from composer.trainer import Trainer

    # Checkpoint saving to Google Cloud Storage.
    trainer = Trainer(
        model=model,
        save_folder="gs://my-bucket/{run_name}/checkpoints",
        run_name='my-run',
        save_interval="1ep",
        save_filename="ep{epoch}.pt",
        save_num_checkpoints_to_keep=0,  # delete all checkpoints locally
        ...
    )

    trainer.fit()
    ```
  - Added basic support for logging with MLFlow (#1795)
    We've added basic support for using MLFlow to log experiment metrics.

    ```python
    from composer.loggers import MLFlowLogger
    from composer.trainer import Trainer

    mlflow_logger = MLFlowLogger(experiment_name=mlflow_exp_name,
                                 run_name=mlflow_run_name,
                                 tracking_uri=mlflow_uri)

    trainer = Trainer(..., loggers=[mlflow_logger])
    ```
  - Simplified console and progress bar logging (#1694)
    To turn off the progress bar, set `progress_bar=False`. To turn on logging directly to the console, set `log_to_console=True`. To control the frequency of logging to console, set `console_log_interval` (e.g. to `1ep` or `1ba`).
  - Our `get_file` utility now supports URIs directly (`s3://`, `oci://`, and `gs://`) for downloading files. (See the sketch after this list.)
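  A minimal sketch of the console-logging arguments and the URI-aware `get_file` utility described above (the specific interval, bucket, and file names are illustrative assumptions):

  ```python
  from composer import Trainer
  from composer.utils import get_file

  # Turn off the progress bar and log to the console once per batch
  trainer = Trainer(
      ...,
      progress_bar=False,
      log_to_console=True,
      console_log_interval='1ba',
  )

  # Download a file directly from a URI (bucket and object names are illustrative)
  get_file(path='s3://my-bucket/checkpoints/ep1.pt', destination='ep1.pt')
  ```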
- 🏃♀️ Support for Mid-Epoch Resumption with the latest release of Streaming
  We've added support in Composer for the latest release of our Streaming library. This includes awesome new features like instant mid-epoch resumption and deterministic shuffling, regardless of the number of nodes. See the Streaming release notes for more!
- 🚨 New algorithm - GyroDropout!
  Thanks to @jelite for adding a new algorithm, GyroDropout, to Composer! Please see the method card for more details.
- 🤗 HuggingFace + Composer improvements
  We've added a new utility to load a 🤗 HuggingFace model and tokenizer out of a Composer checkpoint (#1754), making the pretraining -> finetuning workflow even easier in Composer. Check out the docs for more details, and our example notebook for a full tutorial (#1775)!
- 🎓 GradMonitor -> OptimizerMonitor
  Renames our `GradMonitor` callback to `OptimizerMonitor`, and adds the ability to track optimizer specific metrics. Check out the docs for more details, and add to your code just like any other callback!

  ```python
  from composer.callbacks import OptimizerMonitor
  from composer.trainer import Trainer

  trainer = Trainer(
      ...,
      callbacks=[OptimizerMonitor(log_optimizer_metrics=log_optimizer_metrics)]
  )
  ```
- 🐳 New PyTorch and CUDA versions
  We've expanded our library of Docker images with support for PyTorch 1.13 + CUDA 11.7:
  - `mosaicml/pytorch:1.13.0_cu117-python3.10-ubuntu20.04`
  - `mosaicml/pytorch:1.13.0_cpu-python3.10-ubuntu20.04`

  The `mosaicml/pytorch:latest`, `mosaicml/pytorch:cpu_latest`, and `mosaicml/composer:0.12.0` tags are now built from PyTorch 1.13 based images. Please see our DockerHub repository for additional details.
API changes
- Replace `grad_accum` with `device_train_microbatch_size` (#1749, #1776)
  We're deprecating the `grad_accum` Trainer argument in favor of the more intuitive `device_train_microbatch_size`. Instead of thinking about how to divide your specified minibatch into microbatches, simply specify the size of your microbatch. For example, let's say you want to split your minibatch of 2048 into two microbatches of 1024:

  ```python
  from composer import Trainer

  trainer = Trainer(
      ...,
      device_train_microbatch_size=1024,
  )
  ```

  If you want Composer to tune the microbatch for you automatically, enable automatic microbatching as follows:

  ```python
  from composer import Trainer

  trainer = Trainer(
      ...,
      device_train_microbatch_size='auto',
  )
  ```

  The `grad_accum` argument is still supported but will be deprecated in the next Composer release.
- Renamed precisions (#1761)
  We've renamed precision attributes for clarity. The following values have been removed: `['amp', 'fp16', 'bf16']`.
  We have added the following values, prefixed with 'amp' to clarify when an Automatic Mixed Precision type is being used: `['amp_fp16', 'amp_bf16']`.
  The `fp32` precision value remains unchanged (see the sketch below).
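  A minimal sketch of selecting one of the renamed precision values (only the names listed above are assumed):

  ```python
  from composer import Trainer

  # 'amp', 'fp16', and 'bf16' are replaced by the explicit AMP names
  trainer = Trainer(
      ...,
      precision='amp_bf16',  # or 'amp_fp16'; 'fp32' is unchanged
  )
  ```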
Deprecations
- Removed support for YAHP (#1512)
- Removed COCO and SSD datasets (#1717)
- Fully removed Streaming v1 support, please see the mosaicml/streaming project for our next-gen streaming datasets (#1787)
- Deprecated `FusedLayerNorm` algorithm (#1789)
- Fully removed `grad_clip_norm` training argument, please use the `GradientClipping` algorithm instead (#1768)
- Removed `data_fit`, `data_epoch`, and `data_batch` from `Logger` (#1826)
Bug Fixes
- Fix FSDP checkpoint strategy (#1734)
- Fix gradient clipping with FSDP (#1740)
- Adds more supported FSDP config flags (`sync_module_states`, `forward_prefetch`, `limit_all_gathers`) (#1794)
- Allow `FULL` precision with FSDP (#1796)
- Fix `eval_microbatch` modification on `EVAL_BEFORE_FORWARD` event (#1739)
- Fix algorithm API backwards compatibility in checkpoints (#1741)
- Fixes a bad `None` check preventing setting `device_id` to `0` (#1767)
- Unregister engine to make cleaning up memory easier (#1769)
- Fix issue if `metric_names` is not a list (#1798)
- Match implementation for list and tensor batch splitting (#1804)
- Fixes infinite eval issue (#1815)
What's Changed
- Update installation constraints for streaming by @karan6181 in #1661
- Update decoupled_weight_decay.md by @jacobfulano in #1672
- Notebooks part 2 by @dakinggg in #1659
- Add trainer arg for engine passes by @mvpatel2000 in #1673
- Autoload algorithms by @mvpatel2000 in #1658
- Faster metrics calculations + Fix warnings added by the new version of torchmetrics by @dskhudia in #1674
- Update coolname requirement from <2,>=1.1.0 to >=1.1.0,<3 by @dependabot in #1666
- Bump ipykernel...
v0.11.1
🚀 Composer v0.11.1
Composer v0.11.1 is released! Install via `pip`:
pip install --upgrade mosaicml==0.11.1
Bug Fixes
- Fixes for Notebooks (#1659)
- Documentation updates and fixes (#1685, #1696, #1702, #1709)
- Addressed warnings and speed improvements for Torchmetrics (#1674)
- Fixes to Gated Linear Units method (#1575, #1689)
- Set `NCCL_ASYNC_ERROR_HANDLING` ENV variable in Composer launcher to enable distributed timeout (#1695)
- Fix epoch count when `eval` is called before `fit` (#1697)
- Constrain PyTorch package versions to avoid unintended upgrades (#1688)
- Fix Optimizer state sharding issue with FSDP (#1732)
- Raise `ValueError` if an evaluation dataloader of infinite length is specified
Full Changelog: v0.11.0...v0.11.1
v0.11.0
🚀 Composer v0.11.0
Composer v0.11.0 is released! Install via `pip`:
pip install --upgrade mosaicml==0.11.0
New Features
- 🧰 FSDP Beta Support
  Composer now supports PyTorch FSDP! PyTorch FSDP is a strategy for distributed training, similar to PyTorch DDP, that distributes work using data-parallelism only. On top of this, FSDP uses model, gradient, and optimizer sharding to dramatically reduce device memory requirements, and enables users to easily scale and train large models.
  Here's how easy it is to use FSDP with Composer:

  ```python
  import torch.nn as nn
  from composer import Trainer

  class Block(nn.Module):
      ...

  # Your custom model
  class Model(nn.Module):
      def __init__(self, n_layers):
          super().__init__()
          self.blocks = nn.ModuleList([
              Block(...) for _ in range(n_layers)
          ]),
          self.head = nn.Linear(...)

      def forward(self, inputs):
          ...

      # FSDP Wrap Function
      def fsdp_wrap_fn(self, module):
          return isinstance(module, Block)

      # Activation Checkpointing Function
      def activation_checkpointing_fn(self, module):
          return isinstance(module, Block)

  # ComposerModel wrapper, used by the Trainer
  # to compute loss, metrics, etc.
  class MyComposerModel(ComposerModel):
      def __init__(self, n_layers):
          super().__init__()
          self.model = Model(n_layers)
          ...

      def forward(self, batch):
          ...

      def eval_forward(self, batch, outputs=None):
          ...

      def loss(self, outputs, batch):
          ...

  # Pass your ComposerModel and fsdp_config into the Trainer
  composer_model = MyComposerModel(n_layers=3)
  fsdp_config = {
      'sharding_strategy': 'FULL_SHARD',
      'min_params': 1e8,
      'cpu_offload': False,  # Not supported yet
      'mixed_precision': 'DEFAULT',
      'backward_prefetch': 'BACKWARD_POST',
      'activation_checkpointing': False,
      'activation_cpu_offload': False,
      'verbose': True
  }

  trainer = Trainer(
      model=composer_model,
      fsdp_config=fsdp_config,
      ...
  )

  trainer.fit()
  ```
For more information, please see our FSDP docs.
- 🚰 Streaming v0.1
  We've spun off Streaming datasets into its own repository! Streaming datasets is a high-performance drop-in for Torch `IterableDataset`, enabling users to stream training data from cloud-based object stores. Streaming is shipping with built-in support for popular open source datasets (ADE20K, C4, COCO, Enwiki, ImageNet, etc.)
  To get started, install the Streaming PyPI package:
  pip install mosaicml-streaming
  You can use the streaming Dataset class with the PyTorch native DataLoader class as follows:

  ```python
  import torch
  from streaming import Dataset

  dataloader = torch.utils.data.DataLoader(dataset=Dataset(remote='s3://...'))
  ```
For more information, please check out the Streaming docs.
- ✔👉 Simplified Checkpointing Interface
  With this release we’ve greatly simplified configuration of loading and saving checkpoints in Composer.
  To save checkpoints to S3, all you need to do is:
  - Specify with `save_folder` your full URI to your save directory destination (e.g. `'s3://my-bucket/{run_name}/checkpoints'`)
  - Optionally, set `save_filename` to the pattern you want for your checkpoint file names

  ```python
  from composer.trainer import Trainer

  # Checkpoint saving to S3.
  trainer = Trainer(
      model=model,
      save_folder="s3://my-bucket/{run_name}/checkpoints",
      run_name='my-run',
      save_interval="1ep",
      save_filename="ep{epoch}.pt",
      save_num_checkpoints_to_keep=0,  # delete all checkpoints locally
      ...
  )

  trainer.fit()
  ```

  Likewise, to load checkpoints from S3, all you have to do is:
  - Set `load_path` to the full URI to your desired checkpoint file (e.g. `'s3://my-bucket/my-run/checkpoints/epoch13.pt'`)

  ```python
  from composer.trainer import Trainer

  # Checkpoint loading from S3.
  new_trainer = Trainer(
      model=model,
      train_dataloader=train_dataloader,
      max_duration="10ep",
      load_path="s3://my-bucket/my-run/checkpoints/ep13.pt",
  )

  new_trainer.fit()
  ```

  For more information, please see our Checkpointing guide.
- 𐄳 Improved Distributed Experience
  We’ve made it easier to write your own custom distributed entry points by exposing our distributed API. You can now leverage all of our helpful distributed functions and contexts.
  For example, let's say we need to download a dataset in a distributed training application. To avoid race conditions where different ranks try to write the dataset to the same place, we need to ensure that only rank 0 downloads the dataset first:

  ```python
  import datetime

  from composer.trainer.devices import DeviceGPU
  from composer.utils import dist

  dist.initialize(DeviceGPU(), datetime.timedelta(seconds=30))  # Initialize distributed module

  if dist.get_local_rank() == 0:  # Download dataset on rank zero
      dataset = download_my_dataset()

  dist.barrier()  # All ranks wait until dataset is downloaded

  # Create and train your model!
  ```
For more information, please check out our Distributed API docs.
Bug Fixes
- fix loss and eval_forward for HF models (#1597)
- add more robust casting to int for fsdp min_params (#1608)
- Deepspeed Docs Typo (#1605)
- Fix mmdet typo (#1618)
- Blurpool idempotent (#1625)
- When model is not on `meta` device, initialization should occur on compute device not CPU (#1623)
- Auto resumption (#1615)
- Adjust speed monitor (#1645)
- Hot fix console logging (#1643)
- Lazy Logging + pretty print dict for hparams (#1653)
- Fix many failing notebook tests (#1646)
What's Changed
- Bump coverage[toml] from 6.4.4 to 6.5.0 by @dependabot in #1583
- Bump furo from 2022.9.15 to 2022.9.29 by @dependabot in #1584
- Add English Wikipedia 2020-01-01 dataset by @knighton in #1572
- Add pull request template by @dakinggg in #1588
- Bump ipykernel from 6.15.3 to 6.16.0 by @dependabot in #1587
- Update importlib-metadata requirement from <5,>=4.11.0 to >=5.0,<6 by @dependabot in #1585
- Bump sphinx-argparse from 0.3.1 to 0.3.2 by @dependabot in #1586
- Add step explicitly to ImageVisualizer logging calls by @dakinggg in #1591
- Image viz test by @dakinggg in #1592
- Remove unused fixture by @mvpatel2000 in #1594
- Fixes RandAugment API by @mvpatel2000 in #1596
- fix loss and eval_forward for HF models by @dskhudia in #1597
- Remove tensorflow-io from setup.py by @eracah in #1577
- Fixes enwiki for the newly processed wiki dataset by @dskhudia in #1600
- Change install to all by @mvpatel2000 in #1599
- Remove log level and should_log_artifact by @dakinggg in #1603
- Add more robust casting to int for fsdp min_params by @dblalock in #1608
- Deepspeed Docs Typo by @mvpatel2000 in #1605
- Object store logger refactor by @dakinggg in #1601
- Bump gitpython from 3.1.27 to 3.1.28 by @dependabot in #1609
- Bump tabulate from 0.8.10 to 0.9.0 by @dependabot in #1610
- Log the number of GPUs and nodes Composer running on. by @eracah in #1604
- Update MLPerfCallback for v2.1 by @hanlint in #1607
- Remove object store cls by @dakinggg in #1606
- Add LAMB Optimizer by @hanlint in #1613
- Mmdet adapter by @A-Jacobson in #1545
- Fix mmdet typo by @Landanjs in #1618
- update torchmetrics requirement by @hanlint in #1620
- Add distributed sampler error by @mvpatel2000 in #1598
- Landan/deeplabv3 ade20k example by @Landanjs in #1593
- Upgrade CodeQL Action to version 2 by @karan6181 in #1628
- Blurpool idempotent by @mvpatel2000 in #1625
- Defaulting streaming dataset version to 2 by @karan6181 in #1616
...
v0.10.1
🚀 Composer v0.10.1
Composer v0.10.1 is released! Install via `pip`:
pip install --upgrade mosaicml==0.10.1
New Features
- 𐄷 Weight Standardization
  Weight Standardization reparametrizes convolutional weights such that the fan-in dimensions have zero mean and unit standard deviation. This could slightly improve performance at the expense of 5% lower throughput. This has been used in several papers to train with smaller batch sizes, with normalization layers besides batch norm, and for transfer learning.
  Using Weight Standardization with the Composer Trainer:

  ```python
  import composer

  # Apply Weight Standardization (when training is initialized)
  weight_std = composer.algorithms.WeightStandardization()

  # Train with Weight Standardization
  trainer = composer.trainer.Trainer(
      ...,
      algorithms=[weight_std]
  )
  trainer.fit()
  ```

  Using Weight Standardization with the Composer functional interface:

  ```python
  import composer
  from torchvision.models import resnet50

  my_model = resnet50()

  # Apply weight standardization to model
  my_model = composer.functional.weight_standardization(my_model)
  ```
Please see the Weight Standardization Method Card for more details.
Bug Fixes
- Fix for checkpoints not being saved automatically at the end of a run (#1552)
- Fix Onnx export for Composer HuggingFaceModels (#1557)
- Fix for MIoU metric producing NaN's (#1558)
- CometML logger documentation updates and fixes (#1567, #1570, #1571)
- WandB image visualizer fix (#1591)
What's Changed
- Update evaluate_periodically() when eval interval is of type Duration by @karan6181 in #1523
- Quality of life updates to EMA by @coryMosaicML in #1524
- Add ADE20K and COCO v2 dataset behind a version flag by @karan6181 in #1528
- Pinned setuptools version to fix distutils version error by @karan6181 in #1536
- Less strict name formatting by @hanlint in #1535
- Defaulting streaming dataset version to 1 and add a deprecation warning by @karan6181 in #1532
- Changing 'stable' to 'latest' in notebooks in examples by @bcui19 in #1534
- Bump furo from 2022.6.21 to 2022.9.15 by @dependabot in #1540
- Bump fasteners from 0.17.3 to 0.18 by @dependabot in #1538
- Add Pandoc to Docker images, bump version to 2.19.2 by @bandish-shah in #1550
- Removed streaming version 2 from yaml since version 1 is default by @karan6181 in #1551
- Bump ipykernel from 6.15.2 to 6.15.3 by @dependabot in #1548
- Bump yamllint from 1.27.1 to 1.28.0 by @dependabot in #1546
- Bump traitlets from 5.3.0 to 5.4.0 by @dependabot in #1539
- Object Store Logger Race Condition + EMA Fix by @mvpatel2000 in #1552
- Adding in erroring for when using GradMonitor and DeepSpeed by @bcui19 in #1555
- Bump pypandoc from 1.8.1 to 1.9 by @dependabot in #1559
- Update context to raise errror by @mvpatel2000 in #1561
- Fix MIoU metric when `self.total_union==0` by @abhi-mosaic in #1558
- Move dataloader `initialize_object` to factory methods by @hanlint in #1510
- Weight Standardization method by @Landanjs in #1562
- Update comet links to include query params and point to main site by @dakinggg in #1567
- remove dead line in alibi by @mvpatel2000 in #1568
- GLU Fixes by @mvpatel2000 in #1564
- Add FSDP strategy by @abhi-mosaic in #1553
- Comet example by @dakinggg in #1570
- Add missing _enabled flag, post_close, and clean up comet ml tests by @dakinggg in #1571
- Consistent Method Card Style by @growlix in #1407
- add missing return in context by @mvpatel2000 in #1574
- Remove eval batch split by @mvpatel2000 in #1576
- Fix Onnx Export for Composer HuggingFaceModels by @nik-mosaic in #1557
- Revert checkpoint rename by @hanlint in #1579
New Contributors
Full Changelog: v0.10.0...v0.10.1
v0.10.0
🚀 Composer v0.10.0
Composer v0.10.0 is out! This latest release adds support for CometML Experiment tracking, automatic selection of evaluation batch size, API enhancements for Evaluation/Logging/Metrics and a preview of our new streaming datasets repository!
pip install --upgrade mosaicml==0.10.0
New Features
- ☄️ Comet Experiment Tracking (#1490)
  We've added support for the popular Comet experiment tracker! To enable, simply create the logger and pass it to the `Trainer` object at initialization:

  ```python
  from composer import Trainer
  from composer.loggers import CometMLLogger

  cometml_logger = CometMLLogger()

  trainer = Trainer(
      ...
      loggers=[cometml_logger],
  )
  ```
Please see our Logging and CometMLLogger docs pages for details on usage.
- 🪄 Automatic Evaluation Batch Size Selection (#1417)
  Composer now supports `eval_batch_size='auto'`, which will choose the right evaluation batch size to avoid CUDA OOMs! Now, in conjunction with `grad_accum='auto'`, you can run the same code on any hardware with no changes necessary. This makes it easy to add evaluation to a training script without having to pick and choose the right batch sizes to avoid CUDA OOMs.
- 🎯 Evaluation API Changes (#1479)
  The Evaluation API has been updated to be consistent with the Trainer API. If the `eval_dataloader` was provided to the Trainer during initialization, `eval` can be invoked without needing to provide anything additional:

  ```python
  trainer = Trainer(
      eval_dataloader=...
  )
  trainer.eval()
  ```

  Alternatively, the `eval_dataloader` can be passed directly to the `eval()` method:

  ```python
  trainer = Trainer(
      ...
  )
  trainer.eval(
      eval_dataloader=...
  )
  ```

  The `eval_dataloader` can be a PyTorch dataloader, or for multiple metrics, a list of `Evaluator` objects.
- 🪵 Simplified Logging (#1416)
  We've significantly simplified our internal logging interface:
  - Removed the use of `LogLevel` throughout the logging, which was a mostly unused feature. Filtering logs is the responsibility of the logger.
  - For better compatibility with external logging interfaces such as CometML or Weights & Biases, loggers now support the following methods: `log_metrics`, `log_hyperparameters`, and `log_artifacts`. Previous calls to `data_fit`, `data_epoch`, etc. have been removed.
- 🎯 validate --> eval_forward (#1411, #1419)
  Previously, `ComposerModel` implemented the `validate(batch: Any) -> Tuple[Any, Any]` method which returns an `(input, target)` tuple, and the Trainer handles updating the metrics. In `v0.10`, we return the metrics updating control to the user.
  Now, models instead implement `def eval_forward(batch: Any)` which returns the outputs of evaluation, and also `def update_metric(batch, outputs, metric)` which updates the metric.
  An example implementation for classification can be found in our `ComposerClassifier` base class:

  ```python
  def update_metric(self, batch: Any, outputs: Any, metric: Metric) -> None:
      _, targets = batch
      metric.update(outputs, targets)

  def eval_forward(self, batch: Any, outputs: Optional[Any] = None) -> Any:
      return outputs if outputs is not None else self.forward(batch)
  ```

- 🕵️‍♀️ Evaluator changes
  The `Evaluator` class now stores evaluation metric names instead of metric instances. For example:

  ```python
  glue_mrpc_task = Evaluator(
      label='glue_mrpc',
      dataloader=mrpc_dataloader,
      metric_names=['BinaryF1Score', 'Accuracy']
  )
  ```

  These metric names are matched against the metrics returned by the `ComposerModel`. The metric instances are now stored as deep copies in the `State` class as `state.train_metrics` or `state.eval_metrics`.
- 🚧 Streaming Datasets Repository Preview
  We're in the process of splitting out streaming datasets into its own repository! Streaming datasets is a high-performance drop-in replacement for Torch `IterableDataset` objects and enables you to stream your training data from cloud-based object stores. For an early preview, please check out the Streaming repo.
- ❌ YAHP deprecation
  We are deprecating support for yahp, our hyperparameter configuration tool. Support for this will be removed in the following minor version release of Composer. We recommend users migrate to OmegaConf or Hydra as tools.
Bug Fixes
- Documentation fixes (#1408, #1422, #1425, #1413, #1432, #1403, #1426, #1396, #1446, #1466, #1443)
- Upgrade WandB version (#1440)
- fix import (#1442)
- fix wrong extra deps group (#1449)
- wandb bug fix (#1488)
- Reset train metrics every batch (#1496)
- fix auto grad accum (#1515)
- Fix compression file remote download exception handling (#1526)
- Add Pandoc to Docker images, bump version to 2.19.2 (#1550)
What's Changed
- current metrics docs by @A-Jacobson in #1402
- merge nlp+hf notebooks by @A-Jacobson in #1406
- Add break epoch exception by @mvpatel2000 in #1415
- Upgrade to torch 1.12.1 by @abhi-mosaic in #1409
- Metrics refactor pt1 by @ishanashastri in #1411
- Use state algos by @mvpatel2000 in #1412
- Add default ignore index by @moinnadeem in #1421
- Update default hparams for ResNet model card by @abhi-mosaic in #1423
- update colout link in custom speedup notebook by @A-Jacobson in #1408
- Clean up prose in key files by @dblalock in #1422
- Relax codeowners by @bandish-shah in #1424
- Fix typo by @Landanjs in #1425
- Fix pre-commit checks failing on fresh checkout of dev by @dblalock in #1414
- Have docs use preferred import paths, not longest import paths by @dblalock in #1413
- Fix missing indent by @Landanjs in #1432
- eval_batch_size=auto by @mvpatel2000 in #1417
- Simplify helper for conflicting files by @hanlint in #1427
- add install from dev instructions by @A-Jacobson in #1403
- Style/tone consistency update for tutorial notebooks by @alextrott16 in #1426
- Dynamic quantization + minor improvements in inference APIs by @dskhudia in #1433
- Upgrade WandB version by @moinnadeem in #1440
- Log multiple losses by @Landanjs in #1375
- Fix attribute by @mvpatel2000 in #1442
- Expand evaluation doc by @alextrott16 in #1396
- Metrics Refactor Part 2 by @ishanashastri in #1419
- Create dependabot.yml by @mvpatel2000 in #1448
- Methods overview fix by @growlix in #1446
- Bump custom-inherit from 2.3.2 to 2.4.0 by @dependabot in #1451
- Bump junitparser from 2.4.3 to 2.8.0 by @dependabot in #1453
- Update moto[s3] requirement from <3.2,>=3.1.12 to >=4.0.1,<5 by @dependabot in #1450
- Update monai requirement from <0.9,>=0.8.0 to >=0.9.0,<0.10 by @dependabot in #1452
- Update torch-optimizer requirement from <0.2,>=0.1.0 to >=0.3.0,<0.4 by @dependabot in #1454
- Bump cryptography from 37.0.2 to 37.0.4 by @dependabot in #1457
- Bump sphinxext-opengraph from 0.6.1 to 0.6.3 by @dependabot in #1458
- Bump coverage[toml] from 6.3.2 to 6.4.4 by @dependabot in #1460
- Bump nbsphinx from 0.8.8 to 0.8.9 by @dependabot in #1459
- Fix incorrect deps group in `streaming` requirement by @hanlint in #1449
- Logger Destination Refactor by @eracah in #1416
- Bump sphinx-markdown-tables from 0.0.15 to 0.0.17 by @dependabot in #1463
- Bump traitlets from 5.1.1 to 5.3.0 by @dependabot in #1462
- Bump vit-pytorch from 0.27 to 0.35.8 by @dependabot in #1465
- Bump furo from 2022.3.4 to 2022.6.21 by @dependabot in #1467
- Bump ipykernel from 6.9.2 to 6.15.1 by @dependabot in #1470
- Bump pytest from 7.1.0 to 7.1.2 by @dependabot in #1469
- Bump sph...
v0.9.0
🚀 Composer v0.9.0
Excited to share the release of Composer v0.9.0, which comes with an Inference Export API, beta support for Apple Silicon and TPU training, as well as expanded usability of NLP-related speed-up methods. This release includes 175 commits from 34 contributors, including 10 new contributors 🙌 !
pip install --upgrade mosaicml==0.9.0
Alternatively, install Composer with Conda:
conda install -c mosaicml mosaicml=0.9.0
New Features
- 📦 Export for inference APIs
  Train with Composer and deploy anywhere! We have added a dedicated export API as well as an export training callback to allow you to export Composer-trained models for inference, supporting popular formats such as torchscript and ONNX.
  For example, here’s how to export a model in torchscript format:

  ```python
  from composer.utils import export_for_inference

  # Invoking export with a trained model
  export_for_inference(model=model,
                       save_format='torchscript',
                       save_path=model_save_path)
  ```

  Here’s an example of using the training callback, which automatically exports the model at the end of training to ONNX format:

  ```python
  from composer.callbacks import ExportForInferenceCallback

  # Initializing Trainer with the export callback
  callback = ExportForInferenceCallback(save_format='onnx',
                                        save_path=model_save_path)
  trainer = Trainer(model=model,
                    callbacks=callback,
                    train_dataloader=dataloader,
                    max_duration='10ep')

  # Model will be exported at the end of training
  trainer.fit()
  ```
Please see our Exporting for Inference notebook for more information.
- 📈 ALiBi support for BERT training
  You can now use ALiBi (Attention with Linear Biases; Press et al., 2021) when training BERT models with Composer, delivering faster training and higher accuracy by leveraging shorter sequence lengths.
  ALiBi improves the quality of BERT pre-training, especially when pre-training uses shorter sequence lengths than the downstream (fine-tuning) task. This allows models with ALiBi to reach higher downstream accuracy with less pre-training time.
  Example of using ALiBi as an algorithm with the Composer Trainer:

  ```python
  import composer

  # Create an instance of a BERT masked language model
  model = composer.models.create_bert_mlm()

  # Apply ALiBi (when training is initialized)
  alibi = composer.algorithms.Alibi(max_sequence_length=1024)

  # Train with ALiBi
  trainer = composer.trainer.Trainer(
      model=model,
      train_dataloader=train_dataloader,
      algorithms=[alibi]
  )
  trainer.fit()
  ```

  Example using the Composer Functional API:

  ```python
  import composer.functional as cf

  # Create an instance of a BERT masked language model
  model = composer.models.create_bert_mlm()

  # Apply ALiBi and expand the model's maximum sequence length to 1024
  cf.apply_alibi(model=model, max_sequence_length=1024)
  ```

  ALiBi can also now be extended to work with custom models by registering your attention and embedding layers. Please see our ALiBi method card for more information.
- 🧐 Entry point for GLUE tasks pre-training and fine-tuning
  You can now easily pre-train and fine-tune NLP models across all GLUE (General Language Understanding Evaluation) tasks through one simple entry point! The entry point handles model saving and loading, spawns GLUE tasks in parallel across all available GPUs, and delivers a highly efficient evaluation of model performance.
  Example of launching the entrypoint:

  ```bash
  # This runs pre-training followed by fine-tuning.
  # --training_scheme can take either pretrain, finetune, or all depending on the task!
  python run_glue_trainer.py -f glue_example.yaml --training_scheme all
  ```
Please see our GLUE entrypoint notebook for more information.
- 🤖 TPU support (in beta)
  You can now use Composer to train your models on TPUs! Support is now available in Beta, and currently only supports single-core TPU training. Try it out, explore optimizations, and share your feedback and feature requests with us so we can make it better for you and for the community.
  To use TPUs with Composer, simply specify a `tpu` device:

  ```python
  # Set device to `tpu`
  trainer = composer.trainer.Trainer(
      model=model,
      train_dataloader=train_dataloader,
      max_duration=train_epochs,
      device='tpu')

  # Run fit
  trainer.fit()
  ```
Please see our Training with TPUs notebook for more information.
- 🍎 Apple Silicon support (beta)
  Leverage Apple Silicon chips to train your models with Composer by providing the `device='mps'` argument:

  ```python
  trainer = Trainer(
      ...,
      device='mps'
  )
  ```

  We use the latest PyTorch MPS backend to execute the training. This requires torch version ≥1.12 and macOS 12.3+.
For more information on training with Apple M chips, see the PyTorch 1.12 blog and our API Reference for Composer specific details.
- 🚧 Contrib repository
  Got a new method idea, or published a paper and want those methods to be easily accessible? We’ve created the `mcontrib` repository, with a lightweight process to contribute new algorithms. We’re happy to work directly with you to benchmark these methods and eventually “promote” them to Composer for use by end customers.
  Please check out the README for details on how to contribute a new algorithm. For more details on how to write speed-up methods, see our notebook on custom speed-up methods.
Additional API Changes
- 🔢 Passes Module
  The order in which algorithms are run matters significantly during composition. With this release we refactored algorithm passes into their own `passes` module. Users can now register custom passes (for custom algorithms) with the Engine. Please see #1377 for more information.
- 🗄️ Default Checkpoint Extension
  The CheckpointSaver now defaults to using the `*.pt` extension for checkpoint filenames. Please see #1370 for more information.
- 👁️ Models Refactor
  Most vision models (ResNet, MNIST, ViT, EfficientNet) have been refactored from classes to a factory function. For example, `ComposerResNet` -> `composer_resnet`.

  ```python
  # before
  from composer.models import ComposerResNet
  model = ComposerResNet(...)

  # after
  from composer.models import composer_resnet
  model = composer_resnet(...)
  ```

  The same refactor has been done for NLP as well, e.g. `BERTModel` -> `create_bert_mlm` and `create_bert_classification`.
- ➕ Misc API Changes
  - `BreakEpochException` has been removed.
  - `state.is_model_deepspeed` has been moved to `composer.utils.is_model_deepspeed`.
  - Helper function `monitored_barrier` has been added to `composer` distributed (see the sketch below).
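  A minimal sketch of the new distributed barrier helper (the exact import path `composer.utils.dist` and the no-argument call are assumptions, not confirmed by these notes):

  ```python
  from composer.utils import dist  # assumed location of the distributed helpers

  # Block until every rank reaches this point, with monitoring/timeout support
  dist.monitored_barrier()
  ```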
Bug Fixes
- Add informative error for infer batch size issues (#1401)
- Fix ImagenetDatasetHparams bug (#1392), resolves #1111
- Fix hparams error condition checking (#1394)
- Fix AMP resumption with grad scaler (#1376)
- Auto Grad Accum Cache Clearing (#1380), fixes issue reported in #1331
- Fix default precision (#1369)
- Fix the profiler on multi-node training (#1358), resolves #1270
- Retry SFTP on Size Mismatch (#1300)
- Fix scheduler edge cases (#1350), resolves #1077
- Fix a race condition in the object store logger (#1328)
- Fix WandB load from checkpoint (#1326)
- Fix Notebook Progress Bars (#1313)
Commits
What's Changed
- Fix DeepSpeed typo in docstring by @abhi-mosaic in #1188
- Move grad_accum logging to every step by @coryMosaicML in #1187
- Update STYLE_GUIDE with details on Documentation by @bandish-shah in #1183
- ProgressBar Units by @hanlint in #1190
- Added Xavier Normal initializer by @vladd-i in #1196
- Updated cost figure by @nqn in #1180
- Remove algorithm yamls by @hanlint in #1193
- Fix the Composer Launch Script for the C...