Abra merge test (#2870)

* Bump torch to 2.1.1 version (#2717) * Add more info when run doesnt complete (#2751) * Lower sequence generation length on code gen to be dependent on max canonical solution length (#2682) * sequentialize generations_per_sample * fix bug * lower generation length * lower generation length * lower generation length * fix gen len * restore * restore * restore * fix tests * fix test * Remove flatten params (#2761) * remove flatten params * simplify tests * simplify tests * clean * fix more tests * rerun tests * speed up icl * fix tests * fix cpu tests * add more fixtures * fix tests * token count * fix vocab size * remove logger * remove clears * fix mosaicml logger * change codeowners * clean up codeowners * rerun tests * shrink dataset * fix tests * fix test * rerun tests * fix tests * fix tests * fix seed * set to 0 * rerun tests * rerun tests * change threshold * rerun tests * rerun tests * logs * remove changes * logs * logs * remove logs * rerun tests * rerun tests * logs * rerun * logs * rerun * rerun * rerun tests * many more logs * rerun tests * strip logs * enable tests * remove opt * rerun tests * add test * lint * rerun tests * fix lint * lint * filter warnings * rerun tests * fixture * add fixture * change * logs * rerun tests * add logs * rerun tests * fixture * lint * lint * rerun tests * fix ignore warning * logs * regex * regex * regex * fix * logs * reformat * fix lint (#2767) * lint (#2768) * Use time.tokens for speedmonitor instead of dataset length (#2762) * change token math * tokens * add test * fix tests * remove exception (#2759) * time to clean up time parsing 😉 (#2770) * time to clean up time parsing * fix type error * updates * Upgrade RunConfig compute specification (#2772) * Upgrade RunConfig compute specification * extra cluster * Use async logging in MLflowLogger (#2693) * async mlflow logging Signed-off-by: chenmoneygithub <chen.qian@databricks.com> * small fix Signed-off-by: chenmoneygithub <chen.qian@databricks.com> * clean up * 
fix test * fix tests * deflake * pin mlflow --------- Signed-off-by: chenmoneygithub <chen.qian@databricks.com> * Fix FSDP _param_init_fn to not reinit parameters multiple times (#2765) * Gate FSDP param init test on torch 2.1 (#2774) * Parallelize OCI multipart download (#2750) * [UCVolumes] Add support for list API (#2769) * Add the memory timeline profiling support through the PyTorch profiler. (#2771) * v1 * fix issues * add logs * change names * comment * add device * uncomment original trace * add custome plot * fix pyright * Update composer/profiler/torch_profiler.py Co-authored-by: Charles Tang <j316chuck@users.noreply.github.com> * address comments * fix code check * fix formatting * address comments * add unit test * fix check * fix check * fix check * fix check * fix print * add test comment * add test comment --------- Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> Co-authored-by: Charles Tang <j316chuck@users.noreply.github.com> * Improve torch memory profiling arguments processing (#2777) * improve torch profile args * improve torch profile args * change default torch_prof_memory_filename * add memory profiling arg test * fix check * fix check * fix check * fix check * fix check * fix check * Add platform AWS and bump aws ofi nccl version (#2776) * Extend checkpoint loading to accept a validation function (#2726) * Fix checkpoint validation tests for torch 1.13 (#2779) * fix checkpoint validation tests for torch 1.13 * Fix * Bump version to 0.17.2 (#2780) * bump version * 0.17.2 * update matrix * bump transformers version (#2781) * Bump sphinxext-opengraph from 0.9.0 to 0.9.1 (#2784) Bumps [sphinxext-opengraph](https://github.com/wpilibsuite/sphinxext-opengraph) from 0.9.0 to 0.9.1. 
- [Release notes](https://github.com/wpilibsuite/sphinxext-opengraph/releases) - [Commits](wpilibsuite/sphinxext-opengraph@v0.9.0...v0.9.1) --- updated-dependencies: - dependency-name: sphinxext-opengraph dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump coverage[toml] from 7.3.0 to 7.3.3 (#2783) Bumps [coverage[toml]](https://github.com/nedbat/coveragepy) from 7.3.0 to 7.3.3. - [Release notes](https://github.com/nedbat/coveragepy/releases) - [Changelog](https://github.com/nedbat/coveragepy/blob/master/CHANGES.rst) - [Commits](nedbat/coveragepy@7.3.0...7.3.3) --- updated-dependencies: - dependency-name: coverage[toml] dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Update torch requirement from <2.1.2,>=1.13.1 to >=1.13.1,<2.1.3 (#2785) Updates the requirements on [torch](https://github.com/pytorch/pytorch) to permit the latest version. - [Release notes](https://github.com/pytorch/pytorch/releases) - [Changelog](https://github.com/pytorch/pytorch/blob/main/RELEASE.md) - [Commits](pytorch/pytorch@v1.13.1...v2.1.2) --- updated-dependencies: - dependency-name: torch dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * [UCVolumes] Rely on databricks-sdk auth for the right requirements (#2789) * Enable system metrics in mosaic mlflow logger (#2775) * Enable system metrics in mosaic mlflow logger * remove fixture * Update composer/loggers/mlflow_logger.py Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> * Update composer/loggers/mlflow_logger.py Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> * Update composer/loggers/mlflow_logger.py Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> --------- Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com> * Update parse_uri (#2787) * default-no-memory-timeline (#2790) * Add eot token to ICL generate kwargs (#2782) * add custome gen kwargs and stopping on eos token * modify test * modify test * finish * finish * finish * finish * Add nightly image for torch 2.2.0 12-20-23 (#2791) * Add torch nightly 12-13 (#2792) * Add process group as arg to FSDP (#2794) * add test * only cast if PG is specified * add to docstring * filter warning * filter warning * docs * support lists * remove warnings * lint * hsdp monkeypatch * logs * change log * fix patch * typo * clean up logs * Bump coverage[toml] from 7.3.3 to 7.3.4 (#2798) Bumps [coverage[toml]](https://github.com/nedbat/coveragepy) from 7.3.3 to 7.3.4. - [Release notes](https://github.com/nedbat/coveragepy/releases) - [Changelog](https://github.com/nedbat/coveragepy/blob/master/CHANGES.rst) - [Commits](nedbat/coveragepy@7.3.3...7.3.4) --- updated-dependencies: - dependency-name: coverage[toml] dependency-type: direct:development update-type: version-update:semver-patch ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Fix load_ignore_keys with rng (#2803) * fix rng load * lint * Bump ipykernel from 6.26.0 to 6.28.0 (#2806) Bumps [ipykernel](https://github.com/ipython/ipykernel) from 6.26.0 to 6.28.0. - [Release notes](https://github.com/ipython/ipykernel/releases) - [Changelog](https://github.com/ipython/ipykernel/blob/main/CHANGELOG.md) - [Commits](ipython/ipykernel@v6.26.0...v6.28.0) --- updated-dependencies: - dependency-name: ipykernel dependency-type: direct:development update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump junitparser from 3.1.0 to 3.1.1 (#2805) Bumps [junitparser](https://github.com/weiwei/junitparser) from 3.1.0 to 3.1.1. - [Changelog](https://github.com/weiwei/junitparser/blob/master/CHANGELOG.md) - [Commits](weiwei/junitparser@3.1.0...3.1.1) --- updated-dependencies: - dependency-name: junitparser dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump pytest from 7.4.3 to 7.4.4 (#2807) Bumps [pytest](https://github.com/pytest-dev/pytest) from 7.4.3 to 7.4.4. - [Release notes](https://github.com/pytest-dev/pytest/releases) - [Changelog](https://github.com/pytest-dev/pytest/blob/main/CHANGELOG.rst) - [Commits](pytest-dev/pytest@7.4.3...7.4.4) --- updated-dependencies: - dependency-name: pytest dependency-type: direct:development update-type: version-update:semver-patch ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Avoid futures on close for MosaicML logger (#2804) * avoid futures on close * typo * logs * logs * check (#2812) * Better communication computation overlap (#2811) * patched torch * fixed torch imports * fixed torch imports * fixed torch imports * patching through composer * patching through composer * patching typingr * comment added * don't patch torch 2.1.0 * patch torch 2.1.1 and 2.2.0 * linting fix * Improve error message for speed monitor (#2801) * fix flops * stacklevel * bump torch version (#2814) * bump vision (#2815) * fix rng load (#2816) * Correct multi-unshard stream patching for torch 2.2.0dev, and stream waiting correctness. (#2817) * patched torch * fixed torch imports * fixed torch imports * fixed torch imports * patching through composer * patching through composer * patching typingr * comment added * don't patch torch 2.1.0 * patch torch 2.1.1 and 2.2.0 * linting fix * waiting on computation stream from unshard stream * waiting on computation stream from unshard stream * less waiting * no waiting * all unshard streams wait on computation stream now * 2.2.0 dev change * fix profiler (#2818) * Bump traitlets from 5.13.0 to 5.14.1 (#2822) Bumps [traitlets](https://github.com/ipython/traitlets) from 5.13.0 to 5.14.1. - [Release notes](https://github.com/ipython/traitlets/releases) - [Changelog](https://github.com/ipython/traitlets/blob/main/CHANGELOG.md) - [Commits](ipython/traitlets@v5.13.0...v5.14.1) --- updated-dependencies: - dependency-name: traitlets dependency-type: direct:development update-type: version-update:semver-minor ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * All unshard streams wait on computation every step (#2823) * patched torch * fixed torch imports * fixed torch imports * fixed torch imports * patching through composer * patching through composer * patching typingr * comment added * don't patch torch 2.1.0 * patch torch 2.1.1 and 2.2.0 * linting fix * waiting on computation stream from unshard stream * waiting on computation stream from unshard stream * less waiting * no waiting * all unshard streams wait on computation stream now * 2.2.0 dev change * correct waiting on computation stream * fsdp state typiung * patching root pre forward * patching root pre forward * fsdp state typing * patch forward * correct waiting * linting * Add encoding=utf-8 (#2824) * Fix import for daily test (#2826) * patched torch * fixed torch imports * fixed torch imports * fixed torch imports * patching through composer * patching through composer * patching typingr * comment added * don't patch torch 2.1.0 * patch torch 2.1.1 and 2.2.0 * linting fix * waiting on computation stream from unshard stream * waiting on computation stream from unshard stream * less waiting * no waiting * all unshard streams wait on computation stream now * 2.2.0 dev change * correct waiting on computation stream * fsdp state typiung * patching root pre forward * patching root pre forward * fsdp state typing * patch forward * correct waiting * linting * daily test change * daily test fix * [MLFlowObjectStore] [1/2] Base implementation for MLFlowObjectStore (#2802) * Implementation of MLFlowObjectStore * Update object store test settings * Import mlflow dependencies inline * Fix tests and ignore some pyright * Bugfix * Enforce experiment and run in get_artifact_path * Update placeholders * Make logs debug instead of info * Minor PR comments * MLflow casing * tracking_uri fixes * Update comments * Update placeholders * Fix 
tests * Fix pyright * Use tempfile for temp dirs * Read tracking uri env var directly * Remove dist from MLFlowObjectStore --------- Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com> * Remove fused layernorm (already deprecated for 2 versions) (#2827) * remove fused layernorm * remove import * remove import * remove * fix * remove docs * all * fix * filter warnings * norm * lint * refactor --------- Co-authored-by: Your Name <you@example.com> * checkpoint saver tracks all checkpoints/intervals in state (#2819) * checkpoint tracking state * fix some tests * Update tests/callbacks/test_checkpoint_saver.py * Checkpoint itself should be included in state, dont pickle timestamp object * patch the key error (doesnt fix the bug though :sad:) * avoid slashes in state, adjust tests * fix gpu test, probably * formatting * feedback * add a comment * Apply suggestions from code review Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> --------- Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> * code-quality timeout update (#2830) Timed out after 10 minutes here https://github.com/mosaicml/composer/actions/runs/7465107219/job/20313553654?pr=2819 Bumps runtime up to 15min * [S] Fix how single value tensors are logged (#2831) Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com> * Adds DTensor Support (#2821) * fixes to get dtensor to work * more fixes * Change state dict materialization for new version of torch * get load working for new set_state_dict api * use device_mesh * Add fsdp init monkeypatch for DTensor * Add checkpoint profiling logs * attempt * working single node * fix optimizer * allow 3d device mesh * attempt to use different pg during 3d mesh save * undo 3d mesh changes * load_state_dict -> load * allow parent mesh in FSDP init * allow override of force_sync_module_states * remove unnecessary exit * ignore _validate_and_get_shard_state() * save/load hsdp-moe working * remove prints * v1 * v2 * lint * add more 
tests * switch to PRs * ignore warning * fix lint * version error * fix version * fix state dict * update versions * lint * lint * disable lint for mosaic fsdp utils * remove bad line * move around for legacy * device mesh * ignore warning * fix import * always init * fix error * fix load planner * remove * fix lint * lint * delay state dict * test checkpoint * checkpoint * fix cpu tests * fix rotate tests * fix precision * lint * fix alibi * cleanup * cleanup * remove force sync * fix type * merge * lint * fix gpt * comment * fix test * lint * minor optimizations * Update composer/core/state.py Co-authored-by: Evan Racah <evan@mosaicml.com> * revert tests --------- Co-authored-by: Evan Racah <ejracah@gmail.com> Co-authored-by: Abhinav Venigalla <abhi.venigalla@databricks.com> Co-authored-by: root <23239305+b-chu@users.noreply.github.com> Co-authored-by: Abhinav Venigalla <abhi@mosaicml.com> Co-authored-by: Your Name <you@example.com> Co-authored-by: Evan Racah <evan@mosaicml.com> * Remove duplicate checkpoint verifications (#2828) * Fix seed for FSDP wrap (#2833) * first try * add context * lint * more lint * remove comment --------- Co-authored-by: Daniel King <daniel@mosaicml.com> Co-authored-by: Your Name <you@example.com> * Remove fsdp patch for comm overlap (#2836) * allow hsdp (#2838) * Bump torch 2.1.2 (#2840) * bump torch * bump * bump * Upgrade pyright to 1.1.310 (#2841) * [MLFlowObjectStore] [2/2] Support checkpointing with MLFlow (#2810) * Support checkpoint uploads to MLFlow (untested) Use MLFlow run tag for autoresume Add MLFlowLogger test for existing composer run tag * Try formatting mlflow save folder after INIT Make MLFlow experiment and run ID available on all ranks Fix path issue Format mlflow placeholders in remote filenames * Unit tests for partial_format * Log mlflow info as hyperparams * partial_format doc update * Fix formatting * Pull distributed logic out of MLFlowObjectStore Add debug tracebacks Bugfix Add path to debug info Try fixing 
RUD object store init Pyright * Partial format in format_name helpers * Fix import * Add extra partial_format test * Fix mlflow RUD check * Fix test pyright No longer expect KeyError for format_with_dist using partial_format Refactor partial_format for readability * Max iters on partial_format * Fix partial_format * Clean up * fix test import * Fix test * update nightly to torch 2.3 (#2842) * update nightly to torch 2.3 * tighten --------- Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> * Pin sphinxcontrib applehelp (#2854) * pin release * bump * break pypi * tighter pin * pin * pin * pin * Update setup.py (#2855) * Torch 2.3 patch (#2849) * add monkeypatch for verify_options * patch * fix * fix * partial precommit * bit of cleanup * doc * debug * fix version pinning * precommit * checkdown * lint --------- Co-authored-by: Evan Racah <ejracah@gmail.com> Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> * Update mosaicml-cli requirement from <0.6,>=0.5.25 to >=0.5.25,<0.7 (#2866) Updates the requirements on [mosaicml-cli](https://github.com/mosaicml/mosaicml-cli) to permit the latest version. - [Commits](https://github.com/mosaicml/mosaicml-cli/commits) --- updated-dependencies: - dependency-name: mosaicml-cli dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Rewrite to use individual state functions (#2860) * checkdown * checkdown * lint * fix * load ignore keys * fix * resolve comments * fix load ignore keys * offload * fix gate * merge * lint * use flag * force trye * Add custom stopping criteria to ICL generate tasks (#2800) * add custome gen kwargs and stopping on eos token * modify test * modify test * finish * finish * finish * finish * finish pr * implement early stop * add tesT * fix bug * bug fix * add keys * diff split * fix typo * fix precommit * fix precommit * fix precommit * fix precommit * fix precommit * fix precommit * fix conditional import * add nlp metrics * remove code gen changes * fix nits --------- Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com> * Add save_ignore_keys (#2868) * comment * add it * debug * add the keys * debug * debug * remove print statement * docs and tests * fix tests --------- Co-authored-by: Daniel King <daniel@mosaicml.com> --------- Signed-off-by: chenmoneygithub <chen.qian@databricks.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Charles Tang <j316chuck@users.noreply.github.com> Co-authored-by: Anna <anna@mosaicml.com> Co-authored-by: Jeremy D <115047575+bmosaicml@users.noreply.github.com> Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> Co-authored-by: Chen Qian <chenmoney@google.com> Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com> Co-authored-by: coryMosaicML <83666378+coryMosaicML@users.noreply.github.com> Co-authored-by: Harsh Panchal <68880048+panchalhp-db@users.noreply.github.com> Co-authored-by: willgleich <22464726+willgleich@users.noreply.github.com> Co-authored-by: Irene Dea <deaairene@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: snarayan21 <saaketh@mosaicml.com> Co-authored-by: 
Jerry Chen <jerry.chen@databricks.com> Co-authored-by: Your Name <you@example.com> Co-authored-by: Evan Racah <ejracah@gmail.com> Co-authored-by: Abhinav Venigalla <abhi.venigalla@databricks.com> Co-authored-by: root <23239305+b-chu@users.noreply.github.com> Co-authored-by: Abhinav Venigalla <abhi@mosaicml.com> Co-authored-by: Evan Racah <evan@mosaicml.com> Co-authored-by: Daniel King <daniel@mosaicml.com>
mosaicml · Jan 17, 2024 · 76a0e43
1 parent cf6d9e9
Showing 154 changed files with 4,866 additions and 1,820 deletions.
diff --git a/.github/mcli/mcli_pytest.py b/.github/mcli/mcli_pytest.py
@@ -6,7 +6,7 @@
 import argparse
 import time
 
-from mcli import RunConfig, RunStatus, create_run, follow_run_logs, stop_run, wait_for_run_status
+from mcli import RunConfig, RunStatus, create_run, follow_run_logs, wait_for_run_status
 
 if __name__ == '__main__':
 
@@ -67,8 +67,6 @@
 
     export COMMON_ARGS="-v --durations=20 -m '{args.pytest_markers}' {s3_bucket_flag} {clear_tmp_path_flag}"
 
-    export PYTHONUNBUFFERED=1
-
     make test PYTEST='{args.pytest_command}' EXTRA_ARGS="$COMMON_ARGS --codeblocks"
 
     make test-dist PYTEST='{args.pytest_command}' EXTRA_ARGS="$COMMON_ARGS" WORLD_SIZE=2
@@ -79,13 +77,25 @@
     '''
     config = RunConfig(
         name=name,
-        cluster=args.cluster,
-        gpu_type=args.gpu_type,
-        gpu_num=args.gpu_num,
+        compute={
+            'cluster': args.cluster,
+            'gpu_type': args.gpu_type,
+            'gpus': args.gpu_num
+        },
         image=args.image,
         integrations=[git_integration],
         command=command,
         scheduling={'max_duration': args.timeout / 60 / 60},
+        env_variables=[
+            {
+                'key': 'MOSAICML_PLATFORM',
+                'value': 'False',
+            },
+            {
+                'key': 'PYTHONUNBUFFERED',
+                'value': '1',
+            },
+        ],
     )
 
     # Create run
@@ -102,7 +112,7 @@
         print(line, end='')
 
     print('[GHA] Run completed. Waiting for run to finish...')
-    run = wait_for_run_status(run, status='completed')
+    run = wait_for_run_status(run, status=RunStatus.COMPLETED)
 
-    # Fail if command exited with non-zero exit code or timed out
-    assert run.status == RunStatus.COMPLETED
+    # Fail if command exited with non-zero exit code or timed out (didn't reach COMPLETED)
+    assert run.status == RunStatus.COMPLETED, f'Run {run.name} did not complete: {run.status} ({run.reason})'
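The hunk above moves the cluster and GPU settings into a single `compute` mapping, passes environment variables as a list of `key`/`value` entries, and converts the timeout from seconds to hours for `max_duration`. A minimal sketch of that argument shape follows. It deliberately does not import `mcli`; the helper name `build_run_kwargs` and the sample cluster and GPU values are hypothetical and only illustrate the structure seen in the diff.

```python
from typing import Any, Dict, List


def build_run_kwargs(cluster: str, gpu_type: str, gpu_num: int, timeout_s: int) -> Dict[str, Any]:
    """Assemble keyword arguments in the shape the updated script passes to RunConfig.

    Hypothetical helper for illustration; the real script builds these inline.
    """
    # Compute settings are grouped under one mapping; note the key is 'gpus', not 'gpu_num'.
    compute = {'cluster': cluster, 'gpu_type': gpu_type, 'gpus': gpu_num}

    # Environment variables are a list of {'key': ..., 'value': ...} entries with string values.
    env_variables: List[Dict[str, str]] = [
        {'key': 'MOSAICML_PLATFORM', 'value': 'False'},
        {'key': 'PYTHONUNBUFFERED', 'value': '1'},
    ]

    # The script receives its timeout in seconds and converts it to hours.
    scheduling = {'max_duration': timeout_s / 60 / 60}

    return {'compute': compute, 'env_variables': env_variables, 'scheduling': scheduling}


# Sample values are made up for the example.
kwargs = build_run_kwargs('my-cluster', 'a100_40gb', 2, 1800)
print(kwargs['compute'])
print(kwargs['scheduling'])
```

With a 1800-second timeout the `max_duration` entry comes out as 0.5 hours. In the script itself these mappings are passed directly as `RunConfig(compute=..., env_variables=..., scheduling=...)`.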
diff --git a/.github/workflows/code-quality.yaml b/.github/workflows/code-quality.yaml
@@ -18,7 +18,7 @@ defaults:
 jobs:
   code-quality:
     runs-on: ubuntu-20.04
-    timeout-minutes: 10
+    timeout-minutes: 15
     strategy:
       matrix:
         python_version:

diff --git a/.github/workflows/pr-cpu.yaml b/.github/workflows/pr-cpu.yaml
@@ -27,6 +27,11 @@ jobs:
             markers: 'not daily and not remote and not gpu and not vision and not doctest'
             pytest_command: 'coverage run -m pytest'
             composer_package_name: 'mosaicml'
+          # - name: 'cpu-3.10-2.2'
+          #   container: mosaicml/pytorch:2.2.0_cu121-nightly20231213-python3.10-ubuntu20.04
+          #   markers: 'not daily and not remote and not gpu and not vision and not doctest'
+          #   pytest_command: 'coverage run -m pytest'
+          #   composer_package_name: 'mosaicml'
           - name: 'cpu-vision'
             container: mosaicml/pytorch_vision:1.13.1_cpu-python3.10-ubuntu20.04
             markers: 'not daily and not remote and not gpu and vision and not doctest'

diff --git a/.github/workflows/pr-gpu.yaml b/.github/workflows/pr-gpu.yaml
@@ -17,6 +17,11 @@ jobs:
             markers: 'not daily and not remote and gpu and (doctest or not doctest)'
             pytest_command: 'coverage run -m pytest'
             composer_package_name: 'mosaicml'
+          # - name: 'gpu-3.10-2.2'
+          #   container: mosaicml/pytorch:2.2.0_cu121-nightly20231213-python3.10-ubuntu20.04
+          #   markers: 'not daily and not remote and gpu and (doctest or not doctest)'
+          #   pytest_command: 'coverage run -m pytest'
+          #   composer_package_name: 'mosaicml'
     name: ${{ matrix.name }}
     if: github.repository_owner == 'mosaicml'
     with:

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -110,7 +110,7 @@ repos:
         types: [python]
         pass_filenames: false
         args: [--warnings]
-        additional_dependencies: ["pyright@1.1.256"]
+        additional_dependencies: ["pyright@1.1.310"]
   - repo: https://github.com/trufflesecurity/trufflehog.git
     rev: v3.40.0
     hooks:

diff --git a/CODEOWNERS b/CODEOWNERS
@@ -17,11 +17,11 @@
 # as an owner for all sections, so anyone on Composer Eng can approve any Composer PR
 # According to the CODEOWNER docs, the last match takes precedence, so @mosaicml/composer-team-eng
 # must be mentioned for each rule below.
-/composer/algorithms/ @dskhudia @mvpatel2000 @nik-mosaic
+/composer/algorithms/ @mosaicml/composer-team-eng
 /composer/cli/ @mosaicml/composer-team-eng
 /composer/datasets/ @mosaicml/composer-team-eng
-/composer/functional/ @dblalock @mvpatel2000
-/composer/loggers/ @eracah @dakinggg
+/composer/functional/ @mosaicml/composer-team-eng @dblalock
+/composer/loggers/ @mosaicml/composer-team-eng @eracah @dakinggg
 /composer/loss/ @mosaicml/composer-team-eng
 /composer/metrics/ @mosaicml/composer-team-eng
 /composer/models/ @mosaicml/composer-team-eng

diff --git a/composer/algorithms/__init__.py b/composer/algorithms/__init__.py
@@ -46,7 +46,6 @@ def apply(self, state: State, event: Event, logger: Logger):
 from composer.algorithms.cutout import CutOut
 from composer.algorithms.ema import EMA
 from composer.algorithms.factorize import Factorize
-from composer.algorithms.fused_layernorm import FusedLayerNorm
 from composer.algorithms.gated_linear_units import GatedLinearUnits
 from composer.algorithms.ghost_batchnorm import GhostBatchNorm
 from composer.algorithms.gradient_clipping import GradientClipping
@@ -79,7 +78,6 @@ def apply(self, state: State, event: Event, logger: Logger):
     'CutOut',
     'EMA',
     'Factorize',
-    'FusedLayerNorm',
     'GatedLinearUnits',
     'GhostBatchNorm',
     'GradientClipping',

diff --git a/composer/algorithms/alibi/attention_surgery_functions/__init__.py b/composer/algorithms/alibi/attention_surgery_functions/__init__.py
@@ -6,7 +6,8 @@
 from composer.utils import MissingConditionalImportError
 
 try:
-    from composer.algorithms.alibi.attention_surgery_functions import _bert, _gpt2  # pyright: reportUnusedImport=none
+    from composer.algorithms.alibi.attention_surgery_functions import _bert  # pyright: ignore[reportUnusedImport]
+    from composer.algorithms.alibi.attention_surgery_functions import _gpt2  # pyright: ignore[reportUnusedImport]
     from composer.algorithms.alibi.attention_surgery_functions.utils import policy_registry
 except ImportError as e:
     raise MissingConditionalImportError(extra_deps_group='nlp', conda_package='transformers') from e

diff --git a/composer/algorithms/alibi/attention_surgery_functions/_bert.py b/composer/algorithms/alibi/attention_surgery_functions/_bert.py
@@ -1,6 +1,7 @@
 # Copyright 2022 MosaicML Composer authors
 # SPDX-License-Identifier: Apache-2.0
 
+import copy
 import math
 from types import MethodType
 from typing import Optional, Tuple
@@ -20,13 +21,14 @@ def bert_embedding_converter(module: torch.nn.Module, module_index: int, max_seq
     """
     assert isinstance(module, (BertEmbeddings, RobertaEmbeddings))
     del module_index  # unused
-    zero_and_freeze_expand_position_embeddings(module,
+    new_module = copy.deepcopy(module)
+    zero_and_freeze_expand_position_embeddings(new_module,
                                                max_sequence_length,
                                                position_embedding_attribute='position_embeddings')
 
-    module_device = next(module.parameters()).device
-    module.register_buffer('position_ids', torch.arange(max_sequence_length).expand((1, -1)).to(module_device))
-    return module
+    module_device = next(new_module.parameters()).device
+    new_module.register_buffer('position_ids', torch.arange(max_sequence_length).expand((1, -1)).to(module_device))
+    return new_module
 
 
 @policy_registry.register(BertSelfAttention, RobertaSelfAttention)
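The `_bert.py` change makes the converter work on a deep copy instead of mutating the module it was given. The sketch below shows the difference using a plain Python stand-in; `ToyEmbeddings` and both converter functions are hypothetical and are not the `transformers` or Composer classes.

```python
import copy
from typing import List


class ToyEmbeddings:
    """Hypothetical stand-in for an embeddings module (not a real transformers class)."""

    def __init__(self, max_positions: int) -> None:
        self.position_ids: List[int] = list(range(max_positions))


def convert_in_place(module: ToyEmbeddings, max_sequence_length: int) -> ToyEmbeddings:
    """Old pattern: the caller's module is modified and returned."""
    module.position_ids = list(range(max_sequence_length))
    return module


def convert_with_copy(module: ToyEmbeddings, max_sequence_length: int) -> ToyEmbeddings:
    """New pattern: modify a deep copy so the caller's module is left untouched."""
    new_module = copy.deepcopy(module)
    new_module.position_ids = list(range(max_sequence_length))
    return new_module


original = ToyEmbeddings(4)
converted = convert_with_copy(original, 8)
print(len(original.position_ids), len(converted.position_ids))

mutated = convert_in_place(original, 8)
print(mutated is original, len(original.position_ids))
```

After `convert_with_copy` the original object still has its four positions, while `convert_in_place` returns the very same object with its state changed. The copy costs extra memory for the duplicated module, which is a reasonable trade when other code may still hold a reference to the original.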

diff --git a/composer/algorithms/colout/colout.py b/composer/algorithms/colout/colout.py
@@ -29,10 +29,12 @@
 __all__ = ['ColOut', 'ColOutTransform', 'colout_batch']
 
 
-def colout_batch(sample: Union[ImgT, Tuple[ImgT, ImgT]],
-                 p_row: float = 0.15,
-                 p_col: float = 0.15,
-                 resize_target: Union[bool, str] = 'auto') -> Union[ImgT, Tuple[ImgT, ImgT]]:
+def colout_batch(
+        sample: Union[ImgT, Tuple[ImgT, ImgT]],
+        p_row: float = 0.15,
+        p_col: float = 0.15,
+        resize_target: Union[bool,
+                             str] = 'auto') -> Union[torch.Tensor, ImgT, Tuple[Tensor, Tensor], Tuple[ImgT, ImgT]]:
     """Applies ColOut augmentation to a batch of images and (optionally) targets,
     dropping the same random rows and columns from all images and targets in a batch.
 
@@ -136,7 +138,10 @@ def __init__(self, p_row: float = 0.15, p_col: float = 0.15, resize_target: Unio
         self.p_col = p_col
         self.resize_target = resize_target
 
-    def __call__(self, sample: Union[ImgT, Tuple[ImgT, ImgT]]) -> Union[ImgT, Tuple[ImgT, ImgT]]:
+    def __call__(
+            self, sample: Union[ImgT,
+                                Tuple[ImgT,
+                                      ImgT]]) -> Union[torch.Tensor, ImgT, Tuple[Tensor, Tensor], Tuple[ImgT, ImgT]]:
         """Drops random rows and columns from up to two images.
 
         Args:

diff --git a/composer/algorithms/factorize/factorize_modules.py b/composer/algorithms/factorize/factorize_modules.py
@@ -327,8 +327,8 @@ def solution_for_rank(self, input: torch.Tensor, rank: int) -> LowRankSolution:
 
     def apply_solution(self, solution: LowRankSolution):
         self.latent_size = solution.rank
-        self.module0.out_channels = solution.rank
-        self.module1.in_channels = solution.rank
+        self.module0.out_channels = solution.rank  # pyright: ignore[reportGeneralTypeIssues]
+        self.module1.in_channels = solution.rank  # pyright: ignore[reportGeneralTypeIssues]
         _apply_solution_to_module_parameters(solution, self.module0, self.module1, transpose=False)
 
     @staticmethod
@@ -452,8 +452,8 @@ def solution_for_rank(self, input: torch.Tensor, rank: int) -> LowRankSolution:
 
     def apply_solution(self, solution: LowRankSolution) -> None:
         self.latent_size = solution.rank
-        self.module0.out_features = solution.rank
-        self.module1.in_features = solution.rank
+        self.module0.out_features = solution.rank  # pyright: ignore[reportGeneralTypeIssues]
+        self.module1.in_features = solution.rank  # pyright: ignore[reportGeneralTypeIssues]
         _apply_solution_to_module_parameters(solution, self.module0, self.module1, transpose=True)
 
     @staticmethod
@@ -471,9 +471,10 @@ def max_allowed_latent_channels(in_features: int, out_features: int) -> int:
 
     @staticmethod
     def from_linear(module: torch.nn.Linear, module_ix: int = -1, **kwargs) -> FactorizedLinear:
-        ret = FactorizedLinear(in_features=module.in_features,
-                               out_features=module.out_features,
-                               bias=((module.bias is not None) and (module.bias is not False)),
-                               **kwargs)
+        ret = FactorizedLinear(
+            in_features=module.in_features,
+            out_features=module.out_features,
+            bias=(module.bias is not None and module.bias is not False),  # pyright: ignore[reportUnnecessaryComparison]
+            **kwargs)
         ret.reset_parameters()
         return ret
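In the hunks above, `apply_solution` sets the output size of `module0` and the input size of `module1` to `solution.rank`, so a factorized linear layer is a pair of layers joined by a latent dimension of that rank. The arithmetic below is a sketch of why the rank matters for parameter count. The function names are hypothetical, bias terms are ignored, and this is not claimed to be how `max_allowed_latent_channels` is computed in Composer.

```python
def dense_param_count(in_features: int, out_features: int) -> int:
    """Weights in a single Linear(in_features, out_features), ignoring bias."""
    return in_features * out_features


def factorized_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Weights in Linear(in_features, rank) followed by Linear(rank, out_features)."""
    return in_features * rank + rank * out_features


def largest_saving_rank(in_features: int, out_features: int) -> int:
    """Largest rank for which the factorized pair has strictly fewer weights."""
    dense = dense_param_count(in_features, out_features)
    # rank * (in + out) < in * out  <=>  rank <= (in * out - 1) // (in + out)
    return (dense - 1) // (in_features + out_features)


in_f, out_f = 512, 256
rank = largest_saving_rank(in_f, out_f)
print(rank, factorized_param_count(in_f, out_f, rank), dense_param_count(in_f, out_f))
```

For a 512-by-256 layer the dense weight count is 131072, and the factorized pair stays below that up to rank 170 (130560 weights); at rank 171 it would already exceed the dense layer.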
diff --git a/composer/algorithms/fused_layernorm/README.md b/composer/algorithms/fused_layernorm/README.md
diff --git a/composer/algorithms/fused_layernorm/__init__.py b/composer/algorithms/fused_layernorm/__init__.py