Commit
* Bump torch to 2.1.1 version (#2717) * Add more info when run doesnt complete (#2751) * Lower sequence generation length on code gen to be dependent on max canonical solution length (#2682) * sequentialize generations_per_sample * fix bug * lower generation length * lower generation length * lower generation length * fix gen len * restore * restore * restore * fix tests * fix test * Remove flatten params (#2761) * remove flatten params * simplify tests * simplify tests * clean * fix more tests * rerun tests * speed up icl * fix tests * fix cpu tests * add more fixtures * fix tests * token count * fix vocab size * remove logger * remove clears * fix mosaicml logger * change codeowners * clean up codeowners * rerun tests * shrink dataset * fix tests * fix test * rerun tests * fix tests * fix tests * fix seed * set to 0 * rerun tests * rerun tests * change threshold * rerun tests * rerun tests * logs * remove changes * logs * logs * remove logs * rerun tests * rerun tests * logs * rerun * logs * rerun * rerun * rerun tests * many more logs * rerun tests * strip logs * enable tests * remove opt * rerun tests * add test * lint * rerun tests * fix lint * lint * filter warnings * rerun tests * fixture * add fixture * change * logs * rerun tests * add logs * rerun tests * fixture * lint * lint * rerun tests * fix ignore warning * logs * regex * regex * regex * fix * logs * reformat * fix lint (#2767) * lint (#2768) * Use time.tokens for speedmonitor instead of dataset length (#2762) * change token math * tokens * add test * fix tests * remove exception (#2759) * time to clean up time parsing 😉 (#2770) * time to clean up time parsing * fix type error * updates * Upgrade RunConfig compute specification (#2772) * Upgrade RunConfig compute specification * extra cluster * Use async logging in MLflowLogger (#2693) * async mlflow logging Signed-off-by: chenmoneygithub <chen.qian@databricks.com> * small fix Signed-off-by: chenmoneygithub <chen.qian@databricks.com> * clean up * 
fix test * fix tests * deflake * pin mlflow --------- Signed-off-by: chenmoneygithub <chen.qian@databricks.com> * Fix FSDP _param_init_fn to not reinit parameters multiple times (#2765) * Gate FSDP param init test on torch 2.1 (#2774) * Parallelize OCI multipart download (#2750) * [UCVolumes] Add support for list API (#2769) * Add the memory timeline profiling support through the PyTorch profiler. (#2771) * v1 * fix issues * add logs * change names * comment * add device * uncomment original trace * add custome plot * fix pyright * Update composer/profiler/torch_profiler.py Co-authored-by: Charles Tang <j316chuck@users.noreply.github.com> * address comments * fix code check * fix formatting * address comments * add unit test * fix check * fix check * fix check * fix check * fix print * add test comment * add test comment --------- Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> Co-authored-by: Charles Tang <j316chuck@users.noreply.github.com> * Improve torch memory profiling arguments processing (#2777) * improve torch profile args * improve torch profile args * change default torch_prof_memory_filename * add memory profiling arg test * fix check * fix check * fix check * fix check * fix check * fix check * Add platform AWS and bump aws ofi nccl version (#2776) * Extend checkpoint loading to accept a validation function (#2726) * Fix checkpoint validation tests for torch 1.13 (#2779) * fix checkpoint validation tests for torch 1.13 * Fix * Bump version to 0.17.2 (#2780) * bump version * 0.17.2 * update matrix * bump transformers version (#2781) * Bump sphinxext-opengraph from 0.9.0 to 0.9.1 (#2784) Bumps [sphinxext-opengraph](https://github.com/wpilibsuite/sphinxext-opengraph) from 0.9.0 to 0.9.1. 
- [Release notes](https://github.com/wpilibsuite/sphinxext-opengraph/releases) - [Commits](wpilibsuite/sphinxext-opengraph@v0.9.0...v0.9.1) --- updated-dependencies: - dependency-name: sphinxext-opengraph dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump coverage[toml] from 7.3.0 to 7.3.3 (#2783) Bumps [coverage[toml]](https://github.com/nedbat/coveragepy) from 7.3.0 to 7.3.3. - [Release notes](https://github.com/nedbat/coveragepy/releases) - [Changelog](https://github.com/nedbat/coveragepy/blob/master/CHANGES.rst) - [Commits](nedbat/coveragepy@7.3.0...7.3.3) --- updated-dependencies: - dependency-name: coverage[toml] dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Update torch requirement from <2.1.2,>=1.13.1 to >=1.13.1,<2.1.3 (#2785) Updates the requirements on [torch](https://github.com/pytorch/pytorch) to permit the latest version. - [Release notes](https://github.com/pytorch/pytorch/releases) - [Changelog](https://github.com/pytorch/pytorch/blob/main/RELEASE.md) - [Commits](pytorch/pytorch@v1.13.1...v2.1.2) --- updated-dependencies: - dependency-name: torch dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * [UCVolumes] Rely on databricks-sdk auth for the right requirements (#2789) * Enable system metrics in mosaic mlflow logger (#2775) * Enable system metrics in mosaic mlflow logger * remove fixture * Update composer/loggers/mlflow_logger.py Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> * Update composer/loggers/mlflow_logger.py Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> * Update composer/loggers/mlflow_logger.py Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> --------- Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com> * Update parse_uri (#2787) * default-no-memory-timeline (#2790) * Add eot token to ICL generate kwargs (#2782) * add custome gen kwargs and stopping on eos token * modify test * modify test * finish * finish * finish * finish * Add nightly image for torch 2.2.0 12-20-23 (#2791) * Add torch nightly 12-13 (#2792) * Add process group as arg to FSDP (#2794) * add test * only cast if PG is specified * add to docstring * filter warning * filter warning * docs * support lists * remove warnings * lint * hsdp monkeypatch * logs * change log * fix patch * typo * clean up logs * Bump coverage[toml] from 7.3.3 to 7.3.4 (#2798) Bumps [coverage[toml]](https://github.com/nedbat/coveragepy) from 7.3.3 to 7.3.4. - [Release notes](https://github.com/nedbat/coveragepy/releases) - [Changelog](https://github.com/nedbat/coveragepy/blob/master/CHANGES.rst) - [Commits](nedbat/coveragepy@7.3.3...7.3.4) --- updated-dependencies: - dependency-name: coverage[toml] dependency-type: direct:development update-type: version-update:semver-patch ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Fix load_ignore_keys with rng (#2803) * fix rng load * lint * Bump ipykernel from 6.26.0 to 6.28.0 (#2806) Bumps [ipykernel](https://github.com/ipython/ipykernel) from 6.26.0 to 6.28.0. - [Release notes](https://github.com/ipython/ipykernel/releases) - [Changelog](https://github.com/ipython/ipykernel/blob/main/CHANGELOG.md) - [Commits](ipython/ipykernel@v6.26.0...v6.28.0) --- updated-dependencies: - dependency-name: ipykernel dependency-type: direct:development update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump junitparser from 3.1.0 to 3.1.1 (#2805) Bumps [junitparser](https://github.com/weiwei/junitparser) from 3.1.0 to 3.1.1. - [Changelog](https://github.com/weiwei/junitparser/blob/master/CHANGELOG.md) - [Commits](weiwei/junitparser@3.1.0...3.1.1) --- updated-dependencies: - dependency-name: junitparser dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump pytest from 7.4.3 to 7.4.4 (#2807) Bumps [pytest](https://github.com/pytest-dev/pytest) from 7.4.3 to 7.4.4. - [Release notes](https://github.com/pytest-dev/pytest/releases) - [Changelog](https://github.com/pytest-dev/pytest/blob/main/CHANGELOG.rst) - [Commits](pytest-dev/pytest@7.4.3...7.4.4) --- updated-dependencies: - dependency-name: pytest dependency-type: direct:development update-type: version-update:semver-patch ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Avoid futures on close for MosaicML logger (#2804) * avoid futures on close * typo * logs * logs * check (#2812) * Better communication computation overlap (#2811) * patched torch * fixed torch imports * fixed torch imports * fixed torch imports * patching through composer * patching through composer * patching typingr * comment added * don't patch torch 2.1.0 * patch torch 2.1.1 and 2.2.0 * linting fix * Improve error message for speed monitor (#2801) * fix flops * stacklevel * bump torch version (#2814) * bump vision (#2815) * fix rng load (#2816) * Correct multi-unshard stream patching for torch 2.2.0dev, and stream waiting correctness. (#2817) * patched torch * fixed torch imports * fixed torch imports * fixed torch imports * patching through composer * patching through composer * patching typingr * comment added * don't patch torch 2.1.0 * patch torch 2.1.1 and 2.2.0 * linting fix * waiting on computation stream from unshard stream * waiting on computation stream from unshard stream * less waiting * no waiting * all unshard streams wait on computation stream now * 2.2.0 dev change * fix profiler (#2818) * Bump traitlets from 5.13.0 to 5.14.1 (#2822) Bumps [traitlets](https://github.com/ipython/traitlets) from 5.13.0 to 5.14.1. - [Release notes](https://github.com/ipython/traitlets/releases) - [Changelog](https://github.com/ipython/traitlets/blob/main/CHANGELOG.md) - [Commits](ipython/traitlets@v5.13.0...v5.14.1) --- updated-dependencies: - dependency-name: traitlets dependency-type: direct:development update-type: version-update:semver-minor ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * All unshard streams wait on computation every step (#2823) * patched torch * fixed torch imports * fixed torch imports * fixed torch imports * patching through composer * patching through composer * patching typingr * comment added * don't patch torch 2.1.0 * patch torch 2.1.1 and 2.2.0 * linting fix * waiting on computation stream from unshard stream * waiting on computation stream from unshard stream * less waiting * no waiting * all unshard streams wait on computation stream now * 2.2.0 dev change * correct waiting on computation stream * fsdp state typiung * patching root pre forward * patching root pre forward * fsdp state typing * patch forward * correct waiting * linting * Add encoding=utf-8 (#2824) * Fix import for daily test (#2826) * patched torch * fixed torch imports * fixed torch imports * fixed torch imports * patching through composer * patching through composer * patching typingr * comment added * don't patch torch 2.1.0 * patch torch 2.1.1 and 2.2.0 * linting fix * waiting on computation stream from unshard stream * waiting on computation stream from unshard stream * less waiting * no waiting * all unshard streams wait on computation stream now * 2.2.0 dev change * correct waiting on computation stream * fsdp state typiung * patching root pre forward * patching root pre forward * fsdp state typing * patch forward * correct waiting * linting * daily test change * daily test fix * [MLFlowObjectStore] [1/2] Base implementation for MLFlowObjectStore (#2802) * Implementation of MLFlowObjectStore * Update object store test settings * Import mlflow dependencies inline * Fix tests and ignore some pyright * Bugfix * Enforce experiment and run in get_artifact_path * Update placeholders * Make logs debug instead of info * Minor PR comments * MLflow casing * tracking_uri fixes * Update comments * Update placeholders * Fix 
tests * Fix pyright * Use tempfile for temp dirs * Read tracking uri env var directly * Remove dist from MLFlowObjectStore --------- Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com> * Remove fused layernorm (already deprecated for 2 versions) (#2827) * remove fused layernorm * remove import * remove import * remove * fix * remove docs * all * fix * filter warnings * norm * lint * refactor --------- Co-authored-by: Your Name <you@example.com> * checkpoint saver tracks all checkpoints/intervals in state (#2819) * checkpoint tracking state * fix some tests * Update tests/callbacks/test_checkpoint_saver.py * Checkpoint itself should be included in state, dont pickle timestamp object * patch the key error (doesnt fix the bug though :sad:) * avoid slashes in state, adjust tests * fix gpu test, probably * formatting * feedback * add a comment * Apply suggestions from code review Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> --------- Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> * code-quality timeout update (#2830) Timed out after 10 minutes here https://github.com/mosaicml/composer/actions/runs/7465107219/job/20313553654?pr=2819 Bumps runtime up to 15min * [S] Fix how single value tensors are logged (#2831) Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com> * Adds DTensor Support (#2821) * fixes to get dtensor to work * more fixes * Change state dict materialization for new version of torch * get load working for new set_state_dict api * use device_mesh * Add fsdp init monkeypatch for DTensor * Add checkpoint profiling logs * attempt * working single node * fix optimizer * allow 3d device mesh * attempt to use different pg during 3d mesh save * undo 3d mesh changes * load_state_dict -> load * allow parent mesh in FSDP init * allow override of force_sync_module_states * remove unnecessary exit * ignore _validate_and_get_shard_state() * save/load hsdp-moe working * remove prints * v1 * v2 * lint * add more 
tests * switch to PRs * ignore warning * fix lint * version error * fix version * fix state dict * update versions * lint * lint * disable lint for mosaic fsdp utils * remove bad line * move around for legacy * device mesh * ignore warning * fix import * always init * fix error * fix load planner * remove * fix lint * lint * delay state dict * test checkpoint * checkpoint * fix cpu tests * fix rotate tests * fix precision * lint * fix alibi * cleanup * cleanup * remove force sync * fix type * merge * lint * fix gpt * comment * fix test * lint * minor optimizations * Update composer/core/state.py Co-authored-by: Evan Racah <evan@mosaicml.com> * revert tests --------- Co-authored-by: Evan Racah <ejracah@gmail.com> Co-authored-by: Abhinav Venigalla <abhi.venigalla@databricks.com> Co-authored-by: root <23239305+b-chu@users.noreply.github.com> Co-authored-by: Abhinav Venigalla <abhi@mosaicml.com> Co-authored-by: Your Name <you@example.com> Co-authored-by: Evan Racah <evan@mosaicml.com> * Remove duplicate checkpoint verifications (#2828) * Fix seed for FSDP wrap (#2833) * first try * add context * lint * more lint * remove comment --------- Co-authored-by: Daniel King <daniel@mosaicml.com> Co-authored-by: Your Name <you@example.com> * Remove fsdp patch for comm overlap (#2836) * allow hsdp (#2838) * Bump torch 2.1.2 (#2840) * bump torch * bump * bump * Upgrade pyright to 1.1.310 (#2841) * [MLFlowObjectStore] [2/2] Support checkpointing with MLFlow (#2810) * Support checkpoint uploads to MLFlow (untested) Use MLFlow run tag for autoresume Add MLFlowLogger test for existing composer run tag * Try formatting mlflow save folder after INIT Make MLFlow experiment and run ID available on all ranks Fix path issue Format mlflow placeholders in remote filenames * Unit tests for partial_format * Log mlflow info as hyperparams * partial_format doc update * Fix formatting * Pull distributed logic out of MLFlowObjectStore Add debug tracebacks Bugfix Add path to debug info Try fixing 
RUD object store init Pyright * Partial format in format_name helpers * Fix import * Add extra partial_format test * Fix mlflow RUD check * Fix test pyright No longer expect KeyError for format_with_dist using partial_format Refactor partial_format for readability * Max iters on partial_format * Fix partial_format * Clean up * fix test import * Fix test * update nightly to torch 2.3 (#2842) * update nightly to torch 2.3 * tighten --------- Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> * Pin sphinxcontrib applehelp (#2854) * pin release * bump * break pypi * tighter pin * pin * pin * pin * Update setup.py (#2855) * Torch 2.3 patch (#2849) * add monkeypatch for verify_options * patch * fix * fix * partial precommit * bit of cleanup * doc * debug * fix version pinning * precommit * checkdown * lint --------- Co-authored-by: Evan Racah <ejracah@gmail.com> Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> * Update mosaicml-cli requirement from <0.6,>=0.5.25 to >=0.5.25,<0.7 (#2866) Updates the requirements on [mosaicml-cli](https://github.com/mosaicml/mosaicml-cli) to permit the latest version. - [Commits](https://github.com/mosaicml/mosaicml-cli/commits) --- updated-dependencies: - dependency-name: mosaicml-cli dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Rewrite to use individual state functions (#2860) * checkdown * checkdown * lint * fix * load ignore keys * fix * resolve comments * fix load ignore keys * offload * fix gate * merge * lint * use flag * force trye * Add custom stopping criteria to ICL generate tasks (#2800) * add custome gen kwargs and stopping on eos token * modify test * modify test * finish * finish * finish * finish * finish pr * implement early stop * add tesT * fix bug * bug fix * add keys * diff split * fix typo * fix precommit * fix precommit * fix precommit * fix precommit * fix precommit * fix precommit * fix conditional import * add nlp metrics * remove code gen changes * fix nits --------- Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com> * Add save_ignore_keys (#2868) * comment * add it * debug * add the keys * debug * debug * remove print statement * docs and tests * fix tests --------- Co-authored-by: Daniel King <daniel@mosaicml.com> --------- Signed-off-by: chenmoneygithub <chen.qian@databricks.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Charles Tang <j316chuck@users.noreply.github.com> Co-authored-by: Anna <anna@mosaicml.com> Co-authored-by: Jeremy D <115047575+bmosaicml@users.noreply.github.com> Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> Co-authored-by: Chen Qian <chenmoney@google.com> Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com> Co-authored-by: coryMosaicML <83666378+coryMosaicML@users.noreply.github.com> Co-authored-by: Harsh Panchal <68880048+panchalhp-db@users.noreply.github.com> Co-authored-by: willgleich <22464726+willgleich@users.noreply.github.com> Co-authored-by: Irene Dea <deaairene@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: snarayan21 <saaketh@mosaicml.com> Co-authored-by: 
Jerry Chen <jerry.chen@databricks.com> Co-authored-by: Your Name <you@example.com> Co-authored-by: Evan Racah <ejracah@gmail.com> Co-authored-by: Abhinav Venigalla <abhi.venigalla@databricks.com> Co-authored-by: root <23239305+b-chu@users.noreply.github.com> Co-authored-by: Abhinav Venigalla <abhi@mosaicml.com> Co-authored-by: Evan Racah <evan@mosaicml.com> Co-authored-by: Daniel King <daniel@mosaicml.com>