
Change FP8 Eval to default to activation dtype #3454

Merged: 87 commits merged into dev on Jul 11, 2024

Conversation

@j316chuck (Contributor) commented Jul 5, 2024

What does this PR do?

Turn off TE fp8 autocast for eval. This works around an incompatibility: Composer eval batches can have arbitrary shapes, whereas TE's fp8 kernels require batch dimensions divisible by 16.
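
For context, a minimal sketch of the idea, assuming an `fp8_autocast_enabled` flag that the trainer sets to False for eval; the function name and flag are illustrative, not necessarily Composer's exact API:

```python
# Sketch only: skip TE fp8 autocast for eval and fall back to the activation
# dtype (bf16), which accepts arbitrary batch shapes. The function name and
# the fp8_autocast_enabled flag are assumptions for illustration.
import torch

def get_eval_safe_precision_context(precision: str, fp8_autocast_enabled: bool):
    if precision == 'amp_fp8' and fp8_autocast_enabled:
        # Training path: TE fp8 kernels require dims divisible by 16.
        import transformer_engine.pytorch as te
        from transformer_engine.common.recipe import DelayedScaling
        return te.fp8_autocast(enabled=True, fp8_recipe=DelayedScaling())
    # Eval path: plain bf16 autocast works for any batch shape.
    return torch.autocast('cuda', dtype=torch.bfloat16)
```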

What issue(s) does this change relate to?

https://databricks.atlassian.net/browse/GRT-3023

Unit Tests

  • H100: WORLD_SIZE=1 EXTRA_ARGS='-k test_amp_fp8_eval_casts_to_bf16' make test-dist-gpu
  • A100: WORLD_SIZE=1 EXTRA_ARGS='-k test_amp_fp8_eval_casts_to_bf16' make test-dist-gpu ✔️ (skipped)

Regression Tests

  • Before: fp8-llama3-8b-metamath-4ep-3AOZfj 🔴 (fails on model eval)
  • After: fp8-llama3-8b-metamath-4ep-tIKVb1 ✅ (works on model eval)

Along for the ride

  • Fix a spurious test failure in the HF tokenizer test by adding retry logic to the test fixture (see the sketch below)
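
A hypothetical sketch of that retry fixture (the real fixture, tokenizer, and retry policy in the test suite may differ):

```python
# Hypothetical sketch: retry transient Hugging Face Hub failures a few times
# before giving up, so flaky network errors don't fail the test.
import time

import pytest
from transformers import AutoTokenizer

@pytest.fixture
def tiny_gpt2_tokenizer():
    last_err = None
    for attempt in range(3):
        try:
            return AutoTokenizer.from_pretrained('gpt2')
        except OSError as err:  # transient hub/network errors
            last_err = err
            time.sleep(2 ** attempt)
    raise last_err
```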

Chuck Tang and others added 29 commits June 25, 2024 14:49
Bumps [coverage[toml]](https://github.com/nedbat/coveragepy) from 7.5.3 to 7.5.4.
- [Release notes](https://github.com/nedbat/coveragepy/releases)
- [Changelog](https://github.com/nedbat/coveragepy/blob/master/CHANGES.rst)
- [Commits](nedbat/coveragepy@7.5.3...7.5.4)

---
updated-dependencies:
- dependency-name: coverage[toml]
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Updates the requirements on [psutil](https://github.com/giampaolo/psutil) to permit the latest version.
- [Changelog](https://github.com/giampaolo/psutil/blob/master/HISTORY.rst)
- [Commits](giampaolo/psutil@release-5.8.0...release-6.0.0)

---
updated-dependencies:
- dependency-name: psutil
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
* Add support for variable length dataloaders in dist training

* Remove test file

* Fix typo

* Fixed batch referenced before assignment

* Replace sentinel with None

* Add unit test

* Update unit test

* Reduce tensor creation to one line

Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>

* Remove requirement for gpu in test

---------

Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
* fold ema fsdp state

* debug

* debug

* more debug

* keep debugging

* debug

* sanity check

* debug

* debug

* use ema

* debug

* debug

* debug

* debug

* debug

* debug

* more fix

* filename test

* revert test

* fully parameterize

* hsdp test

* revert testing

* typo

* typo

* hsdp

* split off test

* precommit

* float to int

* pyright

* oom

* print

* rm tp

* tp cfg

* tp?

* rm tp line

* type annotation

* revert

* readd tp

* type

* world size

* revert

* revert monolithic cpkt + include sharded cpkt

* enumerate

* precommit

* precommit

* sharded

* sync

* only sync on first trainer

* typo

* hsdp

* xfail

* explicit sync

* test

* revert test

* sync, docker issue

* pre-commit

* sync

* pytest

* xfail

* rm world_size param

* im so sorry pls forgive me king

* the kings comments

* Update tests/trainer/test_fsdp_checkpoint.py

fix formatting

Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>

* precommit

---------

Co-authored-by: v-chen_data <v-chen_data@example.com>
Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com>
Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
* test

* skip if torch version less than 2.3

* typo in ema

* add remote

* comments

---------

Co-authored-by: v-chen_data <v-chen_data@example.com>
* remove save overwrite

* fix tests

* lint

* remove bad test
…load (#3436)

* lower the system metrics logging frequency

* more frequent
Updates the requirements on [paramiko](https://github.com/paramiko/paramiko) to permit the latest version.
- [Commits](paramiko/paramiko@2.11.0...3.4.0)

---
updated-dependencies:
- dependency-name: paramiko
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* test

* test

* test

* test

* test

* test

* fix

* sleep before skip

* fix

* pull request target

* revert

* revery pr_target branches

* sleep 1

* 10 sec

* uncomment

* dist barrier

* test

* dist works!

* update 0.0.9

* mihir comment

Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>

---------

Co-authored-by: v-chen_data <v-chen_data@example.com>
Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
@j316chuck j316chuck marked this pull request as ready for review July 6, 2024 02:21
@j316chuck j316chuck requested a review from mvpatel2000 July 11, 2024 00:57
Review threads (outdated, resolved): composer/core/precision.py, composer/trainer/trainer.py
@j316chuck j316chuck requested a review from mvpatel2000 July 11, 2024 06:49
@mvpatel2000 (Contributor) left a comment

LGTM! Ty :). Can you please ensure unit test runs on an interactive instance?

@j316chuck j316chuck enabled auto-merge (squash) July 11, 2024 19:20
@j316chuck j316chuck merged commit 10984a4 into dev Jul 11, 2024
14 checks passed
@j316chuck j316chuck deleted the chuck/fix_pytorch_patch branch July 11, 2024 20:28
@j316chuck j316chuck changed the title Fix pytorch eval for fp8 Change FP8 Eval for TE to default to activation data type Jul 11, 2024
@j316chuck j316chuck changed the title Change FP8 Eval for TE to default to activation data type Change FP8 Eval to default to activation data_type Jul 11, 2024
@j316chuck j316chuck changed the title Change FP8 Eval to default to activation data_type Change FP8 Eval to default to activation dtype Jul 11, 2024
mvpatel2000 pushed a commit to mvpatel2000/composer that referenced this pull request Jul 21, 2024