
Removed cuda memory allocation which was causing CUDA OOM #103

Merged: 4 commits into mosaicml:main from cuda_oom on Dec 29, 2022

Conversation

@karan6181 (Collaborator) commented Dec 23, 2022

Description of changes:

  • Removed the CUDA memory allocation that was used to synchronize shuffle_seed and prefix_int across ranks (a sketch of this pattern follows the list below).
  • Because of how the randomness is derived, the initial value of shuffle_seed or prefix_int is already the same across all ranks, so there is no need for a dist.broadcast operation.
  • The user still needs to set a seed to get deterministic behavior for single-node and multi-node training, which is the same behavior as before.
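
For context, a rough sketch of the kind of pattern this PR removes, under the assumption that the old code broadcast the seed with the NCCL backend (function and variable names here are illustrative, not the actual code):

```python
# Illustrative sketch only, not the exact code that was removed.
import torch
import torch.distributed as dist

def broadcast_seed_old(shuffle_seed: int, device: torch.device) -> int:
    """Sync a seed from rank 0 to all ranks via an NCCL broadcast."""
    # Broadcasting over NCCL requires a CUDA tensor, so every rank allocates
    # GPU memory (and a CUDA context) just to agree on a small integer,
    # which can contribute to CUDA OOM before training even starts.
    tensor = torch.tensor([shuffle_seed], dtype=torch.int64, device=device)
    dist.broadcast(tensor, src=0)
    return int(tensor.item())

# After this change there is no broadcast at all: every rank derives the same
# shuffle_seed and prefix_int locally, provided the user seeds randomness.
```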

Issue #, if available:

CO-1591

Merge Checklist:

Put an x without space in the boxes that apply. If you are unsure about any checklist, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • I have read the contributor guidelines
  • This is a documentation change or typo fix. If so, skip the rest of this checklist.
  • I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the MosaicML team.
  • I have updated any necessary documentation, including README and API docs (if appropriate).

Tests

  • I ran pre-commit on my change. (check out the pre-commit section of prerequisites)
  • I have added tests that prove my fix is effective or that my feature works (if appropriate).
  • I ran the tests locally to make sure they pass. (check out testing)
  • I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes.

@dakinggg (Contributor)

QQ: Is it actually guaranteed that prefix_int and shuffle_seed are the same across all ranks (including multi node)? I think this may only be true if the main training script seeds randomness.

@karan6181 changed the title from "Removed cuda memory allocation which was causing CUDA OOM" to "[WIP] Removed cuda memory allocation which was causing CUDA OOM" on Dec 23, 2022
@dakinggg (Contributor) left a comment

As discussed offline, I suggest seeding the randomness explicitly or creating an rng instance, so as not to rely on randomness being the same across nodes/ranks. Creating the rng instance might be better so as not to affect a user's other random seeding; not sure.
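
A minimal sketch of the two alternatives suggested here, using numpy purely for illustration (none of these names come from the PR):

```python
import numpy as np

# Option 1: seed global randomness explicitly. Deterministic, but it also
# affects every other consumer of np.random in the user's script.
np.random.seed(1234)
order_a = np.random.permutation(8)

# Option 2: a dedicated RNG instance. The dataset's shuffling stays
# deterministic without touching the user's global random state.
rng = np.random.default_rng(1234)
order_b = rng.permutation(8)
```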

Review threads on streaming/base/dataset.py (3) and streaming/base/util.py (1), all outdated and resolved.
@karan6181 changed the title from "[WIP] Removed cuda memory allocation which was causing CUDA OOM" to "Removed cuda memory allocation which was causing CUDA OOM" on Dec 28, 2022
@karan6181 (Collaborator, Author)

> QQ: Is it actually guaranteed that prefix_int and shuffle_seed are the same across all ranks (including multi node)? I think this may only be true if the main training script seeds randomness.

Yes, as of now, the user needs to set the seeds for deterministic behavior. If the user sets the seeds, then the behavior is deterministic on single-node and multi-node training.

@dakinggg (Contributor)

Should we set a default shuffle seed? Help the user avoid surprising outcomes, like having a different shuffle on each node. (Fwiw, I don't think our examples in the examples repo explicitly set the shuffle seed)

@knighton (Contributor)

Can we just broadcast seed over gloo or something? Non-deterministic is a bug
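
For reference, a sketch of what a CPU-only gloo broadcast of the seed might look like; this is an assumption about the alternative being floated here, not code from this PR:

```python
import torch.distributed as dist

def broadcast_seed_gloo(local_seed: int) -> int:
    # Assumes torch.distributed is already initialized for the job.
    # A gloo group keeps the broadcast on CPU, so no CUDA memory is allocated,
    # but it adds extra distributed machinery (the concern raised below).
    gloo_group = dist.new_group(backend='gloo')
    payload = [local_seed if dist.get_rank() == 0 else None]
    dist.broadcast_object_list(payload, src=0, group=gloo_group)
    return payload[0]
```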

@dakinggg (Contributor)

Is there a problem with setting a default shuffle seed? To avoid adding unnecessary distributed stuff. Iirc having a separate gloo process group on mcloud is non trivial.

@knighton (Contributor)

Good points... sorry am in Guanajuato with terrible reception... cannot really field code reviews atm... defer to what y'all think is best

@abhi-mosaic (Contributor)

Would also prefer to avoid calling torch.distributed ! Greatly reduces the number of potential bugs. A default seed seems ok with me, unless someone can think of potential bugs?

@karan6181 (Collaborator, Author)

> Should we set a default shuffle seed? Help the user avoid surprising outcomes, like having a different shuffle on each node. (Fwiw, I don't think our examples in the examples repo explicitly set the shuffle seed)

Yes, we can. As of today, shuffle_seed defaults to None so that every run gets a different data ordering; if the user needs deterministic behavior, they have to set a seed value in their code, though they don't necessarily have to set shuffle_seed explicitly. If the user is using Composer, they can simply call composer.utils.reproducibility.seed_all(seed) in their script.

Currently, I can't think of an application where the user needs a different data ordering in every run, and if they do, they can set shuffle_seed to a different value for each run.

Commit a8ef59c adds the default shuffle_seed value.
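
A usage sketch of the behavior described above, assuming the current `from streaming import StreamingDataset` entry point; the StreamingDataset arguments and the seed value 1234 are illustrative, and the actual default shuffle_seed is whatever commit a8ef59c sets:

```python
from composer.utils import reproducibility
from streaming import StreamingDataset

# Seeding everything up front makes the shuffle deterministic across
# ranks and nodes, matching the behavior before this PR.
reproducibility.seed_all(1234)

# Alternatively, pass shuffle_seed directly; with the default added in
# a8ef59c, even unseeded runs no longer shuffle differently on each node.
dataset = StreamingDataset(local='/tmp/cache', shuffle=True, shuffle_seed=1234)
```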

@karan6181 (Collaborator, Author)

> Would also prefer to avoid calling torch.distributed ! Greatly reduces the number of potential bugs. A default seed seems ok with me, unless someone can think of potential bugs?

Agreed. Adding torch.distributed calls might surface some unknown bugs. With a specific or default shuffle_seed, the user gets deterministic behavior, and they can get non-deterministic behavior by using a different seed value in each run.

@dakinggg (Contributor) left a comment

LGTM, thanks!

Review thread on streaming/base/dataset.py, outdated and resolved.
@karan6181 merged commit b7dbadd into mosaicml:main on Dec 29, 2022
@karan6181 deleted the cuda_oom branch on December 29, 2022 06:01