
Removed cuda memory allocation which was causing CUDA OOM #103

Merged: 4 commits into mosaicml:main from cuda_oom on Dec 29, 2022

Conversation

@karan6181 (Collaborator) commented Dec 23, 2022

Description of changes:

  • Removed the CUDA memory allocation that was used to synchronize shuffle_seed and prefix_int across ranks (a sketch of this pattern follows the list below).
  • Because of how the randomness is derived, the initial value of shuffle_seed or prefix_int is already the same across all ranks, so there is no need for a dist.broadcast operation.
  • The user still needs to set a seed to get deterministic behavior for single-node and multi-node training, which is the same behavior as before.
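
For context, a rough sketch of the kind of pattern this PR removes, under the assumption that the old code broadcast the seed with the NCCL backend (function and variable names here are illustrative, not the actual code):

```python
# Illustrative sketch only, not the exact code that was removed.
import torch
import torch.distributed as dist

def broadcast_seed_old(shuffle_seed: int, device: torch.device) -> int:
    """Sync a seed from rank 0 to all ranks via an NCCL broadcast."""
    # Broadcasting over NCCL requires a CUDA tensor, so every rank allocates
    # GPU memory (and a CUDA context) just to agree on a small integer,
    # which can contribute to CUDA OOM before training even starts.
    tensor = torch.tensor([shuffle_seed], dtype=torch.int64, device=device)
    dist.broadcast(tensor, src=0)
    return int(tensor.item())

# After this change there is no broadcast at all: every rank derives the same
# shuffle_seed and prefix_int locally, provided the user seeds randomness.
```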

Issue #, if available:

CO-1591

Merge Checklist:

Put an x without space in the boxes that apply. If you are unsure about any checklist, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • I have read the contributor guidelines
  • This is a documentation change or typo fix. If so, skip the rest of this checklist.
  • I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the MosaicML team.
  • I have updated any necessary documentation, including README and API docs (if appropriate).

Tests

  • I ran pre-commit on my change. (check out the pre-commit section of prerequisites)
  • I have added tests that prove my fix is effective or that my feature works (if appropriate).
  • I ran the tests locally to make sure they pass. (check out testing)
  • I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes.

@dakinggg (Contributor)

QQ: Is it actually guaranteed that prefix_int and shuffle_seed are the same across all ranks (including multi node)? I think this may only be true if the main training script seeds randomness.

@karan6181 changed the title from "Removed cuda memory allocation which was causing CUDA OOM" to "[WIP] Removed cuda memory allocation which was causing CUDA OOM" on Dec 23, 2022
@dakinggg (Contributor) left a comment

As discussed offline, I suggest seeding the randomness explicitly or creating an rng instance, so as not to rely on randomness being the same across nodes/ranks. Creating the rng instance might be better so as not to affect a user's other random seeding; not sure.
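
A minimal sketch of the two alternatives suggested here, using numpy purely for illustration (none of these names come from the PR):

```python
import numpy as np

# Option 1: seed global randomness explicitly. Deterministic, but it also
# affects every other consumer of np.random in the user's script.
np.random.seed(1234)
order_a = np.random.permutation(8)

# Option 2: a dedicated RNG instance. The dataset's shuffling stays
# deterministic without touching the user's global random state.
rng = np.random.default_rng(1234)
order_b = rng.permutation(8)
```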

Review threads on streaming/base/dataset.py (3) and streaming/base/util.py (1), all outdated and resolved.
@karan6181 changed the title from "[WIP] Removed cuda memory allocation which was causing CUDA OOM" to "Removed cuda memory allocation which was causing CUDA OOM" on Dec 28, 2022
@karan6181 (Collaborator, Author)

> QQ: Is it actually guaranteed that prefix_int and shuffle_seed are the same across all ranks (including multi node)? I think this may only be true if the main training script seeds randomness.

Yes, as of now, the user needs to set the seeds for deterministic behavior. If the user sets the seeds, then the behavior is deterministic on single-node and multi-node training.

@dakinggg (Contributor)

Should we set a default shuffle seed? Help the user avoid surprising outcomes, like having a different shuffle on each node. (Fwiw, I don't think our examples in the examples repo explicitly set the shuffle seed)

@knighton (Contributor)

Can we just broadcast seed over gloo or something? Non-deterministic is a bug
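
For reference, a sketch of what a CPU-only gloo broadcast of the seed might look like; this is an assumption about the alternative being floated here, not code from this PR:

```python
import torch.distributed as dist

def broadcast_seed_gloo(local_seed: int) -> int:
    # Assumes torch.distributed is already initialized for the job.
    # A gloo group keeps the broadcast on CPU, so no CUDA memory is allocated,
    # but it adds extra distributed machinery (the concern raised below).
    gloo_group = dist.new_group(backend='gloo')
    payload = [local_seed if dist.get_rank() == 0 else None]
    dist.broadcast_object_list(payload, src=0, group=gloo_group)
    return payload[0]
```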

@dakinggg (Contributor)

Is there a problem with setting a default shuffle seed? To avoid adding unnecessary distributed stuff. Iirc having a separate gloo process group on mcloud is non trivial.

@knighton (Contributor)

Good points... sorry am in Guanajuato with terrible reception... cannot really field code reviews atm... defer to what y'all think is best

@abhi-mosaic (Contributor)

Would also prefer to avoid calling torch.distributed ! Greatly reduces the number of potential bugs. A default seed seems ok with me, unless someone can think of potential bugs?

@karan6181 (Collaborator, Author)

> Should we set a default shuffle seed? Help the user avoid surprising outcomes, like having a different shuffle on each node. (Fwiw, I don't think our examples in the examples repo explicitly set the shuffle seed)

Yes, we can. As of today, shuffle_seed defaults to None so that every run gets a different data ordering; if the user needs deterministic behavior, they have to set a seed value in their code, though they don't necessarily have to set shuffle_seed explicitly. If the user is using Composer, they can simply call composer.utils.reproducibility.seed_all(seed) in their script.

Currently, I can't think of an application where the user needs a different data ordering in every run, and if they do, they can set shuffle_seed to a different value for each run.

Commit a8ef59c adds the default shuffle_seed value.
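
A usage sketch of the behavior described above, assuming the current `from streaming import StreamingDataset` entry point; the StreamingDataset arguments and the seed value 1234 are illustrative, and the actual default shuffle_seed is whatever commit a8ef59c sets:

```python
from composer.utils import reproducibility
from streaming import StreamingDataset

# Seeding everything up front makes the shuffle deterministic across
# ranks and nodes, matching the behavior before this PR.
reproducibility.seed_all(1234)

# Alternatively, pass shuffle_seed directly; with the default added in
# a8ef59c, even unseeded runs no longer shuffle differently on each node.
dataset = StreamingDataset(local='/tmp/cache', shuffle=True, shuffle_seed=1234)
```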

@karan6181 (Collaborator, Author)

> Would also prefer to avoid calling torch.distributed ! Greatly reduces the number of potential bugs. A default seed seems ok with me, unless someone can think of potential bugs?

Agreed. Adding torch.distributed calls might surface some unknown bugs. With a specific or default shuffle_seed, the user gets deterministic behavior, and they can get non-deterministic behavior by using a different seed value in each run.

@dakinggg (Contributor) left a comment

LGTM, thanks!

Review thread on streaming/base/dataset.py, outdated and resolved.
@karan6181 merged commit b7dbadd into mosaicml:main on Dec 29, 2022
@karan6181 deleted the cuda_oom branch on December 29, 2022 06:01