Add nccl env var #1695

Merged Nov 3, 2022 (2 commits)

@mvpatel2000 commented Nov 3, 2022

What does this PR do?

The timeout variable was previously ignored. This PR makes it take effect by setting the appropriate environment variables. From the PyTorch distributed docs:

"""
timeout (timedelta, optional) – Timeout for operations executed against the process group. Default value equals 30 minutes. This is applicable for the gloo backend. For nccl, this is applicable only if the environment variable NCCL_BLOCKING_WAIT or NCCL_ASYNC_ERROR_HANDLING is set to 1.
"""

Also updates the default dist_timeout to match PyTorch's default (30 minutes).
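
For illustration, here is a minimal sketch of the idea, not the code in this PR. The helper name `configure_nccl_timeout` is hypothetical, and choosing `NCCL_ASYNC_ERROR_HANDLING` over `NCCL_BLOCKING_WAIT` is an assumption; the docs quoted above accept either one.

```python
import datetime
import os


def configure_nccl_timeout(timeout_seconds: float = 1800.0) -> datetime.timedelta:
    """Hypothetical helper: let the NCCL backend honor the process-group timeout."""
    # Assumption: NCCL_ASYNC_ERROR_HANDLING is the variable to set.
    # setdefault keeps any value the user has already exported.
    os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")
    # 1800 seconds mirrors the 30-minute PyTorch default.
    return datetime.timedelta(seconds=timeout_seconds)


timeout = configure_nccl_timeout()
# With PyTorch installed, the value would then be passed along as:
#   torch.distributed.init_process_group(backend="nccl", timeout=timeout)
```

The environment variable has to be set before the process group is initialized, which is why the helper runs first and only then hands back the timeout.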

What issue(s) does this change relate to?

CO-1356

@bandish-shah left a comment

👍

@mvpatel2000 merged commit 96495a5 into mosaicml:dev Nov 3, 2022
@mvpatel2000 deleted the mvpatel2000/nccl-vars branch November 3, 2022 18:22
bandish-shah pushed a commit to bandish-shah/composer that referenced this pull request Nov 10, 2022
* add env var

* rerun tests, transient error