Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multi-node distributed training, submitit & composer integration demo #1753

Merged
merged 12 commits into from
Jan 9, 2023

Conversation

YilunKuang
Copy link
Contributor

@YilunKuang YilunKuang commented Nov 24, 2022

What does this PR do?

This PR adds a Jupyter notebook demo of how to submit a multi-node distributed training job on the SLURM cluster without using the composer launcher. Specifically, I show how to use submitit with composer in the multi-node distributed training setting. submitit automates SLURM cluster job submission in python scripts without any shell scripts, and this implementation shows how to properly set up the distributed training environment variables in python.

What issue(s) does this change relate to?

It's related to some of the requests posted on the MosaicML slack channel.

Before submitting

  • [Yes] Have you read the contributor guidelines?
  • [Yes] Is this change a documentation change or typo fix? If so, skip the rest of this checklist.
  • Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
  • Did you update any related docs and document your change?
  • Did you update any related tests and add any new tests related to your change? (see testing)
  • Did you run the tests locally to make sure they pass?
  • Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@mvpatel2000
Copy link
Contributor

@YilunKuang thanks for opening this PR! Just wanted to confirm this is ready for review / we can take a look at it?

CC: @dakinggg @bandish-shah

@YilunKuang
Copy link
Contributor Author

@YilunKuang thanks for opening this PR! Just wanted to confirm this is ready for review / we can take a look at it?

CC: @dakinggg @bandish-shah

@mvpatel2000 Yes, this is ready for review. The only change I made is I added another Jupiter notebook as a guide to do distributed training with composer without the composer launcher and I didn't change the rest of the codebase.

If this guide is accepted, I would recommend adding a reference to this guide in the Distributed Training doc https://docs.mosaicml.com/en/stable/notes/distributed_training.html

@bandish-shah
Copy link
Contributor

Hi @YilunKuang this is great thanks for submitting this PR with an example for submitit. Unfortunately I think we'll have to mask out this specific notebook from running through our typical CI tests since we don't have any SLURM infra setup to test with.

Have you tested this notebook out on your end? If so could you please post some screen shots of the output a user is expected to see in the description of this PR? We can use that as a stop gap for any users interested in leveraging this example as well as evidence the that notebook code is functional, at least at the time of this submission.

@YilunKuang
Copy link
Contributor Author

Hi @YilunKuang this is great thanks for submitting this PR with an example for submitit. Unfortunately I think we'll have to mask out this specific notebook from running through our typical CI tests since we don't have any SLURM infra setup to test with.

Have you tested this notebook out on your end? If so could you please post some screen shots of the output a user is expected to see in the description of this PR? We can use that as a stop gap for any users interested in leveraging this example as well as evidence the that notebook code is functional, at least at the time of this submission.

@bandish-shah Sure no problem. I just thought this might be helpful for someone else working with the SLURM system. Here is a screenshot of the output:

Screen Shot 2022-12-12 at 20 46 36

Screen Shot 2022-12-12 at 20 46 50

These outputs are not generated directly from the Jupyter notebook but are generated by running the last code block in the Jupyter notebook as a python file.

Also, I ran my script in a multi-node setting (2 nodes, each with 4 GPUs) so there are in total 8 output files from SLURM. Let me know and I can send over the whole zip file containing output from 8 GPUs.

@bandish-shah
Copy link
Contributor

@YilunKuang it's very helpful as we've had others ask questions around how to use Composer with SLURM. Thank you again for doing this!

@bandish-shah
Copy link
Contributor

bandish-shah commented Dec 13, 2022

@YilunKuang it looks like there's a few minor CI issues that need to be addressed before we can let this through.

  1. Failing code quality checks, this can likely addressed by running the pre-commit hooks locally: https://github.com/mosaicml/composer/blob/dev/STYLE_GUIDE.md#12-pre-commit-hooks
  2. Failing doctest, need to manually add the example to the Tutorials toctree, please see: https://github.com/mosaicml/composer/blob/dev/docs/source/index.rst#L73

@YilunKuang
Copy link
Contributor Author

2. https://github.com/mosaicml/composer/blob/dev/docs/source/index.rst#L73

@bandish-shah Glad to help!

For 1, it seems like I cannot access the notion page. I got a message saying "You do not have access to MosaicML. Please contact an admin to add you as a member".

For 2, thanks I will try to add the additional changes to the pull request these two days

@bandish-shah
Copy link
Contributor

Typo on my end, meant to post a link to our style guide which details how to run pre-commit hooks locally, not a notion page 😅. URL fixed in the original comment.

@YilunKuang
Copy link
Contributor Author

Typo on my end, meant to post a link to our style guide which details how to run pre-commit hooks locally, not a notion page 😅. URL fixed in the original comment.

@bandish-shah Gotcha thanks! Will push my changes this weekend.

@YilunKuang
Copy link
Contributor Author

@bandish-shah Just checked the two things you mentioned! My last commit (db9a8b9) seems to pass all the test, but when I manually merged the latest change from the dev branch with this commit 0c3aa3e something failed. Can you point me to what I should do about it? Thanks!

@dakinggg
Copy link
Contributor

@YilunKuang Looks like it was a transient error :) I merged dev again and it looks like everything is passing, thanks!

@YilunKuang
Copy link
Contributor Author

@dakinggg Thanks for looking into it! Feel free to close this pull request.

@dakinggg dakinggg enabled auto-merge (squash) January 9, 2023 23:28
@dakinggg dakinggg merged commit ab0ae48 into mosaicml:dev Jan 9, 2023
@YilunKuang YilunKuang deleted the multinode_submitit_demo branch January 10, 2023 00:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants