
Set device in dataloaders #654

Merged · 1 commit · Mar 22, 2023
Conversation

@edknv (Collaborator) commented on Mar 22, 2023

Fixes #651

Goals ⚽

Fix multi-gpu training notebook.

Implementation Details 🚧

  • Depends on dataloader#113 (Put row lengths on the same device on gpu).
  • device is set to match local_rank.
  • Without dropping the last batch (dataloader_drop_last=True), recsys_trainer.evaluate hangs. This probably needs investigation, since it didn't happen before the list-column refactoring in merlin-dataloader (see ticket). A sketch of this setting appears after this list.
  • torch.distributed.launch is replaced with torchrun because the former has been deprecated.
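For illustration, a minimal sketch of the drop-last setting, assuming the notebook configures training through T4RecTrainingArguments (the output directory and batch sizes are illustrative, not taken from this PR; dataloader_drop_last itself is inherited from the Hugging Face TrainingArguments base class):

from transformers4rec.config.trainer import T4RecTrainingArguments

training_args = T4RecTrainingArguments(
    output_dir="./tmp",
    per_device_train_batch_size=256,
    per_device_eval_batch_size=32,
    # Without dropping the last (possibly smaller) batch,
    # recsys_trainer.evaluate hangs after the merlin-dataloader
    # list-column refactoring noted above.
    dataloader_drop_last=True,
)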

Testing Details 🔍

Manually tested in nvcr.io/nvidia/merlin/merlin-pytorch:23.02 by installing the main branch of all Merlin libraries via pip install . --no-deps, then running the 01 notebook followed by the 03 notebook.


@edknv edknv self-assigned this Mar 22, 2023
@edknv edknv added the bug Something isn't working label Mar 22, 2023
@edknv edknv added this to the Merlin 23.03 milestone Mar 22, 2023
@edknv edknv requested a review from rnyak March 22, 2023 06:58
@rnyak (Contributor) commented on Mar 22, 2023

@edknv thanks for the quick fix. I have just one comment. This doc says that if you use torchrun, you should change your training script to read from the LOCAL_RANK environment variable, as demonstrated by the following code snippet:

import os
local_rank = int(os.environ["LOCAL_RANK"])

What do you think? Does it make a big difference or not?

@karlhigley karlhigley merged commit ff5d304 into NVIDIA-Merlin:main Mar 22, 2023
@edknv (Collaborator, Author) commented on Mar 22, 2023

> @edknv thanks for the quick fix. I have just one comment. This doc says that if you use torchrun, you should change your training script to read from the LOCAL_RANK environment variable, as demonstrated by the following code snippet:
>
> import os
> local_rank = int(os.environ["LOCAL_RANK"])
>
> What do you think? Does it make a big difference or not?

In our case, it doesn't look like it makes a difference. From what I can tell, torchrun makes use of the local rank automatically. I think the doc is saying that if you need to use the local_rank variable in your script, read it with local_rank = int(os.environ["LOCAL_RANK"]). Our script does not make use of this variable, so I didn't include it.
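For reference, a minimal sketch of the pattern that doc describes, assuming a script launched with torchrun (the script name train.py and the GPU count are illustrative):

# Launching: torchrun --nproc_per_node=2 train.py
# (replaces the deprecated: python -m torch.distributed.launch --nproc_per_node=2 train.py)
import os
import torch

# torchrun exports LOCAL_RANK (along with RANK and WORLD_SIZE) into each
# worker's environment; reading it is only needed if the script itself has
# to pin tensors or models to a specific GPU.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)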

Labels: bug (Something isn't working)
Projects: none yet
Development: successfully merging this pull request may close [BUG] Multi-gpu training notebook is giving error if we generate schema from core (#651).
3 participants