
Set device in dataloaders #654

Merged · 1 commit · Mar 22, 2023
Conversation

@edknv (Collaborator) commented on Mar 22, 2023

Fixes #651

Goals ⚽

Fix multi-gpu training notebook.

Implementation Details 🚧

  • Depends on dataloader#113 (Put row lengths on the same device on gpu).
  • device is set to match local_rank.
  • Without dropping the last batch (dataloader_drop_last=True), recsys_trainer.evaluate hangs. This probably needs investigation, since it didn't happen before the list-column refactoring in merlin-dataloader (see ticket). A sketch of this setting appears after this list.
  • torch.distributed.launch is replaced with torchrun because the former has been deprecated.
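For illustration, a minimal sketch of the drop-last setting, assuming the notebook configures training through T4RecTrainingArguments (the output directory and batch sizes are illustrative, not taken from this PR; dataloader_drop_last itself is inherited from the Hugging Face TrainingArguments base class):

from transformers4rec.config.trainer import T4RecTrainingArguments

training_args = T4RecTrainingArguments(
    output_dir="./tmp",
    per_device_train_batch_size=256,
    per_device_eval_batch_size=32,
    # Without dropping the last (possibly smaller) batch,
    # recsys_trainer.evaluate hangs after the merlin-dataloader
    # list-column refactoring noted above.
    dataloader_drop_last=True,
)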

Testing Details 🔍

Manually tested in nvcr.io/nvidia/merlin/merlin-pytorch:23.02 by installing the main branch of all Merlin libraries via pip install . --no-deps, then running the 01 notebook followed by the 03 notebook.


@edknv edknv self-assigned this Mar 22, 2023
@edknv edknv added the bug Something isn't working label Mar 22, 2023
@edknv edknv added this to the Merlin 23.03 milestone Mar 22, 2023
@edknv edknv requested a review from rnyak March 22, 2023 06:58
@rnyak (Contributor) commented on Mar 22, 2023

@edknv thanks for the quick fix. I have just one comment. This doc says that if you use torchrun, you should change your training script to read from the LOCAL_RANK environment variable, as demonstrated by the following code snippet:

import os
local_rank = int(os.environ["LOCAL_RANK"])

What do you think? Does it make a big difference or not?

@karlhigley karlhigley merged commit ff5d304 into NVIDIA-Merlin:main Mar 22, 2023
@edknv (Collaborator, Author) commented on Mar 22, 2023

> @edknv thanks for the quick fix. I have just one comment. This doc says that if you use torchrun, you should change your training script to read from the LOCAL_RANK environment variable, as demonstrated by the following code snippet:
>
> import os
> local_rank = int(os.environ["LOCAL_RANK"])
>
> What do you think? Does it make a big difference or not?

In our case, it doesn't look like it makes a difference. From what I can tell, torchrun makes use of the local rank automatically. I think the doc is saying that if you need to use the local_rank variable in your script, read it with local_rank = int(os.environ["LOCAL_RANK"]). Our script does not make use of this variable, so I didn't include it.
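For reference, a minimal sketch of the pattern that doc describes, assuming a script launched with torchrun (the script name train.py and the GPU count are illustrative):

# Launching: torchrun --nproc_per_node=2 train.py
# (replaces the deprecated: python -m torch.distributed.launch --nproc_per_node=2 train.py)
import os
import torch

# torchrun exports LOCAL_RANK (along with RANK and WORLD_SIZE) into each
# worker's environment; reading it is only needed if the script itself has
# to pin tensors or models to a specific GPU.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)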

Labels: bug (Something isn't working)
Projects: none yet
Development: successfully merging this pull request may close [BUG] Multi-gpu training notebook is giving error if we generate schema from core (#651).
3 participants