Fixes #651
Goals ⚽
Fix multi-gpu training notebook.
Implementation Details 🚧
- `device` is set identical to `local_rank`.
- … `dataloader_drop_last=True`), `recsys_trainer.evaluate` hangs. Probably need to investigate this because this didn't happen before the list column refactoring in merlin-dataloader (see ticket).
- `torch.distributed.launch` is replaced with `torchrun` because the former has been deprecated.
Testing Details 🔍
Manually tested in `nvcr.io/nvidia/merlin/merlin-pytorch:23.02` by installing the main branch of all Merlin libraries via `pip install . --no-deps`. Run the 01 notebook first and then the 03 notebook.
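For readers unfamiliar with the `device` / `local_rank` change listed under the implementation details, here is a minimal sketch of how a worker started by `torchrun` can derive its device from its local rank. It relies on `torchrun` exporting the `LOCAL_RANK` environment variable to each worker. The helper names `resolve_local_rank` and `device_for_rank` are hypothetical and not taken from the notebook, and the `cuda:<index>` string is only one assumed way of expressing the device.

```python
import os


def resolve_local_rank(env=None):
    """Return this process's local rank as an int, defaulting to 0.

    torchrun exports LOCAL_RANK for every worker it spawns. A plain
    single-process run has no such variable, so we fall back to 0.
    """
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", 0))


def device_for_rank(local_rank):
    """Build a CUDA device string whose index equals the local rank (assumed format)."""
    return f"cuda:{local_rank}"


# Hypothetical usage: keep the device index identical to the local rank.
local_rank = resolve_local_rank()
device = device_for_rank(local_rank)
print(f"local_rank={local_rank} device={device}")
```

When a script like this is started with `torchrun` and more than one process per node, each worker should see a different `LOCAL_RANK` and therefore pick a different device. Run without `torchrun`, it falls back to rank 0.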