[BUG] trainer.model.module renamed and DataParallel mode fixed #483
Conversation
CI Results: GitHub pull request #483, commit 0aaf6e3643798429cfe19231ffac705f616621a8, no merge conflicts. Jenkins unit test run #193 (commit message: "trainer.model.module renamed and DataParallel mode fixed") — pytest 7.1.2 on Python 3.8.10, collected 1 item.
Documentation preview: https://nvidia-merlin.github.io/Transformers4Rec/review/pr-483
Awesome @nzarif. Thanks for the fix!
It is worth highlighting that you tested this fix with our integration tests, which use the script for paper reproducibility, and they started using multiple GPUs when available. You mentioned to me that only TransformerXL (`transfoxl`) raised an error in `DataParallel` mode, which is a known issue we had identified in the past (May 2021) during the paper experiments, as documented in the Notes here.
CI Results: GitHub pull request #483, commit e5d579aa58d5b906996d3ff66c4a1587cebc4352, no merge conflicts. Jenkins unit test run #194 (commit message: "Merge branch 'main' into dataparallel_fix") — pytest 7.1.2 on Python 3.8.10, collected 1 item.
Fixed #473
status/needs-review
Goals ⚽
The `Trainer` was using only one GPU even when DataParallel mode was set. This PR fixes that, so training now runs on all available GPUs when DataParallel mode is used.
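For background, a minimal sketch (not code from this PR) of what DataParallel mode is expected to do once the fix is in: with more than one visible GPU, `torch.nn.DataParallel` replicates the model and splits each batch across devices.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 64)

# DataParallel replicates the model on every visible GPU, scatters each
# input batch across the replicas, and gathers the outputs on device 0.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to("cuda")

batch = torch.randn(32, 128, device="cuda")
out = model(batch)  # with 2 GPUs, each replica sees a 16-sample shard
print(out.shape)    # torch.Size([32, 64])
```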
Implementation Details 🚧
The `Trainer` wraps the input model using the `HFWrapper` class, which had an attribute named `module`. This collides with the attribute of the same name that the HF transformers `Trainer` class relies on, and the collision prevented the HF transformers `Trainer` from wrapping the input model in `torch.nn.DataParallel` within its `_wrap_model()` function. Renaming the `HFWrapper` attribute to something other than `module` solves the problem.
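A minimal sketch of the collision, under the assumption that HF transformers detects already-wrapped models by looking for a `.module` attribute (the convention used by `DataParallel`/`DistributedDataParallel`). The class names, the renamed attribute `wrapper_module`, and the simplified `unwrap_model` helper below are illustrative, not the PR's exact code:

```python
import torch.nn as nn

def unwrap_model(model: nn.Module) -> nn.Module:
    # Simplified stand-in for the HF transformers unwrap helper: anything
    # exposing a `.module` attribute is treated as a wrapper around the
    # real model and is unwrapped recursively.
    if hasattr(model, "module"):
        return unwrap_model(model.module)
    return model

class BuggyHFWrapper(nn.Module):
    def __init__(self, model: nn.Module):
        super().__init__()
        self.module = model  # collides with the wrapper convention above

class FixedHFWrapper(nn.Module):
    def __init__(self, model: nn.Module):
        super().__init__()
        self.wrapper_module = model  # renamed; no longer looks pre-wrapped

inner = nn.Linear(4, 4)

# With the colliding name, the wrapped model looks "already wrapped", so
# the HF Trainer skips the torch.nn.DataParallel step in _wrap_model().
assert unwrap_model(BuggyHFWrapper(inner)) is inner

# With the renamed attribute, the wrapper is seen as an unwrapped model
# and gets wrapped in DataParallel as intended.
fixed = FixedHFWrapper(inner)
assert unwrap_model(fixed) is fixed
```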
Testing Details 🔍
Use any test that includes `Trainer.train()` and follow these steps (a sketch of the flow follows the list):
1. Set the `CUDA_VISIBLE_DEVICES=0,1` env variable.
2. Run `trainer.train()`.
3. Use `nvidia-smi` to monitor both GPUs being used during training.
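A minimal sketch of those steps in one place, assuming an already-built Transformers4Rec `Trainer` instance is passed in (model, schema, and data arguments omitted); the `train_and_verify` helper name is ours, not part of the library:

```python
import os

# Step 1: expose two GPUs; must be set before CUDA is initialized.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1")

import torch

def train_and_verify(trainer):
    """Run training and confirm PyTorch sees both GPUs.

    `trainer` is assumed to be a Transformers4Rec Trainer constructed
    elsewhere; its setup is omitted here.
    """
    print(f"visible GPUs: {torch.cuda.device_count()}")  # expect 2
    # Step 2: with the fix, HF's _wrap_model() applies DataParallel here.
    trainer.train()
    # Step 3: while train() runs, execute `nvidia-smi` in another shell
    # and check that both GPU 0 and GPU 1 show memory usage/utilization.
```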