[BUG] trainer.model.module renamed and DataParallel mode fixed #483

Merged (2 commits) on Sep 8, 2022

@nzarif (Contributor) commented Sep 1, 2022

Fixed #473
status/needs-review

Goals ⚽

The Trainer used only one GPU when DataParallel mode was set. With this fix, training runs on all available GPUs in DataParallel mode.

Implementation Details 🚧

The Trainer wraps the input model in the HFWrapper class, which has an attribute named module. This overrides an attribute of the same name from the HF transformers Trainer class. The overridden attribute prevents the HF transformers Trainer class from wrapping the input model in torch.nn.DataParallel within its _wrap_model() function. Renaming the HFWrapper attribute to something other than module solves the problem.
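
To illustrate the name clash, here is a minimal sketch in plain Python with no torch or transformers dependency. It assumes the multi-GPU wrapping step treats any object that exposes a `module` attribute as already wrapped. That is one plausible reading of the behavior described above, not a copy of the HF transformers code. All class and function names below (`WrapperWithModule`, `WrapperRenamed`, `wrapped_model`, `FakeDataParallel`, `unwrap`, `wrap_for_multi_gpu`) are hypothetical stand-ins.

```python
class InnerModel:
    """Stand-in for the user's model (hypothetical)."""


class WrapperWithModule:
    """Mimics a wrapper that stores the model under the name `module`."""

    def __init__(self, model):
        self.module = model


class WrapperRenamed:
    """Same wrapper, with the attribute renamed (name is hypothetical)."""

    def __init__(self, model):
        self.wrapped_model = model


class FakeDataParallel:
    """Stand-in for torch.nn.DataParallel, which also exposes `.module`."""

    def __init__(self, model):
        self.module = model


def unwrap(model):
    """Follow `.module` attributes until none is left."""
    while hasattr(model, "module"):
        model = model.module
    return model


def wrap_for_multi_gpu(model, n_gpu):
    """Assumed logic: skip wrapping if the model already looks wrapped."""
    if unwrap(model) is not model:
        return model  # treated as already wrapped, so no DataParallel
    if n_gpu > 1:
        return FakeDataParallel(model)
    return model


before = wrap_for_multi_gpu(WrapperWithModule(InnerModel()), n_gpu=2)
after = wrap_for_multi_gpu(WrapperRenamed(InnerModel()), n_gpu=2)
print(type(before).__name__)  # WrapperWithModule
print(type(after).__name__)   # FakeDataParallel
```

Under this assumption, the wrapper with a `module` attribute is returned unchanged, so only one GPU would be used, while the renamed wrapper is handed to the data-parallel stand-in.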

Testing Details 🔍

Use any test that calls Trainer.train() and follow these steps:

  1. Set the CUDA_VISIBLE_DEVICES=0,1 environment variable.
  2. Train the model with trainer.train().
  3. Use nvidia-smi to confirm that both GPUs are in use during training.
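
As an optional pre-flight check for step 1, a small helper can confirm that at least two devices were requested before training starts. This is only a sketch: the helper names `requested_gpu_ids` and `check_multi_gpu_request` are hypothetical and not part of Transformers4Rec. It reads the environment variable only and does not query the GPUs themselves, so nvidia-smi is still needed for step 3.

```python
import os


def requested_gpu_ids(environ=None):
    """Parse CUDA_VISIBLE_DEVICES into a list of device IDs.

    Returns an empty list when the variable is unset or empty. Note that an
    unset variable normally means all GPUs are visible, which this helper
    cannot detect on its own.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("CUDA_VISIBLE_DEVICES", "")
    return [part.strip() for part in raw.split(",") if part.strip()]


def check_multi_gpu_request(environ=None):
    """Raise if fewer than two devices are listed in CUDA_VISIBLE_DEVICES."""
    ids = requested_gpu_ids(environ)
    if len(ids) < 2:
        raise RuntimeError(
            "Expected at least two devices in CUDA_VISIBLE_DEVICES, got: %r" % ids
        )
    return ids


# Example with an explicit mapping instead of the real environment.
print(check_multi_gpu_request({"CUDA_VISIBLE_DEVICES": "0,1"}))  # ['0', '1']
```

Calling `check_multi_gpu_request()` with no argument right before `trainer.train()` reads the real process environment.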

@nvidia-merlin-bot

Click to view CI Results
GitHub pull request #483 of commit 0aaf6e3643798429cfe19231ffac705f616621a8, no merge conflicts.
Running as SYSTEM
Setting status of 0aaf6e3643798429cfe19231ffac705f616621a8 to PENDING with url http://10.20.17.181:8080/job/transformers4rec_tests/193/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/transformers4rec_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git init /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/pull/483/*:refs/remotes/origin/pr/483/* # timeout=10
 > git rev-parse 0aaf6e3643798429cfe19231ffac705f616621a8^{commit} # timeout=10
Checking out Revision 0aaf6e3643798429cfe19231ffac705f616621a8 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 0aaf6e3643798429cfe19231ffac705f616621a8 # timeout=10
Commit message: "trainer.model.module renamed and DataParallel mode fixed"
 > git rev-list --no-walk 9978964cc04329b588d8d56a59904aee49af58eb # timeout=10
[transformers4rec_tests] $ /bin/bash /tmp/jenkins16552449295514341877.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1 item

tests/unit/test_notebooks.py . [100%]

============================== 1 passed in 36.93s ==============================
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=2 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/Transformers4Rec/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[transformers4rec_tests] $ /bin/bash /tmp/jenkins5492005579345750188.sh

@nzarif nzarif changed the title trainer.model.module renamed and DataParallel mode fixed [bug]trainer.model.module renamed and DataParallel mode fixed Sep 1, 2022
@nzarif nzarif changed the title [bug]trainer.model.module renamed and DataParallel mode fixed [BUG] trainer.model.module renamed and DataParallel mode fixed Sep 1, 2022
@nzarif nzarif added the bug Something isn't working label Sep 1, 2022

@gabrielspmoreira (Member) left a comment

Awesome, @nzarif. Thanks for the fix!

It is worth highlighting that you have tested this fix with our integration tests, which use the script for paper reproducibility, and they started using multiple GPUs when available. You mentioned to me that only TransformerXL (transfoxl) raised an error in DataParallel mode, which is a known issue we had identified in the past (May 2021) during the paper experiments, as documented in the Notes here

@nvidia-merlin-bot

Click to view CI Results
GitHub pull request #483 of commit e5d579aa58d5b906996d3ff66c4a1587cebc4352, no merge conflicts.
Running as SYSTEM
Setting status of e5d579aa58d5b906996d3ff66c4a1587cebc4352 to PENDING with url http://10.20.17.181:8080/job/transformers4rec_tests/194/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/transformers4rec_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git init /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/pull/483/*:refs/remotes/origin/pr/483/* # timeout=10
 > git rev-parse e5d579aa58d5b906996d3ff66c4a1587cebc4352^{commit} # timeout=10
Checking out Revision e5d579aa58d5b906996d3ff66c4a1587cebc4352 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f e5d579aa58d5b906996d3ff66c4a1587cebc4352 # timeout=10
Commit message: "Merge branch 'main' into dataparallel_fix"
 > git rev-list --no-walk 0aaf6e3643798429cfe19231ffac705f616621a8 # timeout=10
[transformers4rec_tests] $ /bin/bash /tmp/jenkins2568516019992388527.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1 item

tests/unit/test_notebooks.py . [100%]

============================== 1 passed in 36.99s ==============================
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=2 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/Transformers4Rec/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[transformers4rec_tests] $ /bin/bash /tmp/jenkins1731631416843921066.sh

Labels
bug Something isn't working
Development

Successfully merging this pull request may close these issues.

[BUG] DataParallel training with Trainer is not using multiple GPUs
6 participants