[feature]Multi-GPU DistributedDataParallel Fixed #496

Merged
merged 5 commits into from
Oct 6, 2022

nzarif
Contributor

@nzarif nzarif commented Oct 5, 2022

Fixes #456

Goals ⚽

  1. Support multi-GPU DistributedDataParallel training for next-item-prediction tasks when training with the transformers4rec.Trainer class.
  2. Make sure the performance is as expected (i.e. faster training time than single GPU and DataParallel training).

Implementation Details 🚧

  • The dataset used with multi-GPU DistributedDataParallel must have at least as many partitions as there are GPUs, i.e. dataloader.dataset.npartitions >= dataloader.args.global_size. We must check the number of partitions of the dataset and repartition it if needed.
  • When computing metrics in DistributedDataParallel mode, torch.cat(self.metric_mean, 0) failed. It was replaced with torchmetrics.utilities.data.dim_zero_cat(), which has the same functionality but also works under DistributedDataParallel.
  • For the dataloader to work properly, correct values of global_rank and global_size must be passed to its constructor.
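
To illustrate why the partition count matters, here is a minimal sketch in plain Python. It assumes a simple round-robin assignment of partitions to workers. The helper names are hypothetical and the real dataloader may split partitions differently, but the consequence is the same: with fewer partitions than workers, some ranks receive no data.

```python
def partitions_for_rank(npartitions, global_rank, global_size):
    """Partition indices one worker would read under a round-robin split.

    Illustrative assumption only: this is not the dataloader's actual scheme.
    """
    return list(range(global_rank, npartitions, global_size))


def idle_ranks(npartitions, global_size):
    """Ranks that end up with no partitions at all."""
    return [
        rank
        for rank in range(global_size)
        if not partitions_for_rank(npartitions, rank, global_size)
    ]


if __name__ == "__main__":
    # Two partitions spread over four GPUs: the last two ranks have nothing to read.
    print(idle_ranks(npartitions=2, global_size=4))
    # Four or more partitions over four GPUs: every rank gets work.
    print(idle_ranks(npartitions=4, global_size=4))
```

A rank without data cannot take part in the gradient synchronization steps that the other ranks are waiting on, which is why the check has to happen before training starts.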

Testing Details 🔍

  • To test DistributedDataParallel training, make sure CUDA_VISIBLE_DEVICES is set correctly and run the script with the torch distributed launcher as shown below:

python -m torch.distributed.launch --nproc_per_node $N_GPUS your_script.py --arguments
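
On the script side, each launched process needs to find out which local GPU it owns. The sketch below is one hedged way to do that with the standard library only. The helper name resolve_local_rank is hypothetical. It accepts both the --local_rank and --local-rank spellings, since the spelling passed by the launcher depends on the PyTorch version, and falls back to the LOCAL_RANK environment variable.

```python
import argparse
import os


def resolve_local_rank(argv=None, env=None):
    """Return this process's local rank, or -1 when not launched in distributed mode.

    Hypothetical helper for illustration; not part of Transformers4Rec.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    # Accept both spellings of the launcher-provided argument.
    parser.add_argument(
        "--local_rank", "--local-rank", dest="local_rank", type=int, default=None
    )
    # Ignore the script's own arguments here; they are parsed elsewhere.
    args, _unknown = parser.parse_known_args(argv)
    if args.local_rank is not None:
        return args.local_rank
    return int(env.get("LOCAL_RANK", -1))


if __name__ == "__main__":
    print("local rank:", resolve_local_rank())
```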

@nzarif nzarif self-assigned this Oct 5, 2022
@nzarif nzarif changed the title Multi-GPU DistributedDataParallel Fixed [feature]Multi-GPU DistributedDataParallel Fixed Oct 5, 2022
@nvidia-merlin-bot
GitHub pull request #496 of commit b97a3eac5535f5419c18bc3836bc925d30a69323, no merge conflicts.
Running as SYSTEM
Setting status of b97a3eac5535f5419c18bc3836bc925d30a69323 to PENDING with url http://10.20.17.181:8080/job/transformers4rec_tests/198/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/transformers4rec_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git init /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/pull/496/*:refs/remotes/origin/pr/496/* # timeout=10
 > git rev-parse b97a3eac5535f5419c18bc3836bc925d30a69323^{commit} # timeout=10
Checking out Revision b97a3eac5535f5419c18bc3836bc925d30a69323 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f b97a3eac5535f5419c18bc3836bc925d30a69323 # timeout=10
Commit message: "Multi-GPU DistributedDataParallel Fixed"
 > git rev-list --no-walk f2a1cd5770f0d65274792b7142d4d8fd1b756761 # timeout=10
First time build. Skipping changelog.
[transformers4rec_tests] $ /bin/bash /tmp/jenkins7094064427877866399.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-4.0.0
collected 1 item

tests/unit/test_notebooks.py . [100%]

============================== 1 passed in 36.26s ==============================
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=2 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/Transformers4Rec/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[transformers4rec_tests] $ /bin/bash /tmp/jenkins186991052116221735.sh

@nzarif nzarif added the enhancement New feature or request label Oct 5, 2022
@nvidia-merlin-bot
GitHub pull request #496 of commit ff2561f94c0f1afcafcd39ef85b89eaf4e2bc7d2, no merge conflicts.
Running as SYSTEM
Setting status of ff2561f94c0f1afcafcd39ef85b89eaf4e2bc7d2 to PENDING with url http://10.20.17.181:8080/job/transformers4rec_tests/199/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/transformers4rec_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git init /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/pull/496/*:refs/remotes/origin/pr/496/* # timeout=10
 > git rev-parse ff2561f94c0f1afcafcd39ef85b89eaf4e2bc7d2^{commit} # timeout=10
Checking out Revision ff2561f94c0f1afcafcd39ef85b89eaf4e2bc7d2 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f ff2561f94c0f1afcafcd39ef85b89eaf4e2bc7d2 # timeout=10
Commit message: "fixed formatting problems"
 > git rev-list --no-walk b97a3eac5535f5419c18bc3836bc925d30a69323 # timeout=10
[transformers4rec_tests] $ /bin/bash /tmp/jenkins13360986045226705283.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-4.0.0
collected 1 item

tests/unit/test_notebooks.py . [100%]

============================== 1 passed in 36.21s ==============================
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=2 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/Transformers4Rec/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[transformers4rec_tests] $ /bin/bash /tmp/jenkins10117483466868974023.sh

@nvidia-merlin-bot
GitHub pull request #496 of commit d787c9ee19c27960c040c416a30e2d0c00a67b89, no merge conflicts.
Running as SYSTEM
Setting status of d787c9ee19c27960c040c416a30e2d0c00a67b89 to PENDING with url http://10.20.17.181:8080/job/transformers4rec_tests/200/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/transformers4rec_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git init /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/pull/496/*:refs/remotes/origin/pr/496/* # timeout=10
 > git rev-parse d787c9ee19c27960c040c416a30e2d0c00a67b89^{commit} # timeout=10
Checking out Revision d787c9ee19c27960c040c416a30e2d0c00a67b89 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f d787c9ee19c27960c040c416a30e2d0c00a67b89 # timeout=10
Commit message: "fixed with black and isort"
 > git rev-list --no-walk ff2561f94c0f1afcafcd39ef85b89eaf4e2bc7d2 # timeout=10
[transformers4rec_tests] $ /bin/bash /tmp/jenkins2715036970437953748.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-4.0.0
collected 1 item

tests/unit/test_notebooks.py . [100%]

============================== 1 passed in 36.39s ==============================
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=2 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/Transformers4Rec/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[transformers4rec_tests] $ /bin/bash /tmp/jenkins883940003929113715.sh

@github-actions

github-actions bot commented Oct 5, 2022

@@ -226,6 +226,9 @@ def __init__(

         self.set_dataset(buffer_size, engine, reader_kwargs)

+        if (global_rank is not None) and (self.dataset.npartitions < global_size):
Member

Add a warning here instructing the user to save the parquet files with multiple partitions (row groups) for better performance. The warning can include an example of how to save this way with pandas or cudf:
df.to_parquet("filename.parquet", row_group_size=10000, engine="pyarrow")
The final number of partitions = number of rows / row_group_size, rounded up.
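
The arithmetic behind that suggestion can be sketched as follows. Both helper names are hypothetical, and the sketch assumes a single parquet file in which each row group becomes one partition, as the comment above describes.

```python
import math


def expected_row_groups(num_rows, row_group_size):
    """Row groups written for num_rows rows; the last group may be partial."""
    return math.ceil(num_rows / row_group_size)


def suggest_row_group_size(num_rows, min_partitions):
    """A row_group_size that yields at least min_partitions row groups.

    Hypothetical helper: floor division guarantees the lower bound as long as
    there are at least as many rows as requested partitions.
    """
    if num_rows < min_partitions:
        raise ValueError("fewer rows than requested partitions")
    return num_rows // min_partitions


if __name__ == "__main__":
    # 100,000 rows saved with row_group_size=10000, as in the comment above.
    print(expected_row_groups(100_000, 10_000))
    # A row_group_size that gives at least 4 partitions for 4 GPUs.
    print(suggest_row_group_size(100_000, 4))
```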

Member

@nzarif I recommend replacing print() with LOG.warning() as in this example
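
A minimal sketch of that suggestion using the standard logging module. The logger name, helper name, and message wording are assumptions for illustration and are not the code merged in this PR.

```python
import logging

# Assumed logger name; the project may configure its logger differently.
LOG = logging.getLogger("transformers4rec")


def warn_if_too_few_partitions(npartitions, global_size, logger=LOG):
    """Log a warning when there are fewer partitions than workers.

    Returns True when the warning was emitted so the caller can repartition.
    """
    if npartitions < global_size:
        logger.warning(
            "Dataset has %d partition(s) but %d worker(s) were requested; "
            "consider saving the parquet file with a smaller row_group_size.",
            npartitions,
            global_size,
        )
        return True
    return False
```

Unlike print(), a logger call carries a severity level and can be filtered or redirected by the application that embeds the library.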

@@ -55,7 +56,7 @@ def update(self, preds: torch.Tensor, target: torch.Tensor, **kwargs): # type:

     def compute(self):
         # Computing the mean of the batch metrics (for each cut-off at topk)
-        return torch.cat(self.metric_mean, axis=0).mean(0)
+        return dim_zero_cat(self.metric_mean).mean(0)
Member

I think the fix is ok, as dim_zero_cat should be able to deal with both lists and tensors.
I am curious whether you have tried setting self.add_state(..., dist_reduce_fx="mean"), and whether it gives the same accuracy with faster compute: if metrics are averaged per GPU before being synced in compute(), less data has to be communicated among GPUs.
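
The trade-off raised here can be sketched in plain Python, without calling torchmetrics. The two helpers are hypothetical stand-ins for "gather everything, then average" and "average per GPU, then average the results", and each metric is simplified to one scalar per batch. Each helper returns the result together with the number of values that would have to cross GPUs.

```python
def global_mean(per_rank_values):
    """Gather every per-batch value from every rank, then average once."""
    gathered = [value for values in per_rank_values for value in values]
    return sum(gathered) / len(gathered), len(gathered)


def mean_of_rank_means(per_rank_values):
    """Average locally on each rank, then average the per-rank results."""
    local_means = [sum(values) / len(values) for values in per_rank_values]
    return sum(local_means) / len(local_means), len(local_means)


# Two ranks that each processed two batches.
balanced = [[0.2, 0.4], [0.6, 0.8]]
# Two ranks that processed different numbers of batches.
unbalanced = [[1.0], [0.0, 0.0, 0.0]]

if __name__ == "__main__":
    for name, data in (("balanced", balanced), ("unbalanced", unbalanced)):
        print(name, global_mean(data), mean_of_rank_means(data))
```

When every rank processes the same number of batches, both strategies agree and the second communicates one value per rank instead of one per batch. When the counts differ, the mean of per-rank means no longer equals the global mean, so the accuracy question in the comment depends on how evenly the data is split.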

@nvidia-merlin-bot
GitHub pull request #496 of commit 7e7e5e3e90d0c972e80303dc8f4422e132ac6b2a, no merge conflicts.
Running as SYSTEM
Setting status of 7e7e5e3e90d0c972e80303dc8f4422e132ac6b2a to PENDING with url http://10.20.17.181:8080/job/transformers4rec_tests/201/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/transformers4rec_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git init /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/pull/496/*:refs/remotes/origin/pr/496/* # timeout=10
 > git rev-parse 7e7e5e3e90d0c972e80303dc8f4422e132ac6b2a^{commit} # timeout=10
Checking out Revision 7e7e5e3e90d0c972e80303dc8f4422e132ac6b2a (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 7e7e5e3e90d0c972e80303dc8f4422e132ac6b2a # timeout=10
Commit message: "added user warning to repartition"
 > git rev-list --no-walk d787c9ee19c27960c040c416a30e2d0c00a67b89 # timeout=10
[transformers4rec_tests] $ /bin/bash /tmp/jenkins5153853890988784195.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-4.0.0
collected 1 item

tests/unit/test_notebooks.py . [100%]

============================== 1 passed in 36.28s ==============================
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=2 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/Transformers4Rec/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[transformers4rec_tests] $ /bin/bash /tmp/jenkins17329231584407662421.sh

@nvidia-merlin-bot
GitHub pull request #496 of commit 3dded907d3d42ef32f30c4e77126889d8e1b6af4, no merge conflicts.
Running as SYSTEM
Setting status of 3dded907d3d42ef32f30c4e77126889d8e1b6af4 to PENDING with url http://10.20.17.181:8080/job/transformers4rec_tests/207/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/transformers4rec_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git init /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/pull/496/*:refs/remotes/origin/pr/496/* # timeout=10
 > git rev-parse 3dded907d3d42ef32f30c4e77126889d8e1b6af4^{commit} # timeout=10
Checking out Revision 3dded907d3d42ef32f30c4e77126889d8e1b6af4 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 3dded907d3d42ef32f30c4e77126889d8e1b6af4 # timeout=10
Commit message: "changed print to logger.warning"
 > git rev-list --no-walk cd455a1ab814ca2f6332069cdb673f3e28200306 # timeout=10
[transformers4rec_tests] $ /bin/bash /tmp/jenkins1459850529842728832.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-4.0.0
collected 1 item

tests/unit/test_notebooks.py . [100%]

============================== 1 passed in 36.34s ==============================
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=2 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/Transformers4Rec/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[transformers4rec_tests] $ /bin/bash /tmp/jenkins4446099888074706268.sh

@gabrielspmoreira gabrielspmoreira merged commit 73b1e31 into main Oct 6, 2022

Successfully merging this pull request may close these issues.

[Task] Support of multi-gpu DistributedDataParallel training
3 participants