[feature]Multi-GPU DistributedDataParallel Fixed #496
Click to view CI ResultsGitHub pull request #496 of commit b97a3eac5535f5419c18bc3836bc925d30a69323, no merge conflicts. Running as SYSTEM Setting status of b97a3eac5535f5419c18bc3836bc925d30a69323 to PENDING with url http://10.20.17.181:8080/job/transformers4rec_tests/198/ and message: 'Build started for merge commit.' Using context: Jenkins Unit Test Run Building on master in workspace /var/jenkins_home/workspace/transformers4rec_tests using credential nvidia-merlin-bot Cloning the remote Git repository Cloning repository https://github.com/NVIDIA-Merlin/Transformers4Rec.git > git init /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git > git --version # timeout=10 using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/pull/496/*:refs/remotes/origin/pr/496/* # timeout=10 > git rev-parse b97a3eac5535f5419c18bc3836bc925d30a69323^{commit} # timeout=10 Checking out Revision b97a3eac5535f5419c18bc3836bc925d30a69323 (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f b97a3eac5535f5419c18bc3836bc925d30a69323 # timeout=10 Commit message: "Multi-GPU DistributedDataParallel Fixed" > git rev-list --no-walk 
f2a1cd5770f0d65274792b7142d4d8fd1b756761 # timeout=10 First time build. Skipping changelog. [transformers4rec_tests] $ /bin/bash /tmp/jenkins7094064427877866399.sh ============================= test session starts ============================== platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.0.0 rootdir: /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-4.0.0 collected 1 item |
Click to view CI ResultsGitHub pull request #496 of commit ff2561f94c0f1afcafcd39ef85b89eaf4e2bc7d2, no merge conflicts. Running as SYSTEM Setting status of ff2561f94c0f1afcafcd39ef85b89eaf4e2bc7d2 to PENDING with url http://10.20.17.181:8080/job/transformers4rec_tests/199/ and message: 'Build started for merge commit.' Using context: Jenkins Unit Test Run Building on master in workspace /var/jenkins_home/workspace/transformers4rec_tests using credential nvidia-merlin-bot Cloning the remote Git repository Cloning repository https://github.com/NVIDIA-Merlin/Transformers4Rec.git > git init /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git > git --version # timeout=10 using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/pull/496/*:refs/remotes/origin/pr/496/* # timeout=10 > git rev-parse ff2561f94c0f1afcafcd39ef85b89eaf4e2bc7d2^{commit} # timeout=10 Checking out Revision ff2561f94c0f1afcafcd39ef85b89eaf4e2bc7d2 (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f ff2561f94c0f1afcafcd39ef85b89eaf4e2bc7d2 # timeout=10 Commit message: "fixed formatting problems" > git rev-list --no-walk b97a3eac5535f5419c18bc3836bc925d30a69323 # 
timeout=10 [transformers4rec_tests] $ /bin/bash /tmp/jenkins13360986045226705283.sh ============================= test session starts ============================== platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.0.0 rootdir: /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-4.0.0 collected 1 item |
Click to view CI ResultsGitHub pull request #496 of commit d787c9ee19c27960c040c416a30e2d0c00a67b89, no merge conflicts. Running as SYSTEM Setting status of d787c9ee19c27960c040c416a30e2d0c00a67b89 to PENDING with url http://10.20.17.181:8080/job/transformers4rec_tests/200/ and message: 'Build started for merge commit.' Using context: Jenkins Unit Test Run Building on master in workspace /var/jenkins_home/workspace/transformers4rec_tests using credential nvidia-merlin-bot Cloning the remote Git repository Cloning repository https://github.com/NVIDIA-Merlin/Transformers4Rec.git > git init /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git > git --version # timeout=10 using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/pull/496/*:refs/remotes/origin/pr/496/* # timeout=10 > git rev-parse d787c9ee19c27960c040c416a30e2d0c00a67b89^{commit} # timeout=10 Checking out Revision d787c9ee19c27960c040c416a30e2d0c00a67b89 (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f d787c9ee19c27960c040c416a30e2d0c00a67b89 # timeout=10 Commit message: "fixed with black and isort" > git rev-list --no-walk ff2561f94c0f1afcafcd39ef85b89eaf4e2bc7d2 # 
timeout=10 [transformers4rec_tests] $ /bin/bash /tmp/jenkins2715036970437953748.sh ============================= test session starts ============================== platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.0.0 rootdir: /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-4.0.0 collected 1 item |
Documentation preview: https://nvidia-merlin.github.io/Transformers4Rec/review/pr-496
@@ -226,6 +226,9 @@ def __init__(

        self.set_dataset(buffer_size, engine, reader_kwargs)

        if (global_rank is not None) and (self.dataset.npartitions < global_size):
Add a warning here instructing the user to save the parquet files with multiple partitions (row groups) for better performance. The warning can include an example of how to save with pandas or cudf:
df.to_parquet("filename.parquet", row_group_size=10000, engine="pyarrow")
The final number of partitions = number of rows / row_group_size, rounded up.
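The arithmetic above can be sketched in plain Python. This is only an illustration: `row_group_size_for` and `n_row_groups` are hypothetical helpers, not part of Transformers4Rec, and the sketch assumes (as the comment does) that each row group ends up as one partition, which may not hold for every data loader configuration.

```python
import math


def n_row_groups(n_rows: int, row_group_size: int) -> int:
    """Number of row groups produced when n_rows are written in chunks of row_group_size."""
    return math.ceil(n_rows / row_group_size)


def row_group_size_for(n_rows: int, n_gpus: int) -> int:
    """Largest row_group_size that still yields at least n_gpus row groups.

    Hypothetical helper: assumes one row group becomes one partition.
    """
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    if n_rows < n_gpus:
        raise ValueError("cannot create more row groups than there are rows")
    # Floor division guarantees ceil(n_rows / size) >= n_gpus.
    return n_rows // n_gpus


if __name__ == "__main__":
    size = row_group_size_for(n_rows=1_000_000, n_gpus=8)
    print(size, n_row_groups(1_000_000, size))
    # The value could then be passed to the call from the comment above:
    # df.to_parquet("filename.parquet", row_group_size=size, engine="pyarrow")
```

In practice a smaller `row_group_size` than this upper bound is usually fine; the bound only shows the point below which every GPU can receive at least one partition.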
@@ -55,7 +56,7 @@ def update(self, preds: torch.Tensor, target: torch.Tensor, **kwargs): # type:

     def compute(self):
         # Computing the mean of the batch metrics (for each cut-off at topk)
-        return torch.cat(self.metric_mean, axis=0).mean(0)
+        return dim_zero_cat(self.metric_mean).mean(0)
I think the fix is OK, as dim_zero_cat might be able to deal with both lists and tensors.
I am curious whether you have tried setting self.add_state(..., dist_reduce_fx="mean") and whether it gives the same accuracy with faster compute: if metrics are averaged per GPU before being synced in compute(), less data has to be communicated among GPUs.
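One thing worth checking before switching reductions is whether averaging per GPU first gives the same number as concatenating everything and averaging once. The sketch below uses plain Python lists in place of tensors and does not call torchmetrics at all; the function names and sample values are made up for illustration. It shows that the two approaches agree when every GPU holds the same number of batch metrics and can differ otherwise.

```python
from statistics import fmean


def global_mean(per_gpu_values):
    """Gather every per-batch metric from every GPU, then average once."""
    all_values = [value for gpu_values in per_gpu_values for value in gpu_values]
    return fmean(all_values)


def mean_of_gpu_means(per_gpu_values):
    """Average on each GPU first, then average the per-GPU results."""
    return fmean([fmean(gpu_values) for gpu_values in per_gpu_values])


if __name__ == "__main__":
    # Two GPUs with the same number of batch metrics: both approaches agree.
    balanced = [[0.2, 0.4], [0.6, 0.8]]
    print(global_mean(balanced), mean_of_gpu_means(balanced))

    # Uneven batch counts: the per-GPU average over-weights the smaller shard.
    unbalanced = [[0.2, 0.4, 0.6], [0.8]]
    print(global_mean(unbalanced), mean_of_gpu_means(unbalanced))
```

So a per-GPU mean would likely only match the concatenated result when the data is split evenly across ranks, which ties back to the partitioning check discussed in this PR.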
Click to view CI ResultsGitHub pull request #496 of commit 7e7e5e3e90d0c972e80303dc8f4422e132ac6b2a, no merge conflicts. Running as SYSTEM Setting status of 7e7e5e3e90d0c972e80303dc8f4422e132ac6b2a to PENDING with url http://10.20.17.181:8080/job/transformers4rec_tests/201/ and message: 'Build started for merge commit.' Using context: Jenkins Unit Test Run Building on master in workspace /var/jenkins_home/workspace/transformers4rec_tests using credential nvidia-merlin-bot Cloning the remote Git repository Cloning repository https://github.com/NVIDIA-Merlin/Transformers4Rec.git > git init /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git > git --version # timeout=10 using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/pull/496/*:refs/remotes/origin/pr/496/* # timeout=10 > git rev-parse 7e7e5e3e90d0c972e80303dc8f4422e132ac6b2a^{commit} # timeout=10 Checking out Revision 7e7e5e3e90d0c972e80303dc8f4422e132ac6b2a (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f 7e7e5e3e90d0c972e80303dc8f4422e132ac6b2a # timeout=10 Commit message: "added user warning to repartition" > git rev-list --no-walk 
d787c9ee19c27960c040c416a30e2d0c00a67b89 # timeout=10 [transformers4rec_tests] $ /bin/bash /tmp/jenkins5153853890988784195.sh ============================= test session starts ============================== platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.0.0 rootdir: /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-4.0.0 collected 1 item |
Click to view CI ResultsGitHub pull request #496 of commit 3dded907d3d42ef32f30c4e77126889d8e1b6af4, no merge conflicts. Running as SYSTEM Setting status of 3dded907d3d42ef32f30c4e77126889d8e1b6af4 to PENDING with url http://10.20.17.181:8080/job/transformers4rec_tests/207/ and message: 'Build started for merge commit.' Using context: Jenkins Unit Test Run Building on master in workspace /var/jenkins_home/workspace/transformers4rec_tests using credential nvidia-merlin-bot Cloning the remote Git repository Cloning repository https://github.com/NVIDIA-Merlin/Transformers4Rec.git > git init /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git > git --version # timeout=10 using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/pull/496/*:refs/remotes/origin/pr/496/* # timeout=10 > git rev-parse 3dded907d3d42ef32f30c4e77126889d8e1b6af4^{commit} # timeout=10 Checking out Revision 3dded907d3d42ef32f30c4e77126889d8e1b6af4 (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f 3dded907d3d42ef32f30c4e77126889d8e1b6af4 # timeout=10 Commit message: "changed print to logger.warning" > git rev-list --no-walk 
cd455a1ab814ca2f6332069cdb673f3e28200306 # timeout=10 [transformers4rec_tests] $ /bin/bash /tmp/jenkins1459850529842728832.sh ============================= test session starts ============================== platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.0.0 rootdir: /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-4.0.0 collected 1 item |
Fixes #456
Goals ⚽
DistributedDataParallel training for next-item-prediction tasks using the transformers4rec.Trainer class for training.
[…] DataParallel training).

Implementation Details 🚧
[…] DistributedDataParallel needs a number of partitions equal to or greater than the number of GPUs, i.e. dataloader.dataset.npartitions >= dataloader.args.global_size. We must check the number of partitions of the dataset and re-partition it if needed.
In DistributedDataParallel mode, torch.cat(self.metric_mean, 0) failed. It was replaced by torchmetrics.utilities.data.dim_zero_cat(), which has the same functionality but also works for DistributedDataParallel.
[…] global_rank and global_size must be passed to its constructor.

Testing Details 🔍
For DistributedDataParallel training, make sure CUDA_VISIBLE_DEVICES is set correctly and run the script using torch distributed launch as shown below:
python -m torch.distributed.launch --nproc_per_node $N_GPUS your_script.py --arguments