
Cropped the table and addressed comments from previous PR #510

Merged
merged 4 commits into main on Mar 21, 2023

Conversation

nzarif
Contributor

@nzarif nzarif commented Oct 25, 2022

Goals ⚽

  • Notify users about the impact of multi-GPU training on metrics and how to address it (tune the learning rate).
  • Cropped the empty white space out of the performance table.

@nzarif nzarif requested a review from sararb October 25, 2022 21:28
@nzarif nzarif changed the title cropped the perf table and addressed comments left from previous PR [documentation] Cropped the table and addressed comments from previous PR Oct 25, 2022
@nzarif nzarif changed the title [documentation] Cropped the table and addressed comments from previous PR Cropped the table and addressed comments from previous PR Oct 25, 2022
@nzarif nzarif added the documentation Improvements or additions to documentation label Oct 25, 2022
@github-actions

These experiments used a machine with 2 <i>Tesla V100-SXM2-32GB-LS</i> GPUs.

<b>Note:</b> Using Data-Parallel training can impact metrics such as NDCG and recall. Therefore, Users are advised to tune the learning rate for better accuracy.
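For readers looking for a concrete starting point, here is a minimal sketch (an illustration, not code from this PR or from the benchmark script) of a plain PyTorch DistributedDataParallel setup where the single-GPU learning rate is linearly rescaled by the number of workers before any further tuning; the toy model and training loop are placeholders:

```python
# Minimal sketch (assumption, not from this PR): a toy DistributedDataParallel
# script launched with `torchrun --nproc_per_node=<num_gpus> train_toy.py`.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

BASE_LR = 1e-3       # learning rate tuned for a single GPU
PER_GPU_BATCH = 256  # every GPU sees this many examples per step


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model standing in for the real session-based model.
    model = DDP(torch.nn.Linear(128, 1).cuda(local_rank),
                device_ids=[local_rank])

    # Linear scaling by the number of workers is only a common starting point;
    # as the note above says, NDCG/recall may still need the LR to be tuned.
    lr = BASE_LR * dist.get_world_size()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for _ in range(10):  # toy training loop on random data
        x = torch.randn(PER_GPU_BATCH, 128, device=local_rank)
        y = torch.randn(PER_GPU_BATCH, 1, device=local_rank)
        loss = torch.nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```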
Contributor

@rnyak rnyak Oct 28, 2022


Can we add some recommendations/more clarification about that? For example, if we increase the GPU count 2x, do we expect the learning rate to be increased or reduced accordingly?

Can you make "Users" lower case? Like "Therefore, users are ..."

Can you please add a learning_rate column to the table and record which learning rate was used in the different configurations based on the number of GPUs? That's the critical part. Thanks.

Contributor Author


In most cases we should decrease the learning rate as we increase the number of GPUs, but I did not observe any specific pattern in how the learning rate must be changed. It differs based on model type, batch_size, etc. For example, I saw one model where the same metrics could be achieved on 2 GPUs with the same learning rate, and another where, with 2 GPUs, the learning rate had to be multiplied by 0.75.
I can add learning_rate to the table, but would that be enough? Should I mention any other parameters too? I have not modified all the different parameters when running the script, so I was not 100% sure whether the learning rate depends only on nr_gpu + model_type + batch_size or on other factors as well.
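To make the tuning procedure concrete, here is a sketch of the kind of small sweep I have in mind (not the exact procedure behind the table); `train_and_evaluate` is a hypothetical stand-in for launching the benchmark with a given learning rate and returning a metric such as NDCG@10:

```python
# Sketch of a learning-rate sweep (an assumption, not the procedure behind the
# published numbers). `train_and_evaluate` is a hypothetical placeholder that
# would run the multi-GPU benchmark with the given LR and return NDCG@10.
BASE_LR = 1e-3  # learning rate that worked well on a single GPU


def train_and_evaluate(lr: float) -> float:
    raise NotImplementedError("replace with a call to the benchmark script")


results = {}
for multiplier in (0.5, 0.75, 1.0, 1.5, 2.0):
    lr = BASE_LR * multiplier
    results[lr] = train_and_evaluate(lr)

best_lr = max(results, key=results.get)
print(f"best LR for this GPU count / model / batch size: {best_lr:.2e}")
```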

Member


@nzarif I understand from this new statement that this unpredictable effect of the LR applies to DataParallel and not to DistributedDataParallel, right? Can we just keep the LR we use with a single GPU when using DistributedDataParallel, and do the results (loss/accuracy) match?

Contributor Author


The loss does not change much between single GPU, DataParallel, and DistributedDataParallel, but accuracy and NDCG change considerably. This happens for both DataParallel and DistributedDataParallel, assuming that in each case every GPU receives the same batch size. Because each GPU processes the same batch size on each step, the effective batch size processed at once is larger when more than one GPU is used. Even so, the same learning_rate does not produce the same accuracy/NDCG/recall for DataParallel and DistributedDataParallel, and I believe the reason lies in the different ways these modes aggregate results after processing a batch.
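For context, a minimal PyTorch sketch (toy model, not the benchmark code) of how the two modes wrap a model and where their aggregation differs:

```python
import torch
from torch import nn

model = nn.Linear(128, 1).cuda()  # toy stand-in for the real model

# DataParallel: a single process splits each incoming batch across the visible
# GPUs, runs the forward pass on per-GPU replicas, and gathers the outputs back
# on the primary GPU; gradients flow back into the one original module.
dp_model = nn.DataParallel(model)

# DistributedDataParallel: one process per GPU, each feeding its own batch;
# during backward() the per-process gradients are averaged across workers via
# an all-reduce. It requires torch.distributed.init_process_group() plus a
# launcher such as torchrun, so the line below is illustrative only here.
# ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

# With N GPUs and the same per-GPU batch size B, both modes process an
# effective batch of N * B per optimizer step, which is why the single-GPU
# learning rate usually needs retuning.
```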

Contributor Author


I did play around with the learning_rate for a couple of models and I could find learning rates that gave the same metrics for each of the modes. I did not record those learning rates back then, but if you would like, I can redo the runs and come up with some examples of learning rates that produce similar metrics for each of the training modes (single GPU, DataParallel, and DistributedDataParallel).

Contributor


@nzarif I'd say adding the LR is important, and if you used WandB they should all be registered there in your experiments, so you can get them from there. In addition, you need to write down which dataset you used when preparing the table; it is missing. If you generated the data synthetically, you can add the generation script to the README as well; if you used the YooChoose dataset, please mention it.
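If the numbers were produced on synthetically generated data, here is a hypothetical sketch of the kind of generation script the README could point to (the real benchmark's schema, sizes, and column names may differ):

```python
# Hypothetical synthetic session-interaction generator; it only illustrates
# the sort of script the README could reference, not the data behind the table.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_interactions = 100_000

df = pd.DataFrame({
    "session_id": rng.integers(0, 10_000, n_interactions),
    "item_id": rng.integers(1, 50_000, n_interactions),
    "timestamp": rng.integers(1_600_000_000, 1_600_864_000, n_interactions),
})

# Order interactions within each session so they form a usable sequence.
df = df.sort_values(["session_id", "timestamp"]).reset_index(drop=True)
df.to_parquet("synthetic_sessions.parquet")  # needs pyarrow or fastparquet
```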

@nvidia-merlin-bot

Click to view CI Results
GitHub pull request #510 of commit e943edd2ff80c338fe63a2e17c3539c31d186289, no merge conflicts.
Running as SYSTEM
Setting status of e943edd2ff80c338fe63a2e17c3539c31d186289 to PENDING with url http://merlin-infra1.nvidia.com:8080/job/transformers4rec_tests/241/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/transformers4rec_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git init /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/pull/510/*:refs/remotes/origin/pr/510/* # timeout=10
 > git rev-parse e943edd2ff80c338fe63a2e17c3539c31d186289^{commit} # timeout=10
Checking out Revision e943edd2ff80c338fe63a2e17c3539c31d186289 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f e943edd2ff80c338fe63a2e17c3539c31d186289 # timeout=10
Commit message: "Merge branch 'main' into multi_gpu_doc_fix"
 > git rev-list --no-walk a399724271a5c77c0fc25f9873afc1456e003f6e # timeout=10
[transformers4rec_tests] $ /bin/bash /tmp/jenkins18015599307978890609.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec
plugins: anyio-3.6.1, xdist-3.0.2, cov-4.0.0
collected 1 item

tests/unit/test_notebooks.py . [100%]

============================== 1 passed in 35.20s ==============================
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=2 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/Transformers4Rec/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[transformers4rec_tests] $ /bin/bash /tmp/jenkins15984097388296273611.sh

@nzarif nzarif self-assigned this Nov 23, 2022
@karlhigley
Contributor

@rnyak @gabrielspmoreira Is this PR still relevant or has it been superseded by other work? Should we push it forward toward merging or close it as outdated?

@karlhigley karlhigley merged commit 985f551 into main Mar 21, 2023