Cropped the table and addressed comments from previous PR #510
Conversation
Documentation preview: https://nvidia-merlin.github.io/Transformers4Rec/review/pr-510
These experiments used a machine with 2 <i>Tesla V100-SXM2-32GB-LS</i> GPUs.

<b>Note:</b> Using Data-Parallel training can impact metrics such as NDCG and recall. Therefore, Users are advised to tune the learning rate for better accuracy.
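As a rough illustration of why retuning can be needed (a sketch only; the per-GPU batch size below is a made-up value, not the benchmarked configuration): with data-parallel training the per-GPU batch size stays fixed, so the effective batch size per optimizer step grows with the number of GPUs, and the single-GPU learning rate is no longer guaranteed to give the same NDCG/recall.

```python
import torch

# Hypothetical per-GPU batch size; not the value used in the benchmark table.
per_gpu_batch_size = 256
n_gpus = max(torch.cuda.device_count(), 1)

# With data-parallel training each GPU processes its own batch, so one
# optimizer step effectively consumes n_gpus * per_gpu_batch_size samples.
effective_batch_size = per_gpu_batch_size * n_gpus
print(f"{n_gpus} GPU(s) -> effective batch size {effective_batch_size}")
# Because the effective batch size changes, the learning rate tuned for a
# single GPU should be re-tuned (see the discussion below).
```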
Can we add some recommendations/more clarification about that? For example, if we increase the GPU count 2x, do we expect the learning rate to be increased or reduced accordingly?
Can you make `Users` lower case? Like "Therefore, users are ..."

Can you please add a `learning_rate` column to the table and put what learning rate was used in the different configurations based on the number of GPUs? That's the critical part. Thanks.
In most cases we should decrease the learning rate as we increase the number of GPUs, but I did not observe any specific pattern in how the learning rate must be changed. It differs based on model type, batch_size, etc. For example, I saw a model where the same metrics could be achieved on 2 GPUs with the same learning rate, and I also saw an example where for 2 GPUs the learning rate had to be multiplied by 0.75.

I can add `learning_rate` to the table, but would that be enough? Should I mention any other parameters too? I have not modified all the different parameters when running the script, so I was not 100% sure whether the learning rate depends only on nr_gpu + model_type + batch_size or not.
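To make that point concrete, here is a minimal sketch (all values below are hypothetical placeholders, not the ones measured in these experiments) of keeping per-configuration tuned learning rates in a lookup table rather than assuming a single scaling rule:

```python
# Hypothetical tuned learning rates per (model_type, per_gpu_batch_size, n_gpus).
# The numbers are placeholders to illustrate the bookkeeping, not measured values.
TUNED_LR = {
    ("xlnet", 256, 1): 1e-3,
    ("xlnet", 256, 2): 7.5e-4,  # e.g. the "multiply by 0.75" case mentioned above
    ("gpt2",  256, 2): 1e-3,    # e.g. a case where the single-GPU LR still worked
}

def lookup_lr(model_type: str, per_gpu_batch_size: int, n_gpus: int) -> float:
    """Return the tuned learning rate for a configuration, or fail loudly if untuned."""
    try:
        return TUNED_LR[(model_type, per_gpu_batch_size, n_gpus)]
    except KeyError:
        raise ValueError(
            f"No tuned LR recorded for {model_type} / batch {per_gpu_batch_size} / "
            f"{n_gpus} GPU(s); re-tune instead of assuming a scaling rule."
        )

print(lookup_lr("xlnet", 256, 2))
```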
@nzarif I understand from this new statement that this unpredictable effect of LR applies to `DataParallel` and not to `DistributedDataParallel`, right? Can we just keep the LR we use with a single GPU when using `DistributedDataParallel` and the results (loss/accuracy) match?
Loss does not change much between single GPU, `DataParallel`, and `DistributedDataParallel`, but accuracy and NDCG change considerably. They change for both `DataParallel` and `DistributedDataParallel`, assuming that in each case every GPU receives the same batch size. This means each GPU processes the same batch size on each step, so when more than 1 GPU is used, the effective batch size processed at once is larger. But it does not translate into the same `learning_rate` producing the same accuracy/NDCG/recall for both `DataParallel` and `DistributedDataParallel`, and I believe the reason lies in the different approaches these modes take to aggregation after processing a batch.
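For readers less familiar with the two modes, a hedged sketch of how they are wrapped in PyTorch (the model and sizes are placeholders): `nn.DataParallel` runs in a single process, scatters each input batch across GPUs, and gathers outputs and gradients back on the default device, while `DistributedDataParallel` runs one process per GPU and averages gradients across processes with an all-reduce, which is the aggregation difference referred to above.

```python
import torch
import torch.nn as nn

# Placeholder model purely to show how the two wrappers are applied.
model = nn.Linear(128, 10)

# DataParallel: single process. Each forward pass splits the input batch across
# all visible GPUs and gathers outputs (and reduces gradients) on the default
# device. Falls back to plain single-device execution if no GPU is available.
dp_model = nn.DataParallel(model)

# DistributedDataParallel: one process per GPU, typically launched with e.g.
#   torchrun --nproc_per_node=2 train.py
# Each process loads its own per-GPU batch and gradients are averaged across
# processes with an all-reduce after backward. Shown commented out because it
# needs an initialized process group:
#
# import os
# import torch.distributed as dist
# dist.init_process_group(backend="nccl")
# local_rank = int(os.environ["LOCAL_RANK"])
# ddp_model = nn.parallel.DistributedDataParallel(
#     model.cuda(local_rank), device_ids=[local_rank]
# )
```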
I did play around with the `learning_rate` for a couple of models and I could find learning rates that gave the same metrics for each of the modes. I did not record those learning rates back then, but if you would like, I can retry and come up with some examples of learning rates that produce similar metrics for each of the training modes (single GPU, `DataParallel`, and `DistributedDataParallel`).
@nzarif I'd say adding the LR is important, and if you used WandB they should all be registered there in your experiments, so you can get them from there. In addition, you need to write down which dataset you used when preparing the table; it is missing. If you generated the data synthetically, you can add the generation script to the README as well; if you used the `yoochoose` dataset, please mention it.
CI Results: GitHub pull request #510 of commit e943edd2ff80c338fe63a2e17c3539c31d186289, no merge conflicts. Jenkins unit test run started (Python 3.8.10, pytest 7.1.3).
@rnyak @gabrielspmoreira Is this PR still relevant or has it been superseded by other work? Should we push it forward toward merging or close it as outdated?
Goals ⚽