Add multi-gpu training example for T4Rec PyTorch #521

Merged on Nov 9, 2022 (5 commits)

@bbozkaya bbozkaya commented Nov 8, 2022

Fixes #508
This notebook example showcases DistributedDataParallel functionality with multi-GPU (2 GPUs) training and evaluation.

@review-notebook-app

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@nvidia-merlin-bot

Click to view CI Results
GitHub pull request #521 of commit 93c5bb5707a18286b184ee9a47b8487b71bbaba8, no merge conflicts.
Running as SYSTEM
Setting status of 93c5bb5707a18286b184ee9a47b8487b71bbaba8 to PENDING with url http://merlin-infra1.nvidia.com:8080/job/transformers4rec_tests/248/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/transformers4rec_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git init /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/pull/521/*:refs/remotes/origin/pr/521/* # timeout=10
 > git rev-parse 93c5bb5707a18286b184ee9a47b8487b71bbaba8^{commit} # timeout=10
Checking out Revision 93c5bb5707a18286b184ee9a47b8487b71bbaba8 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 93c5bb5707a18286b184ee9a47b8487b71bbaba8 # timeout=10
Commit message: "Add multi-gpu training example for T4Rec PyTorch"
 > git rev-list --no-walk 2118ed166b624d8511c269add03cb0ef9e8260a1 # timeout=10
First time build. Skipping changelog.
[transformers4rec_tests] $ /bin/bash /tmp/jenkins1440288146966703595.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec
plugins: anyio-3.6.1, xdist-3.0.2, cov-4.0.0
collected 1 item

tests/unit/test_notebooks.py . [100%]

============================== 1 passed in 37.77s ==============================
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=2 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/Transformers4Rec/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[transformers4rec_tests] $ /bin/bash /tmp/jenkins9632338189891584177.sh

@@ -0,0 +1,318 @@
{

In the following sentence, it may be worth adding that, in the context of RecSys, most of the parameters sit in the embedding tables:

  • Model Parallel: If the model is too large to fit on a single GPU, the parameters are distributed over multiple GPUs

@@ -0,0 +1,318 @@
{

  • specifying multiple GPUs

I believe that when working with DistributedDataParallel, the CUDA_VISIBLE_DEVICES env variable is not really considered by torch.distributed.launch, as the launcher spawns one process per GPU (each process receives its GPU index via the --local_rank arg). The number of GPUs is defined by the --nproc_per_node argument.

You say "using different batches of the data in a model-parallel fashion", but that is in fact a data-parallel fashion.

  • data repartitioning:

You could link here to this doc that explains how parquet data can be partitioned when saving

  • I think we need to provide instructions on how to download the data required to run this example (maybe pointing to the example notebook that explains that)
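
As a rough illustration of the --local_rank point above, here is a minimal sketch of how a training script might work out which GPU its process should use. It assumes the launcher passes `--local_rank=<n>` on the command line, and it falls back to a `LOCAL_RANK` environment variable (which newer PyTorch launchers set). The function name `resolve_local_rank` is hypothetical and is not part of Transformers4Rec or PyTorch.

```python
import argparse
import os


def resolve_local_rank(argv=None, env=None):
    """Return this process's local rank, or 0 when not launched in distributed mode.

    Assumption: the launcher either passes --local_rank=<n> on the command
    line or sets LOCAL_RANK in the environment.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument(
        "--local_rank", type=int, default=int(env.get("LOCAL_RANK", 0))
    )
    # parse_known_args ignores the script's other arguments.
    args, _unknown = parser.parse_known_args(argv)
    return args.local_rank
```

With a launch such as `python -m torch.distributed.launch --nproc_per_node 2 train.py` (the script name `train.py` is a placeholder), each of the two processes would see a different local rank, and the script would typically select the device `cuda:<local_rank>`.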

@nvidia-merlin-bot

Click to view CI Results
GitHub pull request #521 of commit b7d496a1633fafc5bfef91a307b4c9117230e328, no merge conflicts.
Running as SYSTEM
Setting status of b7d496a1633fafc5bfef91a307b4c9117230e328 to PENDING with url http://merlin-infra1.nvidia.com:8080/job/transformers4rec_tests/250/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/transformers4rec_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git init /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/pull/521/*:refs/remotes/origin/pr/521/* # timeout=10
 > git rev-parse b7d496a1633fafc5bfef91a307b4c9117230e328^{commit} # timeout=10
Checking out Revision b7d496a1633fafc5bfef91a307b4c9117230e328 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f b7d496a1633fafc5bfef91a307b4c9117230e328 # timeout=10
Commit message: "Updated notebook text."
 > git rev-list --no-walk 2ac105763e603045679a4aa91e596c70d2ab01f0 # timeout=10
[transformers4rec_tests] $ /bin/bash /tmp/jenkins16900856128540027596.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec
plugins: anyio-3.6.1, xdist-3.0.2, cov-4.0.0
collected 1 item

tests/unit/test_notebooks.py . [100%]

============================== 1 passed in 37.01s ==============================
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=2 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/Transformers4Rec/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[transformers4rec_tests] $ /bin/bash /tmp/jenkins2660189828768308187.sh

@nvidia-merlin-bot

Click to view CI Results
GitHub pull request #521 of commit 987ddf0f833e1af111ec91e2ee4828006133476e, no merge conflicts.
Running as SYSTEM
Setting status of 987ddf0f833e1af111ec91e2ee4828006133476e to PENDING with url http://merlin-infra1.nvidia.com:8080/job/transformers4rec_tests/251/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/transformers4rec_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git init /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/pull/521/*:refs/remotes/origin/pr/521/* # timeout=10
 > git rev-parse 987ddf0f833e1af111ec91e2ee4828006133476e^{commit} # timeout=10
Checking out Revision 987ddf0f833e1af111ec91e2ee4828006133476e (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 987ddf0f833e1af111ec91e2ee4828006133476e # timeout=10
Commit message: "Fixed notebook text."
 > git rev-list --no-walk b7d496a1633fafc5bfef91a307b4c9117230e328 # timeout=10
[transformers4rec_tests] $ /bin/bash /tmp/jenkins7289922049230587541.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec
plugins: anyio-3.6.1, xdist-3.0.2, cov-4.0.0
collected 1 item

tests/unit/test_notebooks.py . [100%]

============================== 1 passed in 37.81s ==============================
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=2 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/Transformers4Rec/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[transformers4rec_tests] $ /bin/bash /tmp/jenkins7972687507251577647.sh

@github-actions

github-actions bot commented Nov 8, 2022

@@ -0,0 +1,321 @@
{

I think we could replace the last sentence of this paragraph by saying that data parallel is useful when you want to speed up training/evaluation by leveraging multiple GPUs in parallel (typically the data won't fit into GPU memory, which is why models are trained on batches):

  • Data Parallel: Every GPU has a copy of all model parameters and runs the forward/backward pass for its batch. This is used when the model can fit in one GPU memory, but the dataset is too large and must be distributed over multiple GPUs.
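
To make the data-parallel description concrete, here is a toy sketch in plain Python, with no GPUs or PyTorch involved. Each "replica" holds the same parameters, computes the gradient on its own slice of the batch, and the per-replica gradients are then averaged, which stands in for the all-reduce step that DistributedDataParallel performs. The linear model, the function names, and the assumption that the batch splits evenly across replicas are all illustrative choices, not taken from the notebook.

```python
def grad_mse(w, b, xs, ys):
    """Gradient of the mean squared error of y ~ w*x + b over one batch."""
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db


def data_parallel_grad(w, b, xs, ys, num_replicas=2):
    """Average per-replica gradients computed on equal slices of the batch.

    Assumption: len(xs) is divisible by num_replicas, so every replica
    sees the same number of samples.
    """
    shard = len(xs) // num_replicas
    grads = []
    for rank in range(num_replicas):
        part = slice(rank * shard, (rank + 1) * shard)
        grads.append(grad_mse(w, b, xs[part], ys[part]))
    # Averaging stands in for the all-reduce across GPUs.
    dw = sum(g[0] for g in grads) / num_replicas
    db = sum(g[1] for g in grads) / num_replicas
    return dw, db
```

With equal-sized slices, the averaged gradient matches the gradient computed on the whole batch at once, so the replicas stay in sync while each one processes only part of the data.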


@nvidia-merlin-bot

Click to view CI Results
GitHub pull request #521 of commit 7890438793c24c76a1e6901477ee7d7ebd18dbcf, no merge conflicts.
Running as SYSTEM
Setting status of 7890438793c24c76a1e6901477ee7d7ebd18dbcf to PENDING with url http://merlin-infra1.nvidia.com:8080/job/transformers4rec_tests/252/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/transformers4rec_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git init /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/pull/521/*:refs/remotes/origin/pr/521/* # timeout=10
 > git rev-parse 7890438793c24c76a1e6901477ee7d7ebd18dbcf^{commit} # timeout=10
Checking out Revision 7890438793c24c76a1e6901477ee7d7ebd18dbcf (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 7890438793c24c76a1e6901477ee7d7ebd18dbcf # timeout=10
Commit message: "Merge branch 'main' into multi_gpu3"
 > git rev-list --no-walk 987ddf0f833e1af111ec91e2ee4828006133476e # timeout=10
[transformers4rec_tests] $ /bin/bash /tmp/jenkins4676104782042372006.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec
plugins: anyio-3.6.1, xdist-3.0.2, cov-4.0.0
collected 1 item

tests/unit/test_notebooks.py . [100%]

============================== 1 passed in 37.65s ==============================
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=2 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/Transformers4Rec/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[transformers4rec_tests] $ /bin/bash /tmp/jenkins9177471191383896204.sh

@nvidia-merlin-bot

Click to view CI Results
GitHub pull request #521 of commit 095208557c6db3962eb2eb3b84445bb174ed248b, no merge conflicts.
Running as SYSTEM
Setting status of 095208557c6db3962eb2eb3b84445bb174ed248b to PENDING with url http://merlin-infra1.nvidia.com:8080/job/transformers4rec_tests/253/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/transformers4rec_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git init /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/pull/521/*:refs/remotes/origin/pr/521/* # timeout=10
 > git rev-parse 095208557c6db3962eb2eb3b84445bb174ed248b^{commit} # timeout=10
Checking out Revision 095208557c6db3962eb2eb3b84445bb174ed248b (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 095208557c6db3962eb2eb3b84445bb174ed248b # timeout=10
Commit message: "Update nb wrt Gabriel's comments"
 > git rev-list --no-walk 7890438793c24c76a1e6901477ee7d7ebd18dbcf # timeout=10
[transformers4rec_tests] $ /bin/bash /tmp/jenkins689654318164210636.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec
plugins: anyio-3.6.1, xdist-3.0.2, cov-4.0.0
collected 1 item

tests/unit/test_notebooks.py . [100%]

============================== 1 passed in 37.11s ==============================
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=2 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/Transformers4Rec/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[transformers4rec_tests] $ /bin/bash /tmp/jenkins10922989830204022273.sh

@bbozkaya bbozkaya merged commit 3b17f6c into main Nov 9, 2022
@bbozkaya bbozkaya deleted the multi_gpu3 branch November 9, 2022 00:59
@nvidia-merlin-bot

Click to view CI Results
GitHub pull request #521 of commit 095208557c6db3962eb2eb3b84445bb174ed248b, has merge conflicts.
Running as SYSTEM
PR has already been merged, builds using the merged sha1 will fail!!!
Setting status of 095208557c6db3962eb2eb3b84445bb174ed248b to PENDING with url http://merlin-infra1.nvidia.com:8080/job/transformers4rec_tests/270/ and message: 'Build started for original commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/transformers4rec_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git init /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Transformers4Rec.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Transformers4Rec.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Transformers4Rec.git +refs/pull/521/*:refs/remotes/origin/pr/521/* # timeout=10
 > git rev-parse 095208557c6db3962eb2eb3b84445bb174ed248b^{commit} # timeout=10
Checking out Revision 095208557c6db3962eb2eb3b84445bb174ed248b (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 095208557c6db3962eb2eb3b84445bb174ed248b # timeout=10
Commit message: "Update nb wrt Gabriel's comments"
 > git rev-list --no-walk 8846e74299a854c209bc0fdd36d8b9acb9a3d4da # timeout=10
First time build. Skipping changelog.
[transformers4rec_tests] $ /bin/bash /tmp/jenkins17456291754993862696.sh
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting git+https://github.com/NVIDIA-Merlin/NVTabular.git
  Cloning https://github.com/NVIDIA-Merlin/NVTabular.git to /tmp/pip-req-build-a05zhqjs
  Running command git clone --filter=blob:none --quiet https://github.com/NVIDIA-Merlin/NVTabular.git /tmp/pip-req-build-a05zhqjs
  Resolved https://github.com/NVIDIA-Merlin/NVTabular.git to commit ba4c14159a8e858c8998d4158a4376e65a8fa266
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Requirement already satisfied: merlin-dataloader>=0.0.2 in /usr/local/lib/python3.8/dist-packages (from nvtabular==1.6.0+4.gba4c1415) (0.0.2)
Requirement already satisfied: scipy in /usr/local/lib/python3.8/dist-packages (from nvtabular==1.6.0+4.gba4c1415) (1.8.1)
Requirement already satisfied: merlin-core>=0.2.0 in /usr/local/lib/python3.8/dist-packages (from nvtabular==1.6.0+4.gba4c1415) (0.6.0+1.g5926fcf)
Requirement already satisfied: protobuf>=3.0.0 in /usr/local/lib/python3.8/dist-packages (from merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (3.19.5)
Requirement already satisfied: numba>=0.54 in /usr/local/lib/python3.8/dist-packages (from merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (0.56.2)
Requirement already satisfied: pyarrow>=5.0.0 in /usr/local/lib/python3.8/dist-packages (from merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (7.0.0)
Requirement already satisfied: distributed>=2022.3.0 in /usr/local/lib/python3.8/dist-packages (from merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (2022.5.1)
Requirement already satisfied: pandas<1.4.0dev0,>=1.2.0 in /usr/local/lib/python3.8/dist-packages (from merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (1.3.5)
Requirement already satisfied: dask>=2022.3.0 in /usr/local/lib/python3.8/dist-packages (from merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (2022.5.1)
Requirement already satisfied: tqdm>=4.0 in /usr/local/lib/python3.8/dist-packages (from merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (4.64.1)
Requirement already satisfied: betterproto<2.0.0 in /usr/local/lib/python3.8/dist-packages (from merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (1.2.5)
Requirement already satisfied: tensorflow-metadata>=1.2.0 in /usr/local/lib/python3.8/dist-packages (from merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (1.10.0)
Requirement already satisfied: packaging in /usr/local/lib/python3.8/dist-packages (from merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (21.3)
Requirement already satisfied: fsspec==2022.5.0 in /usr/local/lib/python3.8/dist-packages (from merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (2022.5.0)
Requirement already satisfied: numpy<1.25.0,>=1.17.3 in /usr/local/lib/python3.8/dist-packages (from scipy->nvtabular==1.6.0+4.gba4c1415) (1.22.4)
Requirement already satisfied: grpclib in /usr/local/lib/python3.8/dist-packages (from betterproto<2.0.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (0.4.3)
Requirement already satisfied: stringcase in /usr/local/lib/python3.8/dist-packages (from betterproto<2.0.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (1.2.0)
Requirement already satisfied: pyyaml>=5.3.1 in /usr/local/lib/python3.8/dist-packages (from dask>=2022.3.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (5.4.1)
Requirement already satisfied: toolz>=0.8.2 in /usr/local/lib/python3.8/dist-packages (from dask>=2022.3.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (0.12.0)
Requirement already satisfied: cloudpickle>=1.1.1 in /usr/local/lib/python3.8/dist-packages (from dask>=2022.3.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (2.2.0)
Requirement already satisfied: partd>=0.3.10 in /usr/local/lib/python3.8/dist-packages (from dask>=2022.3.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (1.3.0)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.8/dist-packages (from distributed>=2022.3.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (1.26.12)
Requirement already satisfied: tornado>=6.0.3 in /usr/local/lib/python3.8/dist-packages (from distributed>=2022.3.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (6.2)
Requirement already satisfied: zict>=0.1.3 in /usr/local/lib/python3.8/dist-packages (from distributed>=2022.3.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (2.2.0)
Requirement already satisfied: sortedcontainers!=2.0.0,!=2.0.1 in /usr/local/lib/python3.8/dist-packages (from distributed>=2022.3.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (2.4.0)
Requirement already satisfied: click>=6.6 in /usr/local/lib/python3.8/dist-packages (from distributed>=2022.3.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (8.1.3)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.8/dist-packages (from distributed>=2022.3.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (3.1.2)
Requirement already satisfied: psutil>=5.0 in /usr/local/lib/python3.8/dist-packages (from distributed>=2022.3.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (5.9.2)
Requirement already satisfied: tblib>=1.6.0 in /usr/local/lib/python3.8/dist-packages (from distributed>=2022.3.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (1.7.0)
Requirement already satisfied: locket>=1.0.0 in /usr/local/lib/python3.8/dist-packages (from distributed>=2022.3.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (1.0.0)
Requirement already satisfied: msgpack>=0.6.0 in /usr/local/lib/python3.8/dist-packages (from distributed>=2022.3.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (1.0.4)
Requirement already satisfied: setuptools<60 in /usr/lib/python3/dist-packages (from numba>=0.54->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (45.2.0)
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.8/dist-packages (from numba>=0.54->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (4.12.0)
Requirement already satisfied: llvmlite<0.40,>=0.39.0dev0 in /usr/local/lib/python3.8/dist-packages (from numba>=0.54->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (0.39.1)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.8/dist-packages (from packaging->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.8/dist-packages (from pandas<1.4.0dev0,>=1.2.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.8/dist-packages (from pandas<1.4.0dev0,>=1.2.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (2022.2.1)
Requirement already satisfied: absl-py<2.0.0,>=0.9 in /usr/local/lib/python3.8/dist-packages (from tensorflow-metadata>=1.2.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (1.2.0)
Requirement already satisfied: googleapis-common-protos<2,>=1.52.0 in /usr/local/lib/python3.8/dist-packages (from tensorflow-metadata>=1.2.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (1.52.0)
Requirement already satisfied: six>=1.5 in /var/jenkins_home/.local/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas<1.4.0dev0,>=1.2.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (1.15.0)
Requirement already satisfied: heapdict in /usr/local/lib/python3.8/dist-packages (from zict>=0.1.3->distributed>=2022.3.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (1.0.1)
Requirement already satisfied: h2<5,>=3.1.0 in /usr/local/lib/python3.8/dist-packages (from grpclib->betterproto<2.0.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (4.1.0)
Requirement already satisfied: multidict in /usr/local/lib/python3.8/dist-packages (from grpclib->betterproto<2.0.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (6.0.2)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.8/dist-packages (from importlib-metadata->numba>=0.54->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (3.8.1)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.8/dist-packages (from jinja2->distributed>=2022.3.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (2.1.1)
Requirement already satisfied: hpack<5,>=4.0 in /usr/local/lib/python3.8/dist-packages (from h2<5,>=3.1.0->grpclib->betterproto<2.0.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (4.0.0)
Requirement already satisfied: hyperframe<7,>=6.0 in /usr/local/lib/python3.8/dist-packages (from h2<5,>=3.1.0->grpclib->betterproto<2.0.0->merlin-core>=0.2.0->nvtabular==1.6.0+4.gba4c1415) (6.0.1)
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/transformers4rec_tests/transformers4rec
plugins: anyio-3.6.1, xdist-3.0.2, cov-4.0.0
collected 1 item

tests/unit/test_notebooks.py F [100%]

=================================== FAILURES ===================================
_________________________________ test_session _________________________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-8/test_session0')

@pytest.mark.skipif(importlib.util.find_spec("cudf") is None, reason="needs cudf")
def test_session(tmpdir):
    BASE_PATH = os.path.join(dirname(TEST_PATH), SESSION_PATH)
    os.environ["INPUT_DATA_DIR"] = "/tmp/data/"
    # Run ETL
    nb_path = os.path.join(BASE_PATH, "01-ETL-with-NVTabular.ipynb")
    _run_notebook(tmpdir, nb_path)

    # Run session based
    torch = importlib.util.find_spec("torch")
    if torch is not None:
        os.environ["INPUT_SCHEMA_PATH"] = BASE_PATH + "schema.pb"
        nb_path = os.path.join(BASE_PATH, "02-session-based-XLNet-with-PyT.ipynb")
        _run_notebook(tmpdir, nb_path)

tests/unit/test_notebooks.py:44:


tests/unit/test_notebooks.py:66: in _run_notebook
subprocess.check_output([sys.executable, script_path])
/usr/lib/python3.8/subprocess.py:415: in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,


input = None, capture_output = False, timeout = None, check = True
popenargs = (['/usr/bin/python', '/tmp/pytest-of-jenkins/pytest-8/test_session0/notebook.py'],)
kwargs = {'stdout': -1}, process = <subprocess.Popen object at 0x7fecdd926430>
stdout = b"['/tmp/data//sessions_by_day/1/train.parquet']\n********************\nLaunch training for day 1 are:\n********************\n\n"
stderr = None, retcode = 1

def run(*popenargs,
        input=None, capture_output=False, timeout=None, check=False, **kwargs):
    """Run command with arguments and return a CompletedProcess instance.

    The returned instance will have attributes args, returncode, stdout and
    stderr. By default, stdout and stderr are not captured, and those attributes
    will be None. Pass stdout=PIPE and/or stderr=PIPE in order to capture them.

    If check is True and the exit code was non-zero, it raises a
    CalledProcessError. The CalledProcessError object will have the return code
    in the returncode attribute, and output & stderr attributes if those streams
    were captured.

    If timeout is given, and the process takes too long, a TimeoutExpired
    exception will be raised.

    There is an optional argument "input", allowing you to
    pass bytes or a string to the subprocess's stdin.  If you use this argument
    you may not also use the Popen constructor's "stdin" argument, as
    it will be used internally.

    By default, all communication is in bytes, and therefore any "input" should
    be bytes, and the stdout and stderr will be bytes. If in text mode, any
    "input" should be a string, and stdout and stderr will be strings decoded
    according to locale encoding, or by "encoding" if set. Text mode is
    triggered by setting any of text, encoding, errors or universal_newlines.

    The other arguments are the same as for the Popen constructor.
    """
    if input is not None:
        if kwargs.get('stdin') is not None:
            raise ValueError('stdin and input arguments may not both be used.')
        kwargs['stdin'] = PIPE

    if capture_output:
        if kwargs.get('stdout') is not None or kwargs.get('stderr') is not None:
            raise ValueError('stdout and stderr arguments may not be used '
                             'with capture_output.')
        kwargs['stdout'] = PIPE
        kwargs['stderr'] = PIPE

    with Popen(*popenargs, **kwargs) as process:
        try:
            stdout, stderr = process.communicate(input, timeout=timeout)
        except TimeoutExpired as exc:
            process.kill()
            if _mswindows:
                # Windows accumulates the output in a single blocking
                # read() call run on child threads, with the timeout
                # being done in a join() on those threads.  communicate()
                # _after_ kill() is required to collect that and add it
                # to the exception.
                exc.stdout, exc.stderr = process.communicate()
            else:
                # POSIX _communicate already populated the output so
                # far into the TimeoutExpired exception.
                process.wait()
            raise
        except:  # Including KeyboardInterrupt, communicate handled that.
            process.kill()
            # We don't call process.wait() as .__exit__ does that for us.
            raise
        retcode = process.poll()
        if check and retcode:
>           raise CalledProcessError(retcode, process.args,
                                     output=stdout, stderr=stderr)

E subprocess.CalledProcessError: Command '['/usr/bin/python', '/tmp/pytest-of-jenkins/pytest-8/test_session0/notebook.py']' returned non-zero exit status 1.

/usr/lib/python3.8/subprocess.py:516: CalledProcessError
----------------------------- Captured stderr call -----------------------------
/usr/local/lib/python3.8/dist-packages/merlin/schema/tags.py:148: UserWarning: Compound tags like Tags.ITEM_ID have been deprecated and will be removed in a future version. Please use the atomic versions of these tags, like [<Tags.ITEM: 'item'>, <Tags.ID: 'id'>].
warnings.warn(
/usr/local/lib/python3.8/dist-packages/merlin/schema/tags.py:148: UserWarning: Compound tags like Tags.ITEM_ID have been deprecated and will be removed in a future version. Please use the atomic versions of these tags, like [<Tags.ITEM: 'item'>, <Tags.ID: 'id'>].
warnings.warn(
/usr/local/lib/python3.8/dist-packages/merlin/schema/tags.py:148: UserWarning: Compound tags like Tags.ITEM_ID have been deprecated and will be removed in a future version. Please use the atomic versions of these tags, like [<Tags.ITEM: 'item'>, <Tags.ID: 'id'>].
warnings.warn(

Creating time-based splits: 100%|██████████| 9/9 [00:01<00:00, 7.01it/s]
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.12) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
***** Running training *****
Num examples = 1664
Num Epochs = 5
Instantaneous batch size per device = 128
Total train batch size (w. parallel, distributed & accumulation) = 128
Gradient Accumulation steps = 1
Total optimization steps = 65

0%| | 0/65 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/tmp/pytest-of-jenkins/pytest-8/test_session0/notebook.py", line 202, in <module>
    trainer.train()
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1316, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1849, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1881, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/var/jenkins_home/workspace/transformers4rec_tests/transformers4rec/transformers4rec/torch/trainer.py", line 830, in forward
    return self.wrapper_module(inputs, *args)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/var/jenkins_home/workspace/transformers4rec_tests/transformers4rec/transformers4rec/torch/model/base.py", line 553, in forward
    head(inputs, call_body=True, training=training, always_output_dict=True, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/var/jenkins_home/workspace/transformers4rec_tests/transformers4rec/transformers4rec/torch/model/base.py", line 398, in forward
    body_outputs = self.body(body_outputs, training=training, ignore_masking=ignore_masking)
  File "/var/jenkins_home/workspace/transformers4rec_tests/transformers4rec/transformers4rec/config/schema.py", line 50, in __call__
    return super().__call__(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/var/jenkins_home/workspace/transformers4rec_tests/transformers4rec/transformers4rec/torch/block/base.py", line 152, in forward
    input = module(input, training=training, ignore_masking=ignore_masking)
  File "/var/jenkins_home/workspace/transformers4rec_tests/transformers4rec/transformers4rec/config/schema.py", line 50, in __call__
    return super().__call__(*args, **kwargs)
  File "/var/jenkins_home/workspace/transformers4rec_tests/transformers4rec/transformers4rec/torch/tabular/base.py", line 390, in __call__
    outputs = super().__call__(inputs, *args, **kwargs) # noqa
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/var/jenkins_home/workspace/transformers4rec_tests/transformers4rec/transformers4rec/torch/features/sequence.py", line 251, in forward
    outputs = super(TabularSequenceFeatures, self).forward(inputs)
  File "/var/jenkins_home/workspace/transformers4rec_tests/transformers4rec/transformers4rec/torch/tabular/base.py", line 602, in forward
    outputs.update(layer(inputs))
  File "/var/jenkins_home/workspace/transformers4rec_tests/transformers4rec/transformers4rec/config/schema.py", line 50, in __call__
    return super().__call__(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/var/jenkins_home/workspace/transformers4rec_tests/transformers4rec/transformers4rec/torch/block/base.py", line 152, in forward
    input = module(input, training=training, ignore_masking=ignore_masking)
  File "/var/jenkins_home/workspace/transformers4rec_tests/transformers4rec/transformers4rec/config/schema.py", line 50, in __call__
    return super().__call__(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/var/jenkins_home/workspace/transformers4rec_tests/transformers4rec/transformers4rec/torch/block/base.py", line 148, in forward
    input = module(input, **filtered_kwargs)
  File "/var/jenkins_home/workspace/transformers4rec_tests/transformers4rec/transformers4rec/config/schema.py", line 50, in __call__
    return super().__call__(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/var/jenkins_home/workspace/transformers4rec_tests/transformers4rec/transformers4rec/torch/block/base.py", line 156, in forward
    input = module(input)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: expected scalar type Float but found Double

0%| | 0/65 [00:00<?, ?it/s]
============================== 1 failed in 22.04s ==============================
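
The traceback above ends in `F.linear` with `RuntimeError: expected scalar type Float but found Double`, which is what PyTorch typically raises when a float64 tensor reaches a layer whose weights are float32. The log does not show which input feature is float64, so the following is only a minimal sketch of the failure mode and one possible fix. The layer sizes and the `features` tensor are made-up stand-ins, not taken from the notebook.

```python
import torch

# Hypothetical stand-ins: a float32 projection layer and a float64 feature batch.
layer = torch.nn.Linear(4, 2)                     # parameters default to float32
features = torch.rand(3, 4, dtype=torch.float64)  # e.g. values loaded as float64

try:
    layer(features)  # input dtype does not match the weight dtype
    mismatch_error = None
except RuntimeError as exc:
    # The exact message wording differs between PyTorch versions.
    mismatch_error = str(exc)

# One possible fix: cast the input to the layer's parameter dtype before the call.
output = layer(features.to(layer.weight.dtype))
```

Casting at the data-loading or preprocessing stage would have the same effect; which option fits the notebook depends on where the float64 column originates.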
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=2 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/Transformers4Rec/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[transformers4rec_tests] $ /bin/bash /tmp/jenkins1807308667872591452.sh
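
In the failure locals above, `kwargs = {'stdout': -1}` and `stderr = None`: `subprocess.check_output` pipes only stdout, so the child's traceback is absent from the `CalledProcessError` and appears only in pytest's captured-stderr section. A rough sketch of a helper that captures both streams and attaches stderr to the failure message follows. `run_script` is a hypothetical name for illustration, not the `_run_notebook` helper in `tests/unit/test_notebooks.py`.

```python
import subprocess
import sys


def run_script(script_path):
    """Run a Python script; return its stdout, or raise with its stderr attached."""
    result = subprocess.run(
        [sys.executable, script_path],
        capture_output=True,  # pipe both stdout and stderr
        text=True,            # decode the streams to str
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"{script_path} exited with status {result.returncode}:\n{result.stderr}"
        )
    return result.stdout
```

With this shape, the root cause (here, the dtype `RuntimeError`) would sit directly in the exception text rather than further down the log.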

Successfully merging this pull request may close these issues.

[Task] Add multi-GPU example for Transformer4Rec PyTorch
5 participants