From 93c5bb5707a18286b184ee9a47b8487b71bbaba8 Mon Sep 17 00:00:00 2001 From: root Date: Tue, 8 Nov 2022 19:23:13 +0000 Subject: [PATCH 1/4] Add multi-gpu training example for T4Rec PyTorch --- ...ased-Yoochoose-multigpu-training-PyT.ipynb | 318 ++++++++++++++++++ 1 file changed, 318 insertions(+) create mode 100644 examples/end-to-end-session-based/03-Session-based-Yoochoose-multigpu-training-PyT.ipynb diff --git a/examples/end-to-end-session-based/03-Session-based-Yoochoose-multigpu-training-PyT.ipynb b/examples/end-to-end-session-based/03-Session-based-Yoochoose-multigpu-training-PyT.ipynb new file mode 100644 index 0000000000..eb9c7c2e6d --- /dev/null +++ b/examples/end-to-end-session-based/03-Session-based-Yoochoose-multigpu-training-PyT.ipynb @@ -0,0 +1,318 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "id": "0fb18ebc", + "metadata": {}, + "outputs": [], + "source": [ + "# Copyright 2022 NVIDIA Corporation. All Rights Reserved.\n", + "#\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# http://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License.\n", + "# ==============================================================================" + ] + }, + { + "cell_type": "markdown", + "id": "378ff149", + "metadata": { + "tags": [] + }, + "source": [ + "\n", + "\n", + "# Multi-GPU training for session-based recommendations with PyTorch" + ] + }, + { + "cell_type": "markdown", + "id": "d8642b00", + "metadata": {}, + "source": [ + "In the previous two notebooks, we have first created sequential features and saved our processed data frames as parquet files. Then we used these processed parquet files in training a session-based recommendation model with XLNet architecture on a single GPU. We will now expand this exercise to perform a multi-GPU training with the same dataset using PyTorch.\n", + "\n", + "There are multiple ways to scale a training pipeline to multiple GPUs: \n", + "\n", + "- Model Parallel: If the model is too large to fit on a single GPU, the parameters are distributed over multiple GPUs\n", + "- Data Parallel: Every GPU has a copy of all model parameters and runs the forward/backward pass for its batch.\n", + "\n", + "In this example, we demonstrate how to scale a training pipeline to multi-GPU, single node. The goal is to maximize throughput and reduce training time. In that way, models can be trained more frequently and researches can run more experiments in a shorter time duration. \n", + "\n", + "This is equivalent to training with a larger batch-size. As we are using more GPUs, we have more computational resources and can achieve higher throughput. In data parallel training, it is often required that all model parameters fit into a single GPU. Every worker (each GPU) has a copy of the model parameters and runs the forward pass on their local batch. 
The workers synchronize the gradients with each other, which can introduce an overhead.\n", + "\n", + "Learning objectives \n", + "- Scaling training pipeline to multiple GPUs" + ] + }, + { + "cell_type": "markdown", + "id": "34bc0fdb", + "metadata": {}, + "source": [ + "### 1. Creating a multi-gpu training python script " + ] + }, + { + "cell_type": "markdown", + "id": "78d52653", + "metadata": {}, + "source": [ + "In this example, we will be using PyTorch, which expects all code including importing libraries, loading data, building the model and training it, to be in a single python script (.py file). We will then spawn torch.distributed.launch, which will distribute the training and evaluation tasks to 2 GPUs with the default DistributedDataParallel configuration.\n", + "\n", + "The following cell exports all related code to a pyt_trainer.py file to be created in the same working directory as this notebook. The code is structured as follows:\n", + "\n", + "- importing required libraries\n", + "- specifiying and processing command line arguments\n", + "- specifying the schema file to load and filtering the features that will be used in training\n", + "- defining the input module\n", + "- specifying the prediction task\n", + "- defining the XLNet Transformer architecture configuration\n", + "- defining the model object and its training arguments\n", + "- creating a trainer object and running the training loop (over multiple days) with the trainer object\n", + "\n", + "All of these steps were covered in the previous two notebooks; feel free to visit the two notebooks to refresh the concepts for each step of the script.\n" + ] + }, + { + "cell_type": "markdown", + "id": "ff36ee4d-6b8a-4c03-b99c-7182c6894fbe", + "metadata": {}, + "source": [ + "Please note these important points that are relevant to multi-gpu training:\n", + "- specifying multiple GPUs: we added the setting CUDA_VISIBLE_DEVICES to \"0,1\" at the top of our script to indicate that we will be using 2 GPUs. PyTorch distributed launch environment will recognize this and perform the training loop on multiple GPUs (2 in this case) using different batches of the data in a model-parallel fashion.\n", + "- data repartitioning: when training on multiple GPUs, data must be re-partitioned into >1 partitions where the number of partitions must be at least equal to the number of GPUs. The torch utility library in Transformers4Rec does this automatically and outputs a UserWarning message. If you would like to avoid this warning message, you may choose to manually re-partition your data files before you launch the training loop or function.\n", + "- training and evaluation batch sizes: in the default DistributedDataParallel mode we will be running, keeping the batch size unchanged means each worker will receive the same-size batch despite the fact that you are now using multiple GPUs. If you would like to keep the total batch size constant, you may want to divide the training and evaluation batch sizes by the number of GPUs you are running on, which is expected to reduce time it takes train and evaluate on each batch." 
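As a rough illustration of the last two points above, the sketch below shows one way you might manually re-partition a day's `train.parquet` into at least as many partitions as GPUs, and how the per-device batch sizes relate to the global batch size. This helper is not part of the notebook's pipeline: it assumes `cudf` is available (as in the Merlin PyTorch container) and the `./preproc_sessions_by_day/<day>/` layout from the previous notebooks, and it writes the partitions to a hypothetical `train_parts/` subfolder, so you would still need to point the training script at wherever you choose to write them.

```python
# Hypothetical helper, not part of pyt_trainer.py: split each day's train.parquet
# into NUM_GPUS files so the dataloader sees at least as many partitions as GPUs,
# and derive per-device batch sizes that keep the global batch size constant.
import glob
import os

import cudf

NUM_GPUS = 2
GLOBAL_TRAIN_BATCH = 384   # single-GPU defaults used by the script
GLOBAL_EVAL_BATCH = 512

per_device_train = GLOBAL_TRAIN_BATCH // NUM_GPUS   # 192 per GPU
per_device_eval = GLOBAL_EVAL_BATCH // NUM_GPUS     # 256 per GPU

for path in glob.glob("./preproc_sessions_by_day/*/train.parquet"):
    df = cudf.read_parquet(path)
    out_dir = os.path.join(os.path.dirname(path), "train_parts")
    os.makedirs(out_dir, exist_ok=True)
    rows_per_part = (len(df) + NUM_GPUS - 1) // NUM_GPUS  # ceiling division
    for part in range(NUM_GPUS):
        chunk = df.iloc[part * rows_per_part:(part + 1) * rows_per_part]
        chunk.to_parquet(os.path.join(out_dir, f"part_{part}.parquet"))

print(f"per-device batch sizes: train={per_device_train}, eval={per_device_eval}")
```

In this example we simply rely on the automatic re-partitioning performed by the Transformers4Rec torch utilities and keep the default batch sizes, so the manual step above is optional.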
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6c45c899-9c88-4235-8402-03f7f3e40841", + "metadata": {}, + "outputs": [], + "source": [ + "%%writefile './pyt_trainer.py'\n", + "\n", + "import os\n", + "\n", + "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0,1\"\n", + "\n", + "import glob\n", + "import torch \n", + "\n", + "from transformers4rec import torch as tr\n", + "from transformers4rec.torch.ranking_metric import NDCGAt, AvgPrecisionAt, RecallAt\n", + "from transformers4rec.torch.utils.examples_utils import wipe_memory\n", + "\n", + "import argparse\n", + "\n", + "# define arguments that can be passed to this python script\n", + "parser = argparse.ArgumentParser(description='Hyperparameters for model training')\n", + "parser.add_argument('--local_rank', type=int, default=0)\n", + "parser.add_argument('--path', type=str, help='Directory with training and validation data')\n", + "parser.add_argument('--learning-rate', type=float, default=0.0005, help='Learning rate for training')\n", + "parser.add_argument('--per-device-train-batch-size', type=int, default=384, help='Per device batch size for training')\n", + "parser.add_argument('--per-device-eval-batch-size', type=int, default=512, help='Per device batch size for evaluation')\n", + "sh_args = parser.parse_args()\n", + "\n", + "# create the schema object by reading the schema.pbtxt file generated by NVTabular pipeline in the previous 01-ETL-with-NVTabular notebook\n", + "from merlin_standard_lib import Schema\n", + "SCHEMA_PATH = \"schema_demo.pb\"\n", + "schema = Schema().from_proto_text(SCHEMA_PATH)\n", + "\n", + "# select the subset of features we want to use for training the model by their tags or their names.\n", + "schema = schema.select_by_name(\n", + " ['item_id-list_seq', 'category-list_seq', 'product_recency_days_log_norm-list_seq', 'et_dayofweek_sin-list_seq']\n", + ")\n", + "\n", + "max_sequence_length, d_model = 20, 320\n", + "# Define input module to process tabular input-features and to prepare masked inputs\n", + "input_module = tr.TabularSequenceFeatures.from_schema(\n", + " schema,\n", + " max_sequence_length=max_sequence_length,\n", + " continuous_projection=64,\n", + " aggregation=\"concat\",\n", + " d_output=d_model,\n", + " masking=\"mlm\",\n", + ")\n", + "\n", + "# Define Next item prediction-task \n", + "prediction_task = tr.NextItemPredictionTask(hf_format=True, weight_tying=True)\n", + "\n", + "# Define the config of the XLNet Transformer architecture\n", + "transformer_config = tr.XLNetConfig.build(\n", + " d_model=d_model, n_head=8, n_layer=2, total_seq_length=max_sequence_length\n", + ")\n", + "\n", + "# Get the end-to-end model \n", + "model = transformer_config.to_torch_model(input_module, prediction_task)\n", + "\n", + "# Set training arguments \n", + "training_args = tr.trainer.T4RecTrainingArguments(\n", + " output_dir=\"./tmp\",\n", + " max_sequence_length=20,\n", + " data_loader_engine='nvtabular',\n", + " num_train_epochs=10, \n", + " dataloader_drop_last=False,\n", + " per_device_train_batch_size = sh_args.per_device_train_batch_size,\n", + " per_device_eval_batch_size = sh_args.per_device_eval_batch_size,\n", + " learning_rate=sh_args.learning_rate,\n", + " report_to = [],\n", + " logging_steps=200\n", + " )\n", + "\n", + "# Instantiate the trainer\n", + "recsys_trainer = tr.Trainer(\n", + " model=model,\n", + " args=training_args,\n", + " schema=schema,\n", + " compute_metrics=True)\n", + "\n", + "# Set input and output directories\n", + "INPUT_DIR=sh_args.path\n", + 
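+    "# INPUT_DIR and OUTPUT_DIR both point at the directory passed via --path; the\n",
+    "# day-partitioned files written by the NVTabular notebook (e.g. 178/train.parquet,\n",
+    "# 179/valid.parquet) are read from this location in the training loop below.\n",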
"OUTPUT_DIR=sh_args.path\n", + "\n", + "import time\n", + "start = time.time()\n", + "\n", + "# main loop for training\n", + "start_time_window_index = 178\n", + "final_time_window_index = 181\n", + "# Iterating over days from 178 to 181\n", + "for time_index in range(start_time_window_index, final_time_window_index):\n", + " # Set data \n", + " time_index_train = time_index\n", + " time_index_eval = time_index + 1\n", + " train_paths = glob.glob(os.path.join(OUTPUT_DIR, f\"{time_index_train}/train.parquet\"))\n", + " eval_paths = glob.glob(os.path.join(OUTPUT_DIR, f\"{time_index_eval}/valid.parquet\"))\n", + " print(train_paths)\n", + " \n", + " # Train on day related to time_index \n", + " print('*'*20)\n", + " print(\"Launch training for day %s are:\" %time_index)\n", + " print('*'*20 + '\\n')\n", + " recsys_trainer.train_dataset_or_path = train_paths\n", + " recsys_trainer.reset_lr_scheduler()\n", + " recsys_trainer.train()\n", + " recsys_trainer.state.global_step +=1\n", + " print('finished')\n", + " \n", + " # Evaluate on the following day\n", + " recsys_trainer.eval_dataset_or_path = eval_paths\n", + " train_metrics = recsys_trainer.evaluate(metric_key_prefix='eval')\n", + " print('*'*20)\n", + " print(\"Eval results for day %s are:\\t\" %time_index_eval)\n", + " print('\\n' + '*'*20 + '\\n')\n", + " for key in sorted(train_metrics.keys()):\n", + " print(\" %s = %s\" % (key, str(train_metrics[key]))) \n", + " wipe_memory()\n", + "\n", + "end = time.time()\n", + "print('Total training time:',end-start)\n" + ] + }, + { + "cell_type": "markdown", + "id": "8c8cb68e-a445-439f-a2ae-9c19f3669593", + "metadata": {}, + "source": [ + "### 2. Executing the multi-gpu training" + ] + }, + { + "cell_type": "markdown", + "id": "c57c0bcd-d9cb-4af5-9cc8-13e9783132d6", + "metadata": {}, + "source": [ + "You are now ready to execute the python script you created. Run the following shell command which will execute the script and perform data loading, model building and training using 2 GPUs.\n", + "\n", + "Note that there are four arguments you can pass to your python script, and only the first one is required:\n", + "- path: this argument specifies the directory in which to find the multi-day train and validation files to work on\n", + "- learning rate: you can experiment with different learning rates and see the effect of this hyperparameter when multiple GPUs are used in training (versus 1 GPU). Typically, increasing the learning rate (up to a certain level) as the number of GPUs is increased helps with the accuracy metrics.\n", + "- per device batch size for training: when using multiple GPUs in DistributedDataParallel mode, you may choose to reduce the batch size in order to keep the total batch size constant. This should help reduce training times.\n", + "- per device batch size for evaluation: see above" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1e5678d8-71a9-465f-a986-9ec68cf351ff", + "metadata": {}, + "outputs": [], + "source": [ + "!python -m torch.distributed.launch --nproc_per_node 2 pyt_trainer.py --path \"./preproc_sessions_by_day\" --learning-rate 0.0005" + ] + }, + { + "cell_type": "markdown", + "id": "847757ca-3c91-4504-b9dc-29127563da01", + "metadata": {}, + "source": [ + "Congratulations!!! You successfully trained your model using 2 GPUs with a data-parallel approach. If you choose, you may now go back and experiment with some of the hyperparameters (eg. 
learning rate, batch sizes, number of GPUs) to collect information on various accuracy metrics as well as total training time, to see what fits best into your workflow." + ] + }, + { + "cell_type": "markdown", + "id": "68dd532f", + "metadata": {}, + "source": [ + "## References" + ] + }, + { + "cell_type": "markdown", + "id": "30f9073e", + "metadata": {}, + "source": [ + "- Multi-GPU data-parallel training using the Trainer class https://nvidia-merlin.github.io/Transformers4Rec/main/multi_gpu_train.html\n", + "\n", + "- Merlin Transformers4rec: https://github.com/NVIDIA-Merlin/Transformers4Rec\n", + "\n", + "- Merlin NVTabular: https://github.com/NVIDIA-Merlin/NVTabular/tree/main/nvtabular" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "88a6a17a-e952-481f-8ecf-b0f9e0eb9b13", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "interpreter": { + "hash": "7b543a88d374ac88bf8df97911b380f671b13649694a5b49eb21e60fd27eb479" + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From b7d496a1633fafc5bfef91a307b4c9117230e328 Mon Sep 17 00:00:00 2001 From: root Date: Tue, 8 Nov 2022 20:58:43 +0000 Subject: [PATCH 2/4] Updated notebook text. --- ...on-based-Yoochoose-multigpu-training-PyT.ipynb | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/examples/end-to-end-session-based/03-Session-based-Yoochoose-multigpu-training-PyT.ipynb b/examples/end-to-end-session-based/03-Session-based-Yoochoose-multigpu-training-PyT.ipynb index eb9c7c2e6d..f5b95ae57d 100644 --- a/examples/end-to-end-session-based/03-Session-based-Yoochoose-multigpu-training-PyT.ipynb +++ b/examples/end-to-end-session-based/03-Session-based-Yoochoose-multigpu-training-PyT.ipynb @@ -44,15 +44,18 @@ "\n", "There are multiple ways to scale a training pipeline to multiple GPUs: \n", "\n", - "- Model Parallel: If the model is too large to fit on a single GPU, the parameters are distributed over multiple GPUs\n", - "- Data Parallel: Every GPU has a copy of all model parameters and runs the forward/backward pass for its batch.\n", + "- Model Parallel: If the model is too large to fit on a single GPU, the parameters are distributed over multiple GPUs. This is usually the case for the RecSys domain since the embedding tables can be exceptionally large and memory intensive.\n", + "- Data Parallel: Every GPU has a copy of all model parameters and runs the forward/backward pass for its batch. This is used when the model can fit in one GPU memory, but the dataset is too large and must be distributed over multiple GPUs.\n", "\n", "In this example, we demonstrate how to scale a training pipeline to multi-GPU, single node. The goal is to maximize throughput and reduce training time. In that way, models can be trained more frequently and researches can run more experiments in a shorter time duration. \n", "\n", "This is equivalent to training with a larger batch-size. As we are using more GPUs, we have more computational resources and can achieve higher throughput. In data parallel training, it is often required that all model parameters fit into a single GPU. 
Every worker (each GPU) has a copy of the model parameters and runs the forward pass on their local batch. The workers synchronize the gradients with each other, which can introduce an overhead.\n", "\n", "Learning objectives \n", - "- Scaling training pipeline to multiple GPUs" + "- Scaling training pipeline to multiple GPUs\n", + "\n", + "Prerequisities\n", + "- Run the 01-ETL-with-NVTabular.ipynb notebook first to generate the dataset and directories that are also needed by this notebook." ] }, { @@ -73,7 +76,7 @@ "The following cell exports all related code to a pyt_trainer.py file to be created in the same working directory as this notebook. The code is structured as follows:\n", "\n", "- importing required libraries\n", - "- specifiying and processing command line arguments\n", + "- specifying and processing command line arguments\n", "- specifying the schema file to load and filtering the features that will be used in training\n", "- defining the input module\n", "- specifying the prediction task\n", @@ -90,8 +93,8 @@ "metadata": {}, "source": [ "Please note these important points that are relevant to multi-gpu training:\n", - "- specifying multiple GPUs: we added the setting CUDA_VISIBLE_DEVICES to \"0,1\" at the top of our script to indicate that we will be using 2 GPUs. PyTorch distributed launch environment will recognize this and perform the training loop on multiple GPUs (2 in this case) using different batches of the data in a model-parallel fashion.\n", - "- data repartitioning: when training on multiple GPUs, data must be re-partitioned into >1 partitions where the number of partitions must be at least equal to the number of GPUs. The torch utility library in Transformers4Rec does this automatically and outputs a UserWarning message. If you would like to avoid this warning message, you may choose to manually re-partition your data files before you launch the training loop or function.\n", + "- specifying multiple GPUs: we added the setting CUDA_VISIBLE_DEVICES to \"0,1\" at the top of our script to indicate that we will be using 2 GPUs. PyTorch distributed launch environment will recognize this and perform the training loop on multiple GPUs (2 in this case) using different batches of the data in a data-parallel fashion.\n", + "- data repartitioning: when training on multiple GPUs, data must be re-partitioned into >1 partitions where the number of partitions must be at least equal to the number of GPUs. The torch utility library in Transformers4Rec does this automatically and outputs a UserWarning message. If you would like to avoid this warning message, you may choose to manually re-partition your data files before you launch the training loop or function. See [this document](https://nvidia-merlin.github.io/Transformers4Rec/main/multi_gpu_train.html#distributeddataparallel) for further information on how to do manual re-partitioning.\n", "- training and evaluation batch sizes: in the default DistributedDataParallel mode we will be running, keeping the batch size unchanged means each worker will receive the same-size batch despite the fact that you are now using multiple GPUs. If you would like to keep the total batch size constant, you may want to divide the training and evaluation batch sizes by the number of GPUs you are running on, which is expected to reduce time it takes train and evaluate on each batch." ] }, From 987ddf0f833e1af111ec91e2ee4828006133476e Mon Sep 17 00:00:00 2001 From: root Date: Tue, 8 Nov 2022 21:12:08 +0000 Subject: [PATCH 3/4] Fixed notebook text. 
--- .../03-Session-based-Yoochoose-multigpu-training-PyT.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/end-to-end-session-based/03-Session-based-Yoochoose-multigpu-training-PyT.ipynb b/examples/end-to-end-session-based/03-Session-based-Yoochoose-multigpu-training-PyT.ipynb index f5b95ae57d..ae14de7572 100644 --- a/examples/end-to-end-session-based/03-Session-based-Yoochoose-multigpu-training-PyT.ipynb +++ b/examples/end-to-end-session-based/03-Session-based-Yoochoose-multigpu-training-PyT.ipynb @@ -54,7 +54,7 @@ "Learning objectives \n", "- Scaling training pipeline to multiple GPUs\n", "\n", - "Prerequisities\n", + "Prerequisites\n", "- Run the 01-ETL-with-NVTabular.ipynb notebook first to generate the dataset and directories that are also needed by this notebook." ] }, From 095208557c6db3962eb2eb3b84445bb174ed248b Mon Sep 17 00:00:00 2001 From: rnyak Date: Tue, 8 Nov 2022 15:38:20 -0800 Subject: [PATCH 4/4] Update nb wrt Gabriel's comments --- ...ased-Yoochoose-multigpu-training-PyT.ipynb | 21 ++++++------------- 1 file changed, 6 insertions(+), 15 deletions(-) diff --git a/examples/end-to-end-session-based/03-Session-based-Yoochoose-multigpu-training-PyT.ipynb b/examples/end-to-end-session-based/03-Session-based-Yoochoose-multigpu-training-PyT.ipynb index ae14de7572..bf6c30c0a4 100644 --- a/examples/end-to-end-session-based/03-Session-based-Yoochoose-multigpu-training-PyT.ipynb +++ b/examples/end-to-end-session-based/03-Session-based-Yoochoose-multigpu-training-PyT.ipynb @@ -32,7 +32,9 @@ "source": [ "\n", "\n", - "# Multi-GPU training for session-based recommendations with PyTorch" + "# Multi-GPU training for session-based recommendations with PyTorch\n", + "\n", + "This notebook was prepared by using the latest [merlin-pytorch:22.XX](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-pytorch/tags) container." ] }, { @@ -45,7 +47,7 @@ "There are multiple ways to scale a training pipeline to multiple GPUs: \n", "\n", "- Model Parallel: If the model is too large to fit on a single GPU, the parameters are distributed over multiple GPUs. This is usually the case for the RecSys domain since the embedding tables can be exceptionally large and memory intensive.\n", - "- Data Parallel: Every GPU has a copy of all model parameters and runs the forward/backward pass for its batch. This is used when the model can fit in one GPU memory, but the dataset is too large and must be distributed over multiple GPUs.\n", + "- Data Parallel: Every GPU has a copy of all model parameters and runs the forward/backward pass for its batch. Data parallel is useful when you want to speed-up the training/evaluation of data leveraging multiple GPUs in parallel (as typically data won't fit into GPU memory, that is why models are trained on batches).\n", "\n", "In this example, we demonstrate how to scale a training pipeline to multi-GPU, single node. The goal is to maximize throughput and reduce training time. In that way, models can be trained more frequently and researches can run more experiments in a shorter time duration. \n", "\n", @@ -93,7 +95,7 @@ "metadata": {}, "source": [ "Please note these important points that are relevant to multi-gpu training:\n", - "- specifying multiple GPUs: we added the setting CUDA_VISIBLE_DEVICES to \"0,1\" at the top of our script to indicate that we will be using 2 GPUs. 
PyTorch distributed launch environment will recognize this and perform the training loop on multiple GPUs (2 in this case) using different batches of the data in a data-parallel fashion.\n", + "- specifying multiple GPUs: PyTorch distributed launch environment will recognize that we have two GPUs, since the `--nproc_per_node` arg of `torch.distributed.launch` takes care of assigning one GPU per process and performs the training loop on multiple GPUs (2 in this case) using different batches of the data in a data-parallel fashion.\n", "- data repartitioning: when training on multiple GPUs, data must be re-partitioned into >1 partitions where the number of partitions must be at least equal to the number of GPUs. The torch utility library in Transformers4Rec does this automatically and outputs a UserWarning message. If you would like to avoid this warning message, you may choose to manually re-partition your data files before you launch the training loop or function. See [this document](https://nvidia-merlin.github.io/Transformers4Rec/main/multi_gpu_train.html#distributeddataparallel) for further information on how to do manual re-partitioning.\n", "- training and evaluation batch sizes: in the default DistributedDataParallel mode we will be running, keeping the batch size unchanged means each worker will receive the same-size batch despite the fact that you are now using multiple GPUs. If you would like to keep the total batch size constant, you may want to divide the training and evaluation batch sizes by the number of GPUs you are running on, which is expected to reduce time it takes train and evaluate on each batch." ] @@ -108,9 +110,6 @@ "%%writefile './pyt_trainer.py'\n", "\n", "import os\n", - "\n", - "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0,1\"\n", - "\n", "import glob\n", "import torch \n", "\n", @@ -262,7 +261,7 @@ "id": "847757ca-3c91-4504-b9dc-29127563da01", "metadata": {}, "source": [ - "Congratulations!!! You successfully trained your model using 2 GPUs with a data-parallel approach. If you choose, you may now go back and experiment with some of the hyperparameters (eg. learning rate, batch sizes, number of GPUs) to collect information on various accuracy metrics as well as total training time, to see what fits best into your workflow." + "Congratulations!!! You successfully trained your model using 2 GPUs with a `distributed data parallel` approach. If you choose, you may now go back and experiment with some of the hyperparameters (eg. learning rate, batch sizes, number of GPUs) to collect information on various accuracy metrics as well as total training time, to see what fits best into your workflow." ] }, { @@ -284,14 +283,6 @@ "\n", "- Merlin NVTabular: https://github.com/NVIDIA-Merlin/NVTabular/tree/main/nvtabular" ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "88a6a17a-e952-481f-8ecf-b0f9e0eb9b13", - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": {
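
To make the data-parallel mechanics described in this last patch concrete, the sketch below shows, in generic PyTorch terms, what each of the two processes spawned by `--nproc_per_node 2` ends up doing. This is not code from the notebook — the Transformers4Rec / Hugging Face `Trainer` takes care of the equivalent setup when launched this way — and the `worker_main` helper name is made up for illustration.

```python
# Generic sketch (assumption: not part of pyt_trainer.py) of the per-process setup that
# torch.distributed.launch / torchrun and the Trainer handle automatically.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def worker_main(model: torch.nn.Module) -> torch.nn.Module:
    # The launcher starts one process per GPU and exposes the process's GPU index
    # via the LOCAL_RANK environment variable (or the --local_rank argument).
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # All processes join the same process group; MASTER_ADDR, MASTER_PORT, RANK and
    # WORLD_SIZE are set by the launcher.
    dist.init_process_group(backend="nccl")

    # Every process holds a full copy of the parameters and runs forward/backward on
    # its own batch; DDP all-reduces the gradients so the copies stay in sync.
    return DDP(model.cuda(local_rank), device_ids=[local_rank])
```

If you also want to keep the global batch sizes at the single-GPU values of 384 and 512 while training on 2 GPUs, you can pass `--per-device-train-batch-size 192 --per-device-eval-batch-size 256` to `pyt_trainer.py` in the launch command above. Newer PyTorch releases ship `torchrun` as the successor to `torch.distributed.launch`, although the script's `--local_rank` handling may need a small adjustment in that case.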