Add multi-gpu training example for T4Rec PyTorch #521

Merged 5 commits on Nov 9, 2022
Changes from 3 commits
@@ -0,0 +1,321 @@
{
Review comment (Member, via ReviewNB):
In the following sentence, maybe worth adding that, in the context of RecSys, most of the parameters sit in the embedding tables:

  • Model Parallel: If the model is too large to fit on a single GPU, the parameters are distributed over multiple GPUs


Review comment (Member, via ReviewNB):
  • specifying multiple GPUs

I believe that when working with DistributedDataParallel the CUDA_VISIBLE_DEVICES env variable is not really considered by torch.distributed.launch, as the launcher spawns one process per GPU (which is provided via the --local_rank arg). The number of GPUs is defined by the --nproc_per_node argument.

You say "using different batches of the data in a model-parallel fashion", but that is in fact a data-parallel fashion.

  • data repartitioning:

You could link here to this doc that explains how parquet data can be partitioned when saving

  • I think we need to provide instructions on how to download the data required to run this example (maybe pointing to the example notebook that explains that)


Review comment (Member, via ReviewNB):
I think we could replace the last sentence of this paragraph by saying that data parallel is useful when you want to speed up training/evaluation of the data by leveraging multiple GPUs in parallel (as the data typically won't fit into GPU memory anyway, which is why models are trained on batches):

  • Data Parallel: Every GPU has a copy of all model parameters and runs the forward/backward pass for its batch. This is used when the model can fit in one GPU memory, but the dataset is too large and must be distributed over multiple GPUs.



"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "0fb18ebc",
"metadata": {},
"outputs": [],
"source": [
"# Copyright 2022 NVIDIA Corporation. All Rights Reserved.\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# http://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License.\n",
"# =============================================================================="
]
},
{
"cell_type": "markdown",
"id": "378ff149",
"metadata": {
"tags": []
},
"source": [
"<img src=\"https://developer.download.nvidia.com/notebooks/dlsw-notebooks/merlin_transformers4rec_end-to-end-session-based-02-end-to-end-session-based-with-yoochoose-pyt/nvidia_logo.png\" style=\"width: 90px; float: right;\">\n",
"\n",
"# Multi-GPU training for session-based recommendations with PyTorch"
]
},
{
"cell_type": "markdown",
"id": "d8642b00",
"metadata": {},
"source": [
"In the previous two notebooks, we have first created sequential features and saved our processed data frames as parquet files. Then we used these processed parquet files in training a session-based recommendation model with XLNet architecture on a single GPU. We will now expand this exercise to perform a multi-GPU training with the same dataset using PyTorch.\n",
"\n",
"There are multiple ways to scale a training pipeline to multiple GPUs: \n",
"\n",
"- <b>Model Parallel</b>: If the model is too large to fit on a single GPU, the parameters are distributed over multiple GPUs. This is usually the case for the RecSys domain since the embedding tables can be exceptionally large and memory intensive.\n",
"- <b>Data Parallel</b>: Every GPU has a copy of all model parameters and runs the forward/backward pass for its batch. This is used when the model can fit in one GPU memory, but the dataset is too large and must be distributed over multiple GPUs.\n",
"\n",
"In this example, we demonstrate how to scale a training pipeline to multi-GPU, single node. The goal is to maximize throughput and reduce training time. In that way, models can be trained more frequently and researches can run more experiments in a shorter time duration. \n",
"\n",
"This is equivalent to training with a larger batch-size. As we are using more GPUs, we have more computational resources and can achieve higher throughput. In data parallel training, it is often required that all model parameters fit into a single GPU. Every worker (each GPU) has a copy of the model parameters and runs the forward pass on their local batch. The workers synchronize the gradients with each other, which can introduce an overhead.\n",
"\n",
"<b>Learning objectives</b> \n",
"- Scaling training pipeline to multiple GPUs\n",
"\n",
"<b>Prerequisites</b>\n",
"- Run the <b>01-ETL-with-NVTabular.ipynb</b> notebook first to generate the dataset and directories that are also needed by this notebook."
]
},
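{
"cell_type": "markdown",
"id": "ddp-sketch-note",
"metadata": {},
"source": [
"The next cell is a minimal, generic sketch of data-parallel training with plain PyTorch DistributedDataParallel (DDP), added for illustration only. The toy model and random data in it are placeholders and are not used anywhere else in this example; the Transformers4Rec Trainer used later in this notebook takes care of this wiring for you."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ddp-sketch-code",
"metadata": {},
"outputs": [],
"source": [
"# Illustration only: a minimal DistributedDataParallel (DDP) sketch.\n",
"# The toy model and random data below are placeholders for demonstration.\n",
"import os\n",
"import torch\n",
"import torch.distributed as dist\n",
"from torch.nn.parallel import DistributedDataParallel as DDP\n",
"\n",
"def ddp_sketch():\n",
"    # torch.distributed.launch (or torchrun) spawns one process per GPU\n",
"    # and tells each process which local GPU it owns.\n",
"    local_rank = int(os.environ.get(\"LOCAL_RANK\", 0))\n",
"    dist.init_process_group(backend=\"nccl\")\n",
"    torch.cuda.set_device(local_rank)\n",
"\n",
"    # Every worker keeps a full copy of the model parameters ...\n",
"    model = DDP(torch.nn.Linear(16, 1).cuda(local_rank), device_ids=[local_rank])\n",
"    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)\n",
"\n",
"    # ... and runs forward/backward on its own batch; gradients are\n",
"    # averaged across workers during backward().\n",
"    for _ in range(10):\n",
"        x = torch.randn(32, 16, device=f\"cuda:{local_rank}\")\n",
"        y = torch.randn(32, 1, device=f\"cuda:{local_rank}\")\n",
"        loss = torch.nn.functional.mse_loss(model(x), y)\n",
"        optimizer.zero_grad()\n",
"        loss.backward()\n",
"        optimizer.step()\n",
"\n",
"    dist.destroy_process_group()\n",
"\n",
"# (This function would be run by each process spawned by the launcher.)"
]
},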
{
"cell_type": "markdown",
"id": "34bc0fdb",
"metadata": {},
"source": [
"### 1. Creating a multi-gpu training python script "
]
},
{
"cell_type": "markdown",
"id": "78d52653",
"metadata": {},
"source": [
"In this example, we will be using PyTorch, which expects all code including importing libraries, loading data, building the model and training it, to be in a single python script (.py file). We will then spawn torch.distributed.launch, which will distribute the training and evaluation tasks to 2 GPUs with the default <b>DistributedDataParallel</b> configuration.\n",
"\n",
"The following cell exports all related code to a <b>pyt_trainer.py</b> file to be created in the same working directory as this notebook. The code is structured as follows:\n",
"\n",
"- importing required libraries\n",
"- specifying and processing command line arguments\n",
"- specifying the schema file to load and filtering the features that will be used in training\n",
"- defining the input module\n",
"- specifying the prediction task\n",
"- defining the XLNet Transformer architecture configuration\n",
"- defining the model object and its training arguments\n",
"- creating a trainer object and running the training loop (over multiple days) with the trainer object\n",
"\n",
"All of these steps were covered in the previous two notebooks; feel free to visit the two notebooks to refresh the concepts for each step of the script.\n"
]
},
{
"cell_type": "markdown",
"id": "ff36ee4d-6b8a-4c03-b99c-7182c6894fbe",
"metadata": {},
"source": [
"Please note these <b>important points</b> that are relevant to multi-gpu training:\n",
"- <b>specifying multiple GPUs</b>: we added the setting CUDA_VISIBLE_DEVICES to \"0,1\" at the top of our script to indicate that we will be using 2 GPUs. PyTorch distributed launch environment will recognize this and perform the training loop on multiple GPUs (2 in this case) using different batches of the data in a data-parallel fashion.\n",
"- <b>data repartitioning</b>: when training on multiple GPUs, data must be re-partitioned into >1 partitions where the number of partitions must be at least equal to the number of GPUs. The torch utility library in Transformers4Rec does this automatically and outputs a UserWarning message. If you would like to avoid this warning message, you may choose to manually re-partition your data files before you launch the training loop or function. See [this document](https://nvidia-merlin.github.io/Transformers4Rec/main/multi_gpu_train.html#distributeddataparallel) for further information on how to do manual re-partitioning.\n",
"- <b>training and evaluation batch sizes</b>: in the default DistributedDataParallel mode we will be running, keeping the batch size unchanged means each worker will receive the same-size batch despite the fact that you are now using multiple GPUs. If you would like to keep the total batch size constant, you may want to divide the training and evaluation batch sizes by the number of GPUs you are running on, which is expected to reduce time it takes train and evaluate on each batch."
]
},
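{
"cell_type": "markdown",
"id": "repartition-sketch-note",
"metadata": {},
"source": [
"The next cell is a small, optional sketch of the manual re-partitioning mentioned above, shown here with dask.dataframe as one possible approach. The input and output paths are placeholders based on the directories used in this example, and running it is not required, since the Transformers4Rec torch utilities re-partition the data automatically."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "repartition-sketch-code",
"metadata": {},
"outputs": [],
"source": [
"# Optional illustration: manually re-partition one day of training data into\n",
"# (at least) one parquet partition per GPU. Paths below are placeholders.\n",
"import dask.dataframe as dd\n",
"\n",
"ddf = dd.read_parquet(\"./preproc_sessions_by_day/178/train.parquet\")\n",
"# Write the same rows back as 2 partitions (one per GPU used in this example)\n",
"ddf.repartition(npartitions=2).to_parquet(\"./preproc_sessions_by_day/178/train_repartitioned\")"
]
},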
{
"cell_type": "code",
"execution_count": null,
"id": "6c45c899-9c88-4235-8402-03f7f3e40841",
"metadata": {},
"outputs": [],
"source": [
"%%writefile './pyt_trainer.py'\n",
"\n",
"import os\n",
"\n",
"os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0,1\"\n",
"\n",
"import glob\n",
"import torch \n",
"\n",
"from transformers4rec import torch as tr\n",
"from transformers4rec.torch.ranking_metric import NDCGAt, AvgPrecisionAt, RecallAt\n",
"from transformers4rec.torch.utils.examples_utils import wipe_memory\n",
"\n",
"import argparse\n",
"\n",
"# define arguments that can be passed to this python script\n",
"parser = argparse.ArgumentParser(description='Hyperparameters for model training')\n",
"parser.add_argument('--local_rank', type=int, default=0)\n",
"parser.add_argument('--path', type=str, help='Directory with training and validation data')\n",
"parser.add_argument('--learning-rate', type=float, default=0.0005, help='Learning rate for training')\n",
"parser.add_argument('--per-device-train-batch-size', type=int, default=384, help='Per device batch size for training')\n",
"parser.add_argument('--per-device-eval-batch-size', type=int, default=512, help='Per device batch size for evaluation')\n",
"sh_args = parser.parse_args()\n",
"\n",
"# create the schema object by reading the schema.pbtxt file generated by NVTabular pipeline in the previous 01-ETL-with-NVTabular notebook\n",
"from merlin_standard_lib import Schema\n",
"SCHEMA_PATH = \"schema_demo.pb\"\n",
"schema = Schema().from_proto_text(SCHEMA_PATH)\n",
"\n",
"# select the subset of features we want to use for training the model by their tags or their names.\n",
"schema = schema.select_by_name(\n",
" ['item_id-list_seq', 'category-list_seq', 'product_recency_days_log_norm-list_seq', 'et_dayofweek_sin-list_seq']\n",
")\n",
"\n",
"max_sequence_length, d_model = 20, 320\n",
"# Define input module to process tabular input-features and to prepare masked inputs\n",
"input_module = tr.TabularSequenceFeatures.from_schema(\n",
" schema,\n",
" max_sequence_length=max_sequence_length,\n",
" continuous_projection=64,\n",
" aggregation=\"concat\",\n",
" d_output=d_model,\n",
" masking=\"mlm\",\n",
")\n",
"\n",
"# Define Next item prediction-task \n",
"prediction_task = tr.NextItemPredictionTask(hf_format=True, weight_tying=True)\n",
"\n",
"# Define the config of the XLNet Transformer architecture\n",
"transformer_config = tr.XLNetConfig.build(\n",
" d_model=d_model, n_head=8, n_layer=2, total_seq_length=max_sequence_length\n",
")\n",
"\n",
"# Get the end-to-end model \n",
"model = transformer_config.to_torch_model(input_module, prediction_task)\n",
"\n",
"# Set training arguments \n",
"training_args = tr.trainer.T4RecTrainingArguments(\n",
" output_dir=\"./tmp\",\n",
" max_sequence_length=20,\n",
" data_loader_engine='nvtabular',\n",
" num_train_epochs=10, \n",
" dataloader_drop_last=False,\n",
" per_device_train_batch_size = sh_args.per_device_train_batch_size,\n",
" per_device_eval_batch_size = sh_args.per_device_eval_batch_size,\n",
" learning_rate=sh_args.learning_rate,\n",
" report_to = [],\n",
" logging_steps=200\n",
" )\n",
"\n",
"# Instantiate the trainer\n",
"recsys_trainer = tr.Trainer(\n",
" model=model,\n",
" args=training_args,\n",
" schema=schema,\n",
" compute_metrics=True)\n",
"\n",
"# Set input and output directories\n",
"INPUT_DIR=sh_args.path\n",
"OUTPUT_DIR=sh_args.path\n",
"\n",
"import time\n",
"start = time.time()\n",
"\n",
"# main loop for training\n",
"start_time_window_index = 178\n",
"final_time_window_index = 181\n",
"# Iterating over days from 178 to 181\n",
"for time_index in range(start_time_window_index, final_time_window_index):\n",
" # Set data \n",
" time_index_train = time_index\n",
" time_index_eval = time_index + 1\n",
" train_paths = glob.glob(os.path.join(OUTPUT_DIR, f\"{time_index_train}/train.parquet\"))\n",
" eval_paths = glob.glob(os.path.join(OUTPUT_DIR, f\"{time_index_eval}/valid.parquet\"))\n",
" print(train_paths)\n",
" \n",
" # Train on day related to time_index \n",
" print('*'*20)\n",
" print(\"Launch training for day %s are:\" %time_index)\n",
" print('*'*20 + '\\n')\n",
" recsys_trainer.train_dataset_or_path = train_paths\n",
" recsys_trainer.reset_lr_scheduler()\n",
" recsys_trainer.train()\n",
" recsys_trainer.state.global_step +=1\n",
" print('finished')\n",
" \n",
" # Evaluate on the following day\n",
" recsys_trainer.eval_dataset_or_path = eval_paths\n",
" train_metrics = recsys_trainer.evaluate(metric_key_prefix='eval')\n",
" print('*'*20)\n",
" print(\"Eval results for day %s are:\\t\" %time_index_eval)\n",
" print('\\n' + '*'*20 + '\\n')\n",
" for key in sorted(train_metrics.keys()):\n",
" print(\" %s = %s\" % (key, str(train_metrics[key]))) \n",
" wipe_memory()\n",
"\n",
"end = time.time()\n",
"print('Total training time:',end-start)\n"
]
},
{
"cell_type": "markdown",
"id": "8c8cb68e-a445-439f-a2ae-9c19f3669593",
"metadata": {},
"source": [
"### 2. Executing the multi-gpu training"
]
},
{
"cell_type": "markdown",
"id": "c57c0bcd-d9cb-4af5-9cc8-13e9783132d6",
"metadata": {},
"source": [
"You are now ready to execute the python script you created. Run the following shell command which will execute the script and perform data loading, model building and training using 2 GPUs.\n",
"\n",
"Note that there are four arguments you can pass to your python script, and only the first one is required:\n",
"- <b>path</b>: this argument specifies the directory in which to find the multi-day train and validation files to work on\n",
"- <b>learning rate</b>: you can experiment with different learning rates and see the effect of this hyperparameter when multiple GPUs are used in training (versus 1 GPU). Typically, increasing the learning rate (up to a certain level) as the number of GPUs is increased helps with the accuracy metrics.\n",
"- <b>per device batch size for training</b>: when using multiple GPUs in DistributedDataParallel mode, you may choose to reduce the batch size in order to keep the total batch size constant. This should help reduce training times.\n",
"- <b>per device batch size for evaluation</b>: see above"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1e5678d8-71a9-465f-a986-9ec68cf351ff",
"metadata": {},
"outputs": [],
"source": [
"!python -m torch.distributed.launch --nproc_per_node 2 pyt_trainer.py --path \"./preproc_sessions_by_day\" --learning-rate 0.0005"
]
},
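{
"cell_type": "markdown",
"id": "scaled-batch-note",
"metadata": {},
"source": [
"As an optional, illustrative variation of the command above, the next cell also passes the per-device batch-size arguments defined in pyt_trainer.py, halving their defaults (384 to 192 for training, 512 to 256 for evaluation) so that the total batch size over 2 GPUs stays the same as the script defaults would give on a single GPU."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "scaled-batch-run",
"metadata": {},
"outputs": [],
"source": [
"# Optional variation (illustration): keep the total batch size constant across 2 GPUs\n",
"# by halving the per-device train/eval batch sizes defined in pyt_trainer.py.\n",
"!python -m torch.distributed.launch --nproc_per_node 2 pyt_trainer.py --path \"./preproc_sessions_by_day\" --learning-rate 0.0005 --per-device-train-batch-size 192 --per-device-eval-batch-size 256"
]
},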
{
"cell_type": "markdown",
"id": "847757ca-3c91-4504-b9dc-29127563da01",
"metadata": {},
"source": [
"<b>Congratulations!!!</b> You successfully trained your model using 2 GPUs with a data-parallel approach. If you choose, you may now go back and experiment with some of the hyperparameters (eg. learning rate, batch sizes, number of GPUs) to collect information on various accuracy metrics as well as total training time, to see what fits best into your workflow."
]
},
{
"cell_type": "markdown",
"id": "68dd532f",
"metadata": {},
"source": [
"## References"
]
},
{
"cell_type": "markdown",
"id": "30f9073e",
"metadata": {},
"source": [
"- Multi-GPU data-parallel training using the Trainer class https://nvidia-merlin.github.io/Transformers4Rec/main/multi_gpu_train.html\n",
"\n",
"- Merlin Transformers4rec: https://github.com/NVIDIA-Merlin/Transformers4Rec\n",
"\n",
"- Merlin NVTabular: https://github.com/NVIDIA-Merlin/NVTabular/tree/main/nvtabular"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "88a6a17a-e952-481f-8ecf-b0f9e0eb9b13",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"interpreter": {
"hash": "7b543a88d374ac88bf8df97911b380f671b13649694a5b49eb21e60fd27eb479"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}