Add multi-gpu training example for T4Rec PyTorch #521

Merged 5 commits on Nov 9, 2022
Changes from 3 commits
@@ -0,0 +1,321 @@
{
Review comment (Member, via ReviewNB):
In the following sentence, maybe worth adding that, in the context of RecSys, most of the parameters sit in the embedding tables:

  • Model Parallel: If the model is too large to fit on a single GPU, the parameters are distributed over multiple GPUs


Review comment (Member, via ReviewNB):
  • specifying multiple GPUs

I believe that when working with DistributedDataParallel the CUDA_VISIBLE_DEVICES env variable is not really considered by torch.distributed.launch, as the launcher spawns one process per GPU (which is provided via the --local_rank arg). The number of GPUs is defined by the --nproc_per_node argument.

You say "using different batches of the data in a model-parallel fashion", but that is in fact a data-parallel fashion.

  • data repartitioning:

You could link here to this doc that explains how parquet data can be partitioned when saving

  • I think we need to provide instructions on how to download the data required to run this example (maybe pointing to the example notebook that explains that)


Review comment (Member, via ReviewNB):
I think we could replace the last sentence of this paragraph by saying that data parallel is useful when you want to speed up training/evaluation of the data by leveraging multiple GPUs in parallel (as the data typically won't fit into GPU memory anyway, which is why models are trained on batches):

  • Data Parallel: Every GPU has a copy of all model parameters and runs the forward/backward pass for its batch. This is used when the model can fit in one GPU memory, but the dataset is too large and must be distributed over multiple GPUs.



"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "0fb18ebc",
"metadata": {},
"outputs": [],
"source": [
"# Copyright 2022 NVIDIA Corporation. All Rights Reserved.\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# http://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License.\n",
"# =============================================================================="
]
},
{
"cell_type": "markdown",
"id": "378ff149",
"metadata": {
"tags": []
},
"source": [
"<img src=\"https://developer.download.nvidia.com/notebooks/dlsw-notebooks/merlin_transformers4rec_end-to-end-session-based-02-end-to-end-session-based-with-yoochoose-pyt/nvidia_logo.png\" style=\"width: 90px; float: right;\">\n",
"\n",
"# Multi-GPU training for session-based recommendations with PyTorch"
]
},
{
"cell_type": "markdown",
"id": "d8642b00",
"metadata": {},
"source": [
"In the previous two notebooks, we have first created sequential features and saved our processed data frames as parquet files. Then we used these processed parquet files in training a session-based recommendation model with XLNet architecture on a single GPU. We will now expand this exercise to perform a multi-GPU training with the same dataset using PyTorch.\n",
"\n",
"There are multiple ways to scale a training pipeline to multiple GPUs: \n",
"\n",
"- <b>Model Parallel</b>: If the model is too large to fit on a single GPU, the parameters are distributed over multiple GPUs. This is usually the case for the RecSys domain since the embedding tables can be exceptionally large and memory intensive.\n",
"- <b>Data Parallel</b>: Every GPU has a copy of all model parameters and runs the forward/backward pass for its batch. This is used when the model can fit in one GPU memory, but the dataset is too large and must be distributed over multiple GPUs.\n",
"\n",
"In this example, we demonstrate how to scale a training pipeline to multi-GPU, single node. The goal is to maximize throughput and reduce training time. In that way, models can be trained more frequently and researches can run more experiments in a shorter time duration. \n",
"\n",
"This is equivalent to training with a larger batch-size. As we are using more GPUs, we have more computational resources and can achieve higher throughput. In data parallel training, it is often required that all model parameters fit into a single GPU. Every worker (each GPU) has a copy of the model parameters and runs the forward pass on their local batch. The workers synchronize the gradients with each other, which can introduce an overhead.\n",
"\n",
"<b>Learning objectives</b> \n",
"- Scaling training pipeline to multiple GPUs\n",
"\n",
"<b>Prerequisites</b>\n",
"- Run the <b>01-ETL-with-NVTabular.ipynb</b> notebook first to generate the dataset and directories that are also needed by this notebook."
]
},
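{
"cell_type": "markdown",
"id": "ddp-sketch-note",
"metadata": {},
"source": [
"The next cell is a minimal, generic sketch of data-parallel training with plain PyTorch DistributedDataParallel (DDP), added for illustration only. The toy model and random data in it are placeholders and are not used anywhere else in this example; the Transformers4Rec Trainer used later in this notebook takes care of this wiring for you."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ddp-sketch-code",
"metadata": {},
"outputs": [],
"source": [
"# Illustration only: a minimal DistributedDataParallel (DDP) sketch.\n",
"# The toy model and random data below are placeholders for demonstration.\n",
"import os\n",
"import torch\n",
"import torch.distributed as dist\n",
"from torch.nn.parallel import DistributedDataParallel as DDP\n",
"\n",
"def ddp_sketch():\n",
"    # torch.distributed.launch (or torchrun) spawns one process per GPU\n",
"    # and tells each process which local GPU it owns.\n",
"    local_rank = int(os.environ.get(\"LOCAL_RANK\", 0))\n",
"    dist.init_process_group(backend=\"nccl\")\n",
"    torch.cuda.set_device(local_rank)\n",
"\n",
"    # Every worker keeps a full copy of the model parameters ...\n",
"    model = DDP(torch.nn.Linear(16, 1).cuda(local_rank), device_ids=[local_rank])\n",
"    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)\n",
"\n",
"    # ... and runs forward/backward on its own batch; gradients are\n",
"    # averaged across workers during backward().\n",
"    for _ in range(10):\n",
"        x = torch.randn(32, 16, device=f\"cuda:{local_rank}\")\n",
"        y = torch.randn(32, 1, device=f\"cuda:{local_rank}\")\n",
"        loss = torch.nn.functional.mse_loss(model(x), y)\n",
"        optimizer.zero_grad()\n",
"        loss.backward()\n",
"        optimizer.step()\n",
"\n",
"    dist.destroy_process_group()\n",
"\n",
"# (This function would be run by each process spawned by the launcher.)"
]
},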
{
"cell_type": "markdown",
"id": "34bc0fdb",
"metadata": {},
"source": [
"### 1. Creating a multi-gpu training python script "
]
},
{
"cell_type": "markdown",
"id": "78d52653",
"metadata": {},
"source": [
"In this example, we will be using PyTorch, which expects all code including importing libraries, loading data, building the model and training it, to be in a single python script (.py file). We will then spawn torch.distributed.launch, which will distribute the training and evaluation tasks to 2 GPUs with the default <b>DistributedDataParallel</b> configuration.\n",
"\n",
"The following cell exports all related code to a <b>pyt_trainer.py</b> file to be created in the same working directory as this notebook. The code is structured as follows:\n",
"\n",
"- importing required libraries\n",
"- specifying and processing command line arguments\n",
"- specifying the schema file to load and filtering the features that will be used in training\n",
"- defining the input module\n",
"- specifying the prediction task\n",
"- defining the XLNet Transformer architecture configuration\n",
"- defining the model object and its training arguments\n",
"- creating a trainer object and running the training loop (over multiple days) with the trainer object\n",
"\n",
"All of these steps were covered in the previous two notebooks; feel free to visit the two notebooks to refresh the concepts for each step of the script.\n"
]
},
{
"cell_type": "markdown",
"id": "ff36ee4d-6b8a-4c03-b99c-7182c6894fbe",
"metadata": {},
"source": [
"Please note these <b>important points</b> that are relevant to multi-gpu training:\n",
"- <b>specifying multiple GPUs</b>: we added the setting CUDA_VISIBLE_DEVICES to \"0,1\" at the top of our script to indicate that we will be using 2 GPUs. PyTorch distributed launch environment will recognize this and perform the training loop on multiple GPUs (2 in this case) using different batches of the data in a data-parallel fashion.\n",
"- <b>data repartitioning</b>: when training on multiple GPUs, data must be re-partitioned into >1 partitions where the number of partitions must be at least equal to the number of GPUs. The torch utility library in Transformers4Rec does this automatically and outputs a UserWarning message. If you would like to avoid this warning message, you may choose to manually re-partition your data files before you launch the training loop or function. See [this document](https://nvidia-merlin.github.io/Transformers4Rec/main/multi_gpu_train.html#distributeddataparallel) for further information on how to do manual re-partitioning.\n",
"- <b>training and evaluation batch sizes</b>: in the default DistributedDataParallel mode we will be running, keeping the batch size unchanged means each worker will receive the same-size batch despite the fact that you are now using multiple GPUs. If you would like to keep the total batch size constant, you may want to divide the training and evaluation batch sizes by the number of GPUs you are running on, which is expected to reduce time it takes train and evaluate on each batch."
]
},
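{
"cell_type": "markdown",
"id": "repartition-sketch-note",
"metadata": {},
"source": [
"The next cell is a small, optional sketch of the manual re-partitioning mentioned above, shown here with dask.dataframe as one possible approach. The input and output paths are placeholders based on the directories used in this example, and running it is not required, since the Transformers4Rec torch utilities re-partition the data automatically."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "repartition-sketch-code",
"metadata": {},
"outputs": [],
"source": [
"# Optional illustration: manually re-partition one day of training data into\n",
"# (at least) one parquet partition per GPU. Paths below are placeholders.\n",
"import dask.dataframe as dd\n",
"\n",
"ddf = dd.read_parquet(\"./preproc_sessions_by_day/178/train.parquet\")\n",
"# Write the same rows back as 2 partitions (one per GPU used in this example)\n",
"ddf.repartition(npartitions=2).to_parquet(\"./preproc_sessions_by_day/178/train_repartitioned\")"
]
},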
{
"cell_type": "code",
"execution_count": null,
"id": "6c45c899-9c88-4235-8402-03f7f3e40841",
"metadata": {},
"outputs": [],
"source": [
"%%writefile './pyt_trainer.py'\n",
"\n",
"import os\n",
"\n",
"os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0,1\"\n",
"\n",
"import glob\n",
"import torch \n",
"\n",
"from transformers4rec import torch as tr\n",
"from transformers4rec.torch.ranking_metric import NDCGAt, AvgPrecisionAt, RecallAt\n",
"from transformers4rec.torch.utils.examples_utils import wipe_memory\n",
"\n",
"import argparse\n",
"\n",
"# define arguments that can be passed to this python script\n",
"parser = argparse.ArgumentParser(description='Hyperparameters for model training')\n",
"parser.add_argument('--local_rank', type=int, default=0)\n",
"parser.add_argument('--path', type=str, help='Directory with training and validation data')\n",
"parser.add_argument('--learning-rate', type=float, default=0.0005, help='Learning rate for training')\n",
"parser.add_argument('--per-device-train-batch-size', type=int, default=384, help='Per device batch size for training')\n",
"parser.add_argument('--per-device-eval-batch-size', type=int, default=512, help='Per device batch size for evaluation')\n",
"sh_args = parser.parse_args()\n",
"\n",
"# create the schema object by reading the schema.pbtxt file generated by NVTabular pipeline in the previous 01-ETL-with-NVTabular notebook\n",
"from merlin_standard_lib import Schema\n",
"SCHEMA_PATH = \"schema_demo.pb\"\n",
"schema = Schema().from_proto_text(SCHEMA_PATH)\n",
"\n",
"# select the subset of features we want to use for training the model by their tags or their names.\n",
"schema = schema.select_by_name(\n",
" ['item_id-list_seq', 'category-list_seq', 'product_recency_days_log_norm-list_seq', 'et_dayofweek_sin-list_seq']\n",
")\n",
"\n",
"max_sequence_length, d_model = 20, 320\n",
"# Define input module to process tabular input-features and to prepare masked inputs\n",
"input_module = tr.TabularSequenceFeatures.from_schema(\n",
" schema,\n",
" max_sequence_length=max_sequence_length,\n",
" continuous_projection=64,\n",
" aggregation=\"concat\",\n",
" d_output=d_model,\n",
" masking=\"mlm\",\n",
")\n",
"\n",
"# Define Next item prediction-task \n",
"prediction_task = tr.NextItemPredictionTask(hf_format=True, weight_tying=True)\n",
"\n",
"# Define the config of the XLNet Transformer architecture\n",
"transformer_config = tr.XLNetConfig.build(\n",
" d_model=d_model, n_head=8, n_layer=2, total_seq_length=max_sequence_length\n",
")\n",
"\n",
"# Get the end-to-end model \n",
"model = transformer_config.to_torch_model(input_module, prediction_task)\n",
"\n",
"# Set training arguments \n",
"training_args = tr.trainer.T4RecTrainingArguments(\n",
" output_dir=\"./tmp\",\n",
" max_sequence_length=20,\n",
" data_loader_engine='nvtabular',\n",
" num_train_epochs=10, \n",
" dataloader_drop_last=False,\n",
" per_device_train_batch_size = sh_args.per_device_train_batch_size,\n",
" per_device_eval_batch_size = sh_args.per_device_eval_batch_size,\n",
" learning_rate=sh_args.learning_rate,\n",
" report_to = [],\n",
" logging_steps=200\n",
" )\n",
"\n",
"# Instantiate the trainer\n",
"recsys_trainer = tr.Trainer(\n",
" model=model,\n",
" args=training_args,\n",
" schema=schema,\n",
" compute_metrics=True)\n",
"\n",
"# Set input and output directories\n",
"INPUT_DIR=sh_args.path\n",
"OUTPUT_DIR=sh_args.path\n",
"\n",
"import time\n",
"start = time.time()\n",
"\n",
"# main loop for training\n",
"start_time_window_index = 178\n",
"final_time_window_index = 181\n",
"# Iterating over days from 178 to 181\n",
"for time_index in range(start_time_window_index, final_time_window_index):\n",
" # Set data \n",
" time_index_train = time_index\n",
" time_index_eval = time_index + 1\n",
" train_paths = glob.glob(os.path.join(OUTPUT_DIR, f\"{time_index_train}/train.parquet\"))\n",
" eval_paths = glob.glob(os.path.join(OUTPUT_DIR, f\"{time_index_eval}/valid.parquet\"))\n",
" print(train_paths)\n",
" \n",
" # Train on day related to time_index \n",
" print('*'*20)\n",
" print(\"Launch training for day %s are:\" %time_index)\n",
" print('*'*20 + '\\n')\n",
" recsys_trainer.train_dataset_or_path = train_paths\n",
" recsys_trainer.reset_lr_scheduler()\n",
" recsys_trainer.train()\n",
" recsys_trainer.state.global_step +=1\n",
" print('finished')\n",
" \n",
" # Evaluate on the following day\n",
" recsys_trainer.eval_dataset_or_path = eval_paths\n",
" train_metrics = recsys_trainer.evaluate(metric_key_prefix='eval')\n",
" print('*'*20)\n",
" print(\"Eval results for day %s are:\\t\" %time_index_eval)\n",
" print('\\n' + '*'*20 + '\\n')\n",
" for key in sorted(train_metrics.keys()):\n",
" print(\" %s = %s\" % (key, str(train_metrics[key]))) \n",
" wipe_memory()\n",
"\n",
"end = time.time()\n",
"print('Total training time:',end-start)\n"
]
},
{
"cell_type": "markdown",
"id": "8c8cb68e-a445-439f-a2ae-9c19f3669593",
"metadata": {},
"source": [
"### 2. Executing the multi-gpu training"
]
},
{
"cell_type": "markdown",
"id": "c57c0bcd-d9cb-4af5-9cc8-13e9783132d6",
"metadata": {},
"source": [
"You are now ready to execute the python script you created. Run the following shell command which will execute the script and perform data loading, model building and training using 2 GPUs.\n",
"\n",
"Note that there are four arguments you can pass to your python script, and only the first one is required:\n",
"- <b>path</b>: this argument specifies the directory in which to find the multi-day train and validation files to work on\n",
"- <b>learning rate</b>: you can experiment with different learning rates and see the effect of this hyperparameter when multiple GPUs are used in training (versus 1 GPU). Typically, increasing the learning rate (up to a certain level) as the number of GPUs is increased helps with the accuracy metrics.\n",
"- <b>per device batch size for training</b>: when using multiple GPUs in DistributedDataParallel mode, you may choose to reduce the batch size in order to keep the total batch size constant. This should help reduce training times.\n",
"- <b>per device batch size for evaluation</b>: see above"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1e5678d8-71a9-465f-a986-9ec68cf351ff",
"metadata": {},
"outputs": [],
"source": [
"!python -m torch.distributed.launch --nproc_per_node 2 pyt_trainer.py --path \"./preproc_sessions_by_day\" --learning-rate 0.0005"
]
},
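{
"cell_type": "markdown",
"id": "scaled-batch-note",
"metadata": {},
"source": [
"As an optional, illustrative variation of the command above, the next cell also passes the per-device batch-size arguments defined in pyt_trainer.py, halving their defaults (384 to 192 for training, 512 to 256 for evaluation) so that the total batch size over 2 GPUs stays the same as the script defaults would give on a single GPU."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "scaled-batch-run",
"metadata": {},
"outputs": [],
"source": [
"# Optional variation (illustration): keep the total batch size constant across 2 GPUs\n",
"# by halving the per-device train/eval batch sizes defined in pyt_trainer.py.\n",
"!python -m torch.distributed.launch --nproc_per_node 2 pyt_trainer.py --path \"./preproc_sessions_by_day\" --learning-rate 0.0005 --per-device-train-batch-size 192 --per-device-eval-batch-size 256"
]
},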
{
"cell_type": "markdown",
"id": "847757ca-3c91-4504-b9dc-29127563da01",
"metadata": {},
"source": [
"<b>Congratulations!!!</b> You successfully trained your model using 2 GPUs with a data-parallel approach. If you choose, you may now go back and experiment with some of the hyperparameters (eg. learning rate, batch sizes, number of GPUs) to collect information on various accuracy metrics as well as total training time, to see what fits best into your workflow."
]
},
{
"cell_type": "markdown",
"id": "68dd532f",
"metadata": {},
"source": [
"## References"
]
},
{
"cell_type": "markdown",
"id": "30f9073e",
"metadata": {},
"source": [
"- Multi-GPU data-parallel training using the Trainer class https://nvidia-merlin.github.io/Transformers4Rec/main/multi_gpu_train.html\n",
"\n",
"- Merlin Transformers4rec: https://github.com/NVIDIA-Merlin/Transformers4Rec\n",
"\n",
"- Merlin NVTabular: https://github.com/NVIDIA-Merlin/NVTabular/tree/main/nvtabular"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "88a6a17a-e952-481f-8ecf-b0f9e0eb9b13",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"interpreter": {
"hash": "7b543a88d374ac88bf8df97911b380f671b13649694a5b49eb21e60fd27eb479"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}