
[RMP] T4R - Support to multi-GPU training for binary classification and regression tasks #708

Closed
2 of 3 tasks
gabrielspmoreira opened this issue Oct 25, 2022 · 2 comments


@gabrielspmoreira
Member

gabrielspmoreira commented Oct 25, 2022

Problem:

Transformers4Rec supports multi-GPU training for the next-item prediction task because it uses the HF Trainer (RMP #522), which supports DataParallel / DistributedDataParallel under the hood.
The binary classification / regression tasks are not yet supported by the HF Trainer; instead, they are trained with a custom model.fit() method we provide, which does not support DataParallel / DistributedDataParallel.
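For context on what DistributedDataParallel provides (and what the HF Trainer manages automatically), here is a minimal sketch in plain PyTorch. It runs as a single process on CPU with the `gloo` backend; the toy model is a hypothetical stand-in for a T4R binary-classification head, not the actual Transformers4Rec API.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process setup so init_process_group works without a launcher;
# under torchrun, MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are set for you.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# Hypothetical stand-in for a binary-classification head: a toy model.
model = torch.nn.Sequential(torch.nn.Linear(16, 1), torch.nn.Sigmoid())
ddp_model = DDP(model)  # gradients are all-reduced across ranks on backward()

inputs = torch.randn(8, 16)
targets = torch.randint(0, 2, (8, 1)).float()
outputs = ddp_model(inputs)
loss = torch.nn.functional.binary_cross_entropy(outputs, targets)
loss.backward()  # DDP synchronizes gradients across workers here

dist.destroy_process_group()
```

Launched under `torchrun` with multiple processes, the same wrapper keeps model replicas in sync; this is the mechanism the HF Trainer exposes without requiring any of this boilerplate in user code.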

Goal:

  • Change the implementation of the binary classification / regression tasks so that they can be trained (with multi-GPU support) using the HF Trainer.

Starting Point:

@viswa-nvidia

@rnyak needs to confirm with @sararb and confirm the milestone

@rnyak
Contributor

rnyak commented Jan 9, 2023

We created a multi-GPU example in Transformers4Rec for the next-item prediction task. For the binary classification task, we shared a multi-GPU training PoC with the customer, and I will add the code snippet to the docs in the Transformers4Rec repo.
