
[RMP] T4R - Support to multi-GPU training for binary classification and regression tasks #708

Closed
2 of 3 tasks
gabrielspmoreira opened this issue Oct 25, 2022 · 2 comments


@gabrielspmoreira
Member

gabrielspmoreira commented Oct 25, 2022

Problem:

Transformers4Rec supports multi-GPU training for the next-item prediction task because it uses the HF Trainer (RMP #522), which supports DataParallel / DistributedDataParallel under the hood.
The binary classification / regression tasks are not yet supported by the HF Trainer; instead, they are trained with a custom model.fit() method we provide, which does not support DataParallel / DistributedDataParallel.
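For context on what DistributedDataParallel provides (and what the HF Trainer manages automatically), here is a minimal sketch in plain PyTorch. It runs as a single process on CPU with the `gloo` backend; the toy model is a hypothetical stand-in for a T4R binary-classification head, not the actual Transformers4Rec API.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process setup so init_process_group works without a launcher;
# under torchrun, MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are set for you.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# Hypothetical stand-in for a binary-classification head: a toy model.
model = torch.nn.Sequential(torch.nn.Linear(16, 1), torch.nn.Sigmoid())
ddp_model = DDP(model)  # gradients are all-reduced across ranks on backward()

inputs = torch.randn(8, 16)
targets = torch.randint(0, 2, (8, 1)).float()
outputs = ddp_model(inputs)
loss = torch.nn.functional.binary_cross_entropy(outputs, targets)
loss.backward()  # DDP synchronizes gradients across workers here

dist.destroy_process_group()
```

Launched under `torchrun` with multiple processes, the same wrapper keeps model replicas in sync; this is the mechanism the HF Trainer exposes without requiring any of this boilerplate in user code.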

Goal:

  • Change the implementation of the binary classification / regression tasks so that they can be trained (with multi-GPU support) using the HF Trainer.

Starting Point:

@viswa-nvidia

@rnyak needs to confirm with @sararb and confirm the milestone

@rnyak
Contributor

rnyak commented Jan 9, 2023

We created a multi-GPU example in Transformers4Rec for the next-item prediction task. For the binary classification task, we shared a multi-GPU training PoC with the customer, and I will add the code snippet to the docs in the Transformers4Rec repo.
