KEP-2170: Add AMD ROCm Torch Distributed Training Runtime #2335

Open
astefanutti opened this issue Nov 26, 2024 · 1 comment
Comments

@astefanutti
Contributor

What would you like to be added?

Support for a ROCm PyTorch distributed training runtime.

Why is this needed?

PyTorch has advertised support for AMD ROCm, on both AMD Instinct and Radeon GPUs, since version 2.0.

The latest generation of AMD Instinct accelerators, such as the MI300X, makes it possible to run state-of-the-art large-scale training jobs, as demonstrated in https://developers.redhat.com/articles/2024/10/03/amd-gpus-model-training-openshift-ai.

It would be great to bring a ROCm Torch distributed training runtime alongside the NVIDIA one added in #2328.
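For context, here is a minimal sketch (not part of the original request) illustrating why the training code itself can stay identical across the two runtimes: ROCm builds of PyTorch expose the torch.cuda API through HIP and map the nccl backend to RCCL, so the per-accelerator differences live mostly in the runtime definition, e.g. the container image and the extended resource name requested (typically amd.com/gpu rather than nvidia.com/gpu).

```python
# Minimal torch.distributed sketch; the same script runs on NVIDIA (CUDA/NCCL)
# and AMD ROCm (HIP/RCCL) builds of PyTorch without modification.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # "nccl" resolves to RCCL on ROCm builds of PyTorch.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)  # HIP device on ROCm

    # Toy model, just to exercise the collective communication path.
    model = DDP(torch.nn.Linear(128, 10).cuda(local_rank), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        x = torch.randn(32, 128, device=local_rank)
        loss = model(x).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with torchrun (e.g. `torchrun --nproc_per_node=<gpus per node> train.py`), which is how the existing Torch runtimes start workers, so only the image and GPU resource request would differ between the NVIDIA and AMD variants.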

More generally, this would be a useful opportunity to define how support for multiple accelerators should be managed across the training runtimes.

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.

@andreyvelich
Member

Thanks for creating this @astefanutti!
/remove-label lifecycle/needs-triage
/area runtime
