What would you like to be added?

As part of the initial Kubeflow Training V2 SDK implementation, we introduced a label that identifies which accelerators a Training Runtime uses: #2324 (comment).
If a Training Runtime has the training.kubeflow.org/accelerator: GPU-Tesla-V100-16GB label, we surface this value in the SDK's Runtime class. Additionally, we read the number of CPU, GPU, or TPU devices from the TrainJob's containers and insert this value into the TrainJob's Component class.
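For illustration, here is a minimal sketch of how that extraction could look. The Runtime dataclass and the get_container_devices helper are hypothetical stand-ins, not the actual SDK API; only the label key and the well-known Kubernetes resource names are taken from the discussion above:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Label key introduced in this issue.
ACCELERATOR_LABEL = "training.kubeflow.org/accelerator"

@dataclass
class Runtime:
    """Hypothetical stand-in for the SDK's Runtime class."""
    name: str
    accelerator: Optional[str] = None  # e.g. "GPU-Tesla-V100-16GB"

def get_container_devices(limits: Optional[dict]) -> Tuple[str, str]:
    """Derive (device type, count) from a container's resources.limits.

    Illustrative only: checks the standard Kubernetes resource names.
    """
    limits = limits or {}
    if "nvidia.com/gpu" in limits:
        return "gpu", str(limits["nvidia.com/gpu"])
    if "google.com/tpu" in limits:
        return "tpu", str(limits["google.com/tpu"])
    if "cpu" in limits:
        return "cpu", str(limits["cpu"])
    return "unknown", "0"

# Example: a trainer container requesting 4 GPUs.
print(get_container_devices({"nvidia.com/gpu": 4}))  # -> ('gpu', '4')
```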
However, this conflicts with other Kubernetes primitives (e.g. nodeSelectors, tolerations) and with Kueue configuration such as ResourceFlavors.
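To make the conflict concrete: the same hardware intent can already be declared in at least two other places, and nothing keeps them in sync with the runtime label. The values below are purely illustrative (shown as Python dicts mirroring the equivalent YAML):

```python
# 1. The Training Runtime advertises hardware via its label:
runtime_labels = {"training.kubeflow.org/accelerator": "GPU-Tesla-V100-16GB"}

# 2. The pod template may pin the same hardware through K8s primitives:
pod_spec = {
    "nodeSelector": {"cloud.google.com/gke-accelerator": "nvidia-tesla-v100"},
    "tolerations": [{"key": "nvidia.com/gpu", "operator": "Exists"}],
}

# 3. Kueue may model the same hardware again as a ResourceFlavor:
resource_flavor = {
    "metadata": {"name": "v100"},
    "spec": {"nodeLabels": {"cloud.google.com/gke-accelerator": "nvidia-tesla-v100"}},
}
```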
We should discuss the right way to expose the available hardware resources to users of Training Runtimes.

cc @franciscojavierarceo @kubeflow/wg-training-leads @astefanutti @Electronic-Waste @seanlaii @kannon92

Why is this needed?
Data Scientists and ML Engineers should be able to see which accelerators are available to them when using Training Runtimes.
In the future, we could potentially use these values to automatically assign model and data tensors to the appropriate hardware devices when using the Kubeflow Training SDK.
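As a rough sketch of that future direction (not an existing SDK feature), the accelerator value could drive device placement in the training code. The pick_device helper is hypothetical; only the standard PyTorch calls are real:

```python
import torch

def pick_device(accelerator):
    """Map a runtime accelerator hint to a torch device (illustrative only)."""
    if accelerator and accelerator.startswith("GPU") and torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

device = pick_device("GPU-Tesla-V100-16GB")
model = torch.nn.Linear(16, 4).to(device)  # model parameters placed on the device
batch = torch.randn(8, 16).to(device)      # data tensors follow the same device
output = model(batch)
```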
Love this feature?

Give it a 👍. We prioritize the features with the most 👍.
At a quick glance, I think you may need K8s primitives (nodeSelectors, tolerations) and Kueue resources like ResourceFlavors/ClusterQueues.
If you want to make the SDK data-science friendly, I tend to think you want users to specify Training Runtimes, and that "glue" could set the K8s primitives. I also think you may need to include things like images there too.
I see it as: a user says "I want an AMD GPU", and the admin has a set of values that get added to their PodSpec to achieve that.
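A hedged sketch of what that admin-maintained "glue" could look like; every name here (ACCELERATOR_PRESETS, apply_preset) is hypothetical, not an existing Kubeflow or Kueue API:

```python
# Admin-curated presets: accelerator name -> PodSpec fields to merge in.
ACCELERATOR_PRESETS = {
    "amd-gpu": {
        "image": "rocm/pytorch:latest",
        "nodeSelector": {"accelerator": "amd-gpu"},
        "tolerations": [{"key": "amd.com/gpu", "operator": "Exists"}],
        "resources": {"limits": {"amd.com/gpu": "1"}},
    },
}

def apply_preset(pod_spec: dict, accelerator: str) -> dict:
    """Merge the admin-defined values for `accelerator` into a PodSpec dict."""
    preset = ACCELERATOR_PRESETS[accelerator]
    container = pod_spec.setdefault("containers", [{}])[0]
    container["image"] = preset["image"]
    container["resources"] = preset["resources"]
    pod_spec["nodeSelector"] = preset["nodeSelector"]
    pod_spec["tolerations"] = preset["tolerations"]
    return pod_spec

# Usage: the user only names the accelerator; the glue fills in the rest.
spec = apply_preset({"containers": [{"name": "trainer"}]}, "amd-gpu")
```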