What would you like to be added?

As part of the initial Kubeflow Training V2 SDK implementation, we introduced a label that identifies which accelerators a Training Runtime uses: #2324 (comment).
If a Training Runtime has the training.kubeflow.org/accelerator: GPU-Tesla-V100-16GB label, we surface this value in the SDK's Runtime class. Additionally, we read the number of CPU, GPU, or TPU devices from the TrainJob's containers and insert this value into the TrainJob's Component class.
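For illustration, here is a minimal sketch of how that extraction could look. The Runtime dataclass and the get_container_devices helper are hypothetical stand-ins, not the actual SDK API; only the label key and the well-known Kubernetes resource names are taken from the discussion above:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Label key introduced in this issue.
ACCELERATOR_LABEL = "training.kubeflow.org/accelerator"

@dataclass
class Runtime:
    """Hypothetical stand-in for the SDK's Runtime class."""
    name: str
    accelerator: Optional[str] = None  # e.g. "GPU-Tesla-V100-16GB"

def get_container_devices(limits: Optional[dict]) -> Tuple[str, str]:
    """Derive (device type, count) from a container's resources.limits.

    Illustrative only: checks the standard Kubernetes resource names.
    """
    limits = limits or {}
    if "nvidia.com/gpu" in limits:
        return "gpu", str(limits["nvidia.com/gpu"])
    if "google.com/tpu" in limits:
        return "tpu", str(limits["google.com/tpu"])
    if "cpu" in limits:
        return "cpu", str(limits["cpu"])
    return "unknown", "0"

# Example: a trainer container requesting 4 GPUs.
print(get_container_devices({"nvidia.com/gpu": 4}))  # -> ('gpu', '4')
```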
However, this conflicts with other Kubernetes primitives (e.g. nodeSelectors, tolerations) and with Kueue configuration such as ResourceFlavors.
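To make the conflict concrete: the same hardware intent can already be declared in at least two other places, and nothing keeps them in sync with the runtime label. The values below are purely illustrative (shown as Python dicts mirroring the equivalent YAML):

```python
# 1. The Training Runtime advertises hardware via its label:
runtime_labels = {"training.kubeflow.org/accelerator": "GPU-Tesla-V100-16GB"}

# 2. The pod template may pin the same hardware through K8s primitives:
pod_spec = {
    "nodeSelector": {"cloud.google.com/gke-accelerator": "nvidia-tesla-v100"},
    "tolerations": [{"key": "nvidia.com/gpu", "operator": "Exists"}],
}

# 3. Kueue may model the same hardware again as a ResourceFlavor:
resource_flavor = {
    "metadata": {"name": "v100"},
    "spec": {"nodeLabels": {"cloud.google.com/gke-accelerator": "nvidia-tesla-v100"}},
}
```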
We should discuss the right way to expose the available hardware resources to users of Training Runtimes.

cc @franciscojavierarceo @kubeflow/wg-training-leads @astefanutti @Electronic-Waste @seanlaii @kannon92

Why is this needed?
Data Scientists and ML Engineers should be able to see which accelerators are available to them when using Training Runtimes.
In the future, we could potentially use these values to automatically assign model and data tensors to the appropriate hardware devices when using the Kubeflow Training SDK.
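As a rough sketch of that future direction (not an existing SDK feature), the accelerator value could drive device placement in the training code. The pick_device helper is hypothetical; only the standard PyTorch calls are real:

```python
import torch

def pick_device(accelerator):
    """Map a runtime accelerator hint to a torch device (illustrative only)."""
    if accelerator and accelerator.startswith("GPU") and torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

device = pick_device("GPU-Tesla-V100-16GB")
model = torch.nn.Linear(16, 4).to(device)  # model parameters placed on the device
batch = torch.randn(8, 16).to(device)      # data tensors follow the same device
output = model(batch)
```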
Love this feature?

Give it a 👍. We prioritize the features with the most 👍.
At a quick glance, I think you may need K8s primitives (nodeSelectors, tolerations) and Kueue resources like ResourceFlavors/ClusterQueues.
If you want to make the SDK data-science friendly, I tend to think you want users to specify Training Runtimes, and that "glue" could set the K8s primitives. I also think you may need to include things like images there too.
I see it as: a user says "I want an AMD GPU", and the admin has a set of values that get added to their PodSpec to achieve that.
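A hedged sketch of what that admin-maintained "glue" could look like; every name here (ACCELERATOR_PRESETS, apply_preset) is hypothetical, not an existing Kubeflow or Kueue API:

```python
# Admin-curated presets: accelerator name -> PodSpec fields to merge in.
ACCELERATOR_PRESETS = {
    "amd-gpu": {
        "image": "rocm/pytorch:latest",
        "nodeSelector": {"accelerator": "amd-gpu"},
        "tolerations": [{"key": "amd.com/gpu", "operator": "Exists"}],
        "resources": {"limits": {"amd.com/gpu": "1"}},
    },
}

def apply_preset(pod_spec: dict, accelerator: str) -> dict:
    """Merge the admin-defined values for `accelerator` into a PodSpec dict."""
    preset = ACCELERATOR_PRESETS[accelerator]
    container = pod_spec.setdefault("containers", [{}])[0]
    container["image"] = preset["image"]
    container["resources"] = preset["resources"]
    pod_spec["nodeSelector"] = preset["nodeSelector"]
    pod_spec["tolerations"] = preset["tolerations"]
    return pod_spec

# Usage: the user only names the accelerator; the glue fills in the rest.
spec = apply_preset({"containers": [{"name": "trainer"}]}, "amd-gpu")
```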