What would you like to be added?
As we discussed earlier, we want to design an approach for snapshotting a user's workspace into a TrainJob (e.g. a distributed ML workload): #2324 (comment).
To achieve this, we plan to generate a unique TrainJob ID on the client side, before the job is submitted to the Kubernetes control plane, so the workspace snapshot can be associated with the job ahead of creation.
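A minimal sketch of how that client-side ID generation could look; the `generate_trainjob_id` helper, the name format, and the S3 key layout are all hypothetical, not the SDK's actual behavior:

```python
import uuid


def generate_trainjob_id(prefix: str = "trainjob") -> str:
    """Generate a unique TrainJob name before submission.

    Kubernetes object names must be valid DNS-1123 labels, so the suffix is
    kept short and lowercase. The exact format here is a hypothetical choice.
    """
    return f"{prefix}-{uuid.uuid4().hex[:8]}"


# Because the ID exists before the job reaches the control plane, the
# workspace snapshot (e.g. an S3 key) can be tagged with it ahead of time.
job_id = generate_trainjob_id()
snapshot_key = f"workspaces/{job_id}.tar.gz"  # hypothetical S3 key layout
```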
During the KubeCon 2024 demo, we demonstrated how workspace snapshotting might work: https://youtu.be/Lgy4ir1AhYw?t=458.
In this demo, we pushed the Python code files to S3 and then loaded them into the TrainJob using initContainers.
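For illustration, the initContainer could look roughly like the sketch below (using the Python Kubernetes client); the image, bucket, paths, and the shared `workspace` volume are assumptions, not necessarily what the demo used:

```python
from kubernetes import client

# Hypothetical initContainer: download and unpack the workspace snapshot
# from S3 into a shared volume before the trainer container starts.
workspace_init = client.V1Container(
    name="workspace-downloader",
    image="amazon/aws-cli:2.15.0",  # assumed image
    env=[client.V1EnvVar(name="TRAINJOB_ID", value="trainjob-abc12345")],
    command=[
        "sh", "-c",
        "aws s3 cp s3://my-bucket/workspaces/${TRAINJOB_ID}.tar.gz /tmp/ws.tar.gz"
        " && tar -xzf /tmp/ws.tar.gz -C /workspace",
    ],
    volume_mounts=[
        client.V1VolumeMount(name="workspace", mount_path="/workspace"),
    ],
)
```

The trainer container would mount the same `workspace` volume and see the unpacked code at `/workspace`.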
However, we can consider various approaches, for instance:

- Using a distributed cache.
- Using `kubectl cp` (see the sketch below).
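A rough sketch of the `kubectl cp` route, driven from the Python SDK side; the pod-name derivation from the TrainJob ID is an assumption:

```python
import subprocess


def copy_workspace(trainjob_id: str, workspace_dir: str,
                   namespace: str = "default") -> None:
    """Copy the local workspace into the job's first node pod via kubectl cp.

    Note: `kubectl cp` requires `tar` in the target container, and the pod
    must already be running, so the trainer would need to wait for the files.
    """
    pod = f"{trainjob_id}-node-0-0"  # hypothetical pod naming scheme
    subprocess.run(
        ["kubectl", "cp", workspace_dir, f"{namespace}/{pod}:/workspace"],
        check=True,
    )
```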
Why is this needed?
This should streamline the Data Scientist user experience when working with the Kubeflow Training Python SDK.
Love this feature?
Give it a 👍. We prioritize the features with the most 👍.