-
Notifications
You must be signed in to change notification settings - Fork 705
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-2170: Make API specification more restricting #2198
Conversation
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
/assign @kubeflow/wg-training-leads |
Pull Request Test Coverage Report for Build 10307520952Details
💛 - Coveralls |
a9df00f
to
300df75
Compare
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
300df75
to
82d6fe5
Compare
cc: @mimowo |
// Namespace for the runtime. In that case, user should use TrainingRuntime | ||
Namespace *string `json:"namespace,omitempty"` | ||
// APIVersion is the apiVersion for the runtime. | ||
// Defaults to the v2alpha1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Defaults to the v2alpha1. | |
// Defaults to the kubeflow.org/v2alpha1. |
I'm wondering if we should align with the YAML apiVersion like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
APIVersion would indeed look more consistent with how objects are declared in yamls example. OTOH consider if you need a version at all - will you support other versions than the current? Maybe just APIGroup
is enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y Do we have plans to support different runtime version APIs in TrainJob ? e.g. v2alpha1
and v2beta1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming that we need to support multiple versions of TrainingRuntimes when we graduate it from Beta to Stable so that we align with the Kubernetes deprecation policy.
But, I don't think that we need to support multiple versions when we graduate the API from alpha to beta.
So, I'm wondering if we should drop this version field in the alpha and beta stages, and then if we should revisit this discussion to mitigate the GA graduation risks.
@andreyvelich @mimowo WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, let's do it in the future versions.
What are the benefits you see to add the APIGroup in the runtime reference if that will be always kubeflow.org
?
To give users possibility to build custom APIs for the training runtimes ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the benefits you see to add the APIGroup in the runtime reference if that will be always kubeflow.org ?
To give users possibility to build custom APIs for the training runtimes ?
Yes, once we provide the possibility to specify the apiGroup, vendors and users can implement the CRD instead of TrainingRuntime/ClusterTrainingRuntime and custom controller to build Jobs.
So, I'm thinking that adding the apiGroup
in the first iteration would be worth it. @andreyvelich WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich Oh, I actually still have not added the apiGroup
field in the KEP.
So, based on their reactions, I opened the PR to add the apiGroup
in the API design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, you are right. Let's add it in the followup PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update!
// Namespace for the runtime. In that case, user should use TrainingRuntime | ||
Namespace *string `json:"namespace,omitempty"` | ||
// APIVersion is the apiVersion for the runtime. | ||
// Defaults to the v2alpha1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
APIVersion would indeed look more consistent with how objects are declared in yamls example. OTOH consider if you need a version at all - will you support other versions than the current? Maybe just APIGroup
is enough.
// 'kueue.x-k8s.io/multikueue'. | ||
// The built-in TrainJob controller reconciles TrainJob which don't have this | ||
// field at all or the field value is the reserved string | ||
// 'training-operator.kubeflow.org/trainjob-controller', but delegates reconciling TrainJobs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
training-operator.kubeflow.org/trainjob-controller
is a bit long. What about the alternative: kubeflow.org/training-operator-v2
?
So, to match the naming of v1
we tentatively go with for v1: kubeflow.org/training-operator
(#2193)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should just do kubeflow.org/training-operator
or kubeflow.org/trainjob-controller
for V2?
Since we are planning to support only managedBy: kueue.x-k8s.io/multikueue
in V1 version: #2193 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we could probably use kubeflow.org/training-operator
both for v1 and v2. It is non-ambigues since API v1 can only be reconciled by the controller in the same version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH, I'm ok with either name kubeflow.org/training-operator
or training-operator.kubeflow.org/trainjob-controller
because I just wanted to align with the managedBy specifications in the batch/job and jobset.
In the batch/job and jobset, we use the ${OPERATOR_NAME}.${API_GROUP}/${CONTROLLER_NAME}
like jobset.sigs.k8s.io/jobset-controller
and kubernetes.io/job-controller
.
@mimowo Do you know the reason why we selected the above managedBy format in the batch/job and jobset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I found the jobset and batch/job actually use the ${API_GROUP}/${CONTROLLER_NAME}
.
So, never mind the above my comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, let's use the kubeflow.org/training-operator
as a reserved name so that we can align with batch/job and jobset.
EDIT: Actually, we reconcile the TrainJob by trainjob-controller instead of the training-opertor, and then the training-operator v2 contains multiple controllers like trainjob-controller and trainingruntime-controller.
So, I updated this section with kubeflow.org/trainjob-controller
. If you have any suggestions, please let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, maybe we should use the trainjob-controller
to make it consistent with the CRD name, JobSet is doing the same: https://github.com/kubernetes-sigs/jobset/blob/main/api/jobset/v1alpha2/jobset_types.go#L46
Name string `json:"name"` | ||
|
||
// Namespace for the runtime. In that case, user should use TrainingRuntime | ||
Namespace *string `json:"namespace,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with that!
// Namespace for the runtime. In that case, user should use TrainingRuntime | ||
Namespace *string `json:"namespace,omitempty"` | ||
// APIVersion is the apiVersion for the runtime. | ||
// Defaults to the v2alpha1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y Do we have plans to support different runtime version APIs in TrainJob ? e.g. v2alpha1
and v2beta1
|
||
// Kind for the runtime. TrainingRuntime or ClusterTrainingRuntime | ||
// Kind for the runtime, which must be TrainingRuntime or ClusterTrainingRuntime. | ||
// Defaults to the TrainingRuntime. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make Defaults to ClusterTrainingRuntime since in OSS we will implement them ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree with you.
I try to refine this comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Kind *string `json:"kind,omitempty"` | ||
|
||
// Version for the runtime. For example: v2alpha1 | ||
Version *string `json:"version,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we going to introduce Revision parameter in the future versions according to this ?
https://github.com/kubeflow/training-operator/blob/82d6fe581bacb0b4961b527f00385ece8fc1311c/docs/proposals/2170-kubeflow-training-v2/README.md#:~:text=In%20the%20future%20implementation%2C%20we%20will%20introduce%20a%20revision%20control%20mechanism%20similar%20to%20Kubernetes%20Deployment%20to%20control%20versions%20of%20TrainingRuntime%20and%20enable%20rolling%20updates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I still think that the runtime revision control would be useful.
…kubeflow.org/trainjob-controller' Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update @tenzen-y!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andreyvelich The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What this PR does / why we need it:
I updated the below API definitions for the Training v2 API:
trainingRuntimeRef
namespace
field since we should use the TrainingRuntime deployed in the same namespace with the TrainJob. Otherwise, users should define the ClusterTrainingRuntime instead of the namespace-scoped TrainingRuntime resource.version
field withapiVersion
since we are sometimes confused if the version indicates the runtime revision or API version.managedBy
managedBy
field #2193. ThemanagedBy
field is acceptable only for either an empty value, "training-operator.kubeflow.org/trainjob-controller" or "kueue.x-k8s.io/multikueue".Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...
format, will close the issue(s) when PR gets merged):Related: #2170
Checklist: