Update the managedBy specifications
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
tenzen-y committed Aug 7, 2024
1 parent a4872f8 commit a9df00f
Showing 1 changed file with 26 additions and 0 deletions.
26 changes: 26 additions & 0 deletions docs/proposals/2170-kubeflow-training-v2/README.md
@@ -301,6 +301,18 @@ type TrainJobSpec struct {
Suspend *bool `json:"suspend,omitempty"`

// ManagedBy is used to indicate the controller or entity that manages a TrainJob.
// The value must be either empty, 'training-operator.kubeflow.org/trainjob-controller', or
// 'kueue.x-k8s.io/multikueue'.
// The built-in TrainJob controller reconciles TrainJobs that either don't have this
// field at all or whose field value is the reserved string
// 'training-operator.kubeflow.org/trainjob-controller', but delegates reconciling TrainJobs
// with the value 'kueue.x-k8s.io/multikueue' to Kueue.
//
// The value must be a valid domain-prefixed path (e.g. acme.io/foo) -
// all characters before the first "/" must be a valid subdomain as defined
// by RFC 1123. All characters trailing the first "/" must be valid HTTP Path
// characters as defined by RFC 3986. The value cannot exceed 63 characters.
// The field is immutable.
ManagedBy *string `json:"managedBy,omitempty"`
}
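
The rules stated in the `ManagedBy` comment above could be checked roughly as follows. This is only a minimal sketch, not the proposal's implementation: the constant and function names (`TrainJobControllerName`, `MultiKueueControllerName`, `validateManagedBy`, `validateManagedByUpdate`, `managedByBuiltInController`) are hypothetical, and treating an explicitly empty string the same as an unset field is an assumption about what "empty" means.

```go
package main

import (
	"errors"
	"fmt"
)

// Reserved values quoted from the ManagedBy field comment.
// The constant names themselves are hypothetical.
const (
	TrainJobControllerName   = "training-operator.kubeflow.org/trainjob-controller"
	MultiKueueControllerName = "kueue.x-k8s.io/multikueue"
)

// validateManagedBy accepts an unset field, an empty string (assumption),
// or one of the two reserved values, and rejects everything else.
func validateManagedBy(managedBy *string) error {
	if managedBy == nil {
		return nil
	}
	switch *managedBy {
	case "", TrainJobControllerName, MultiKueueControllerName:
		return nil
	default:
		return fmt.Errorf("spec.managedBy: unsupported value %q", *managedBy)
	}
}

// validateManagedByUpdate enforces immutability: the field may not be
// set, unset, or changed after creation.
func validateManagedByUpdate(oldValue, newValue *string) error {
	if (oldValue == nil) != (newValue == nil) {
		return errors.New("spec.managedBy: field is immutable")
	}
	if oldValue != nil && *oldValue != *newValue {
		return errors.New("spec.managedBy: field is immutable")
	}
	return validateManagedBy(newValue)
}

// managedByBuiltInController reports whether the built-in TrainJob
// controller should reconcile the TrainJob itself.
func managedByBuiltInController(managedBy *string) bool {
	return managedBy == nil || *managedBy == "" || *managedBy == TrainJobControllerName
}

func main() {
	multiKueue := MultiKueueControllerName
	custom := "acme.io/foo"

	fmt.Println("unset valid:", validateManagedBy(nil) == nil)
	fmt.Println("multikueue valid:", validateManagedBy(&multiKueue) == nil)
	fmt.Println("custom valid:", validateManagedBy(&custom) == nil)
	fmt.Println("built-in reconciles multikueue job:", managedByBuiltInController(&multiKueue))
}
```

Because only reserved values are accepted in this sketch, the domain-prefixed path and 63-character checks described in the comment are satisfied implicitly; they would need explicit validation if arbitrary values were ever allowed.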

@@ -1591,3 +1603,17 @@ framework that users want to run on Kubernetes.
Since frameworks share common functionality for distributed training (data parallelism or
model parallelism). For some specific use-cases like MPI or Elastic PyTorch, we will leverage
`MLSpec` parameter.

### Allow users to specify arbitrary values in the managedBy field

We could allow users to specify arbitrary values in the `.spec.managedBy` field of the TrainJob instead of restricting it
to an empty value, 'training-operator.kubeflow.org/trainjob-controller', or 'kueue.x-k8s.io/multikueue'.

However, arbitrary values would let users delegate TrainJobs to external or in-house customized training operators, which means that
the TrainJobs could be reconciled by controllers that are not guaranteed to comply with the TrainJob specification.

Specifically, an arbitrary training operator could introduce bugs in the status transitions.
So, we do not support arbitrary values until we find reasonable use cases in which external controllers
need to reconcile the TrainJob.

Note that we should implement validation of the status transitions once we support arbitrary values in the `managedBy` field.
