
[feature] Reconsider the design of Trial Template #906

Closed

gaocegege opened this issue Oct 31, 2019 · 56 comments

Comments

@gaocegege
Member

/kind feature

Describe the solution you'd like

If we use the Kubernetes client-go, we have to marshal the TFJob into a JSON string and embed it in the experiment before creating it. That is not a good experience, and the Go template approach is ugly as well.
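For illustration, a minimal sketch of that pain point with client-go (assuming k8s.io/api and sigs.k8s.io/yaml; the placeholder names follow the examples later in this thread): a typed Job has to be flattened into an opaque string before it can be embedded in the experiment's rawTemplate.

package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	job := batchv1.Job{
		TypeMeta: metav1.TypeMeta{APIVersion: "batch/v1", Kind: "Job"},
		// The trial name and namespace are not known yet, so Go-template
		// placeholders have to be smuggled in as ordinary string values.
		ObjectMeta: metav1.ObjectMeta{Name: "{{.Trial}}", Namespace: "{{.NameSpace}}"},
		Spec: batchv1.JobSpec{
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:    "training",
						Image:   "docker.io/katib/mxnet-mnist-example",
						Command: []string{"python", "/mxnet/example/image-classification/train_mnist.py"},
					}},
				},
			},
		},
	}

	raw, err := yaml.Marshal(&job)
	if err != nil {
		panic(err)
	}
	// The result ends up as an opaque rawTemplate string inside the Experiment,
	// which is what makes experiments hard to construct and validate with client-go.
	fmt.Println(string(raw))
}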


@gaocegege
Member Author

/priority p0

/cc @johnugeorge @hougangliu @richardsliu

We can discuss here.

@hougangliu
Member

Based on the proposal in #857 (comment), I would refactor the struct as below:

type TrialTemplate struct {
	Retain   bool      `json:"retain,omitempty"`
	Template *Template `json:"template,omitempty"`
}

type ConfigmapTemplate struct {
	ConfigMapName      string `json:"configMapName,omitempty"`
	ConfigMapNamespace string `json:"configMapNamespace,omitempty"`
	Path               string `json:"path,omitempty"`
}

type InjectParametersKind string

const (
	GoTemplate InjectParametersKind = "goTemplate" // By default
	EnvVar     InjectParametersKind = "envVar"
	Args       InjectParametersKind = "args"
)

type Template struct {
	Kind              InjectParametersKind `json:"kind,omitempty"`
	ConfigmapTemplate *ConfigmapTemplate   `json:"configmapTemplate,omitempty"`
	RawTemplate       string               `json:"rawTemplate,omitempty"`
}

@gaocegege
Member Author

Should we support structured templates, e.g. TFJob and PyTorchJob?

@richardsliu
Contributor

@hougangliu Can you add examples of what a template would look like?

@hougangliu
Member

Here are examples of the goTemplate, envVar, and args kinds:

  1. goTemplate: keeps the current behavior; the template from the ConfigMap or rawTemplate is rendered as a Go template, with the hyperparameter set filling the .HyperParameters placeholder.
apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
 ...
spec:
  ...
  parameters:
    - name: --lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"
    - name: --num-layers
      parameterType: int
      feasibleSpace:
        min: "2"
        max: "5"
    - name: --optimizer
      parameterType: categorical
      feasibleSpace:
        list:
        - sgd
        - adam
        - ftrl
  trialTemplate:
    template:
        kind: "goTemplate"
        rawTemplate: |-
          apiVersion: batch/v1
          kind: Job
          metadata:
            name: {{.Trial}}
            namespace: {{.NameSpace}}
          spec:
            template:
              spec:
                containers:
                - name: {{.Trial}}
                  image: docker.io/katib/mxnet-mnist-example
                  command:
                  - "python"
                  - "/mxnet/example/image-classification/train_mnist.py"
                  - "--batch-size=64"
                  {{- with .HyperParameters}}
                  {{- range .}}
                  - "{{.Name}}={{.Value}}"
                  {{- end}}
                  {{- end}}
                restartPolicy: Never
  2. envVar: the hyperparameter set will be appended to the env of the containers in the template from the ConfigMap or rawTemplate. This kind is used when a model reads hyperparameters from environment variables. In the example below, /mxnet/example/image-classification/train_mnist.py in docker.io/katib/mxnet-mnist-example is expected to read the lr, num-layers, and optimizer environment variables as its hyperparameters.
apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
 ...
spec:
  ...
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"
    - name: num-layers
      parameterType: int
      feasibleSpace:
        min: "2"
        max: "5"
    - name: optimizer
      parameterType: categorical
      feasibleSpace:
        list:
        - sgd
        - adam
        - ftrl
  trialTemplate:
    template:
        kind: "envVar"
        rawTemplate: |-
          apiVersion: batch/v1
          kind: Job
          metadata:
            name: {{.Trial}}
            namespace: {{.NameSpace}}
          spec:
            template:
              spec:
                containers:
                - name: {{.Trial}}
                  image: docker.io/katib/mxnet-mnist-example
                  command:
                  - "python"
                  - "/mxnet/example/image-classification/train_mnist.py"
                  - "--batch-size=64"
                restartPolicy: Never
  3. args: the hyperparameter set will be appended to the args of the containers in the template from the ConfigMap or rawTemplate.
apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
 ...
spec:
  ...
  parameters:
    - name: --lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"
    - name: --num-layers
      parameterType: int
      feasibleSpace:
        min: "2"
        max: "5"
    - name: --optimizer
      parameterType: categorical
      feasibleSpace:
        list:
        - sgd
        - adam
        - ftrl
  trialTemplate:
    template:
        kind: "goTemplate"
        rawTemplate: |-
          apiVersion: batch/v1
          kind: Job
          metadata:
            name: {{.Trial}}
            namespace: {{.NameSpace}}
          spec:
            template:
              spec:
                containers:
                - name: {{.Trial}}
                  image: docker.io/katib/mxnet-mnist-example
                  command:
                  - "python"
                  - "/mxnet/example/image-classification/train_mnist.py"
                  args:
                  - "--batch-size=64"
                restartPolicy: Never

In summary, the goTemplate kind is the most extensible; it can cover the other two ways. Its only shortcoming is that some users find it hard to work with programmatically.
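For reference, a minimal sketch of how the goTemplate kind is rendered with Go's text/template; the struct and field names below simply mirror the placeholders in the examples above and are illustrative, not Katib's actual types.

package main

import (
	"os"
	"text/template"
)

type hyperParameter struct {
	Name  string
	Value string
}

// trialTemplateInput mirrors the placeholders used in the rawTemplate examples.
type trialTemplateInput struct {
	Trial           string
	NameSpace       string
	HyperParameters []hyperParameter
}

const rawTemplate = `command:
- "python"
- "/mxnet/example/image-classification/train_mnist.py"
- "--batch-size=64"
{{- with .HyperParameters}}
{{- range .}}
- "{{.Name}}={{.Value}}"
{{- end}}
{{- end}}
`

func main() {
	tmpl := template.Must(template.New("trial").Parse(rawTemplate))
	input := trialTemplateInput{
		Trial:     "random-example-abc123",
		NameSpace: "kubeflow",
		HyperParameters: []hyperParameter{
			{Name: "--lr", Value: "0.02"},
			{Name: "--num-layers", Value: "4"},
		},
	}
	// Prints the hydrated fragment; a controller would instead unmarshal the
	// full rendered manifest and submit the resulting Job.
	if err := tmpl.Execute(os.Stdout, input); err != nil {
		panic(err)
	}
}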

@richardsliu
Contributor

For Job-type trials, do we actually need anything more than "image" and "command"?

trialTemplate:
    template:
        kind: "goTemplate"
        rawTemplate: |-
          apiVersion: batch/v1     
          kind: Job
          metadata:
            name: {{.Trial}}         <--- This is auto-generated by the controller anyway
            namespace: {{.NameSpace}}   <-- This should be the same as experiment namespace
          spec:
            template:
              spec:
                containers:
                - name: {{.Trial}}       <-- Auto-generated
                  image: docker.io/katib/mxnet-mnist-example
                  command:
                  - "python"
                  - "/mxnet/example/image-classification/train_mnist.py"
                  - "--batch-size=64"
                  {{- with .HyperParameters}}         <-- All of this is unnecessary, just a placeholder
                  {{- range .}}
                  - "{{.Name}}={{.Value}}"
                  {{- end}}
                  {{- end}}
                restartPolicy: Never

So I think we can cut all of this down to just:

trialSpec:
    kind: Job
    image: ....
    command: ....

For distributed jobs it can be more complicated, but still most of the spec can be auto-generated:

trialSpec:
    kind: TFJob
    image: ....
    command: .....
    clusterSpec:
        worker: 4
        ps: 2

What do you think?

@gaocegege
Member Author

I'd prefer not to have our own spec with only image and command. If users want to set resource requirements, our abstraction would limit that scenario.

@johnugeorge
Member

@richardsliu I was thinking along the same lines earlier. But the biggest problem I see is that Katib will need framework-specific logic to generate the corresponding job type, e.g. generating a TFJob/PyTorchJob template.

@richardsliu
Contributor

@johnugeorge Then can we do something like:

  1. Allow importing Job/TFJob/PytorchJob yaml files directly (instead of using GoTemplates)
  2. Use Python AST or some other tool to replace the hyperparameters (perhaps from another file or configmap)
  3. Experiment controller can overwrite other metadata like name, namespace, etc.

Would this work? I think this would address @jlewi 's concerns in #857.

@johnugeorge
Member

johnugeorge commented Nov 27, 2019

@richardsliu Can you explain a bit more about 1)?

@richardsliu
Contributor

I was thinking along the lines of this answer: https://stackoverflow.com/a/9577670

Suppose we could somehow include the path to an existing Job/TFJob/PytorchJob definition (in valid YAML) as part of the trial template. We can then add another configmap for the hyperparameters to substitute. This means we can preserve all the fields in the original YAML templates, without cluttering the experiment definition. We can also generate the full experiment YAML programmatically.
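A rough sketch of that idea, under stated assumptions: the trial definition lives in one ConfigMap, the hyperparameter assignments in another, and a plain string substitution stands in for the Python AST (or similar) tooling mentioned above. All names here (ConfigMap names, keys, placeholder syntax) are hypothetical.

package main

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// renderTrial loads a complete Job/TFJob/PyTorchJob manifest from one ConfigMap
// and substitutes hyperparameter placeholders supplied by another, so the
// original YAML keeps all of its fields.
func renderTrial(ctx context.Context, client kubernetes.Interface, namespace string) (string, error) {
	tmplCM, err := client.CoreV1().ConfigMaps(namespace).Get(ctx, "trial-template", metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	paramCM, err := client.CoreV1().ConfigMaps(namespace).Get(ctx, "trial-parameters", metav1.GetOptions{})
	if err != nil {
		return "", err
	}

	rendered := tmplCM.Data["template.yaml"]
	for name, value := range paramCM.Data { // e.g. "--lr" -> "0.02"
		rendered = strings.ReplaceAll(rendered, fmt.Sprintf("${%s}", name), value)
	}
	return rendered, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	out, err := renderTrial(context.Background(), kubernetes.NewForConfigOrDie(cfg), "kubeflow")
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}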

@jlewi
Contributor

jlewi commented Dec 5, 2019

Tekton might be a good reference point for how to store templates and parameterize them.

@gaocegege
Member Author

@jlewi Thanks for the suggestion. Do you mean this way?

spec:
  resources:
    - name: source-repo
      resourceSpec:
        type: git
        params:
          - name: revision
            value: v0.32.0
          - name: url
            value: https://github.com/GoogleContainerTools/skaffold
    - name: web-image
      resourceSpec:
        type: image
        params:
          - name: url
            value: gcr.io/christiewilson-catfactory/leeroy-web
    - name: app-image
      resourceSpec:
        type: image
        params:
          - name: url
            value: gcr.io/christiewilson-catfactory/leeroy-app

@jlewi
Contributor

jlewi commented Dec 13, 2019

@gaocegege yes; to be clear, I'm suggesting we take that as inspiration, not copy it exactly.

@gaocegege
Member Author

Gotcha. Thanks

@andreyvelich
Member

andreyvelich commented Jan 17, 2020

To continue what @richardsliu proposed, my suggestion is this:

From Job, TFJob, or PyTorchJob we only need the Spec field, where the user can specify containers, resources, etc., as @gaocegege mentioned above.
So we can keep all supported specifications in TrialTemplate.

The API can be like this:

type TrialTemplate struct {
	Retain         bool                     `json:"retain,omitempty"`
	JobKind        string                   `json:"jobKind,omitempty"`
	JobSpec        v1.JobSpec               `json:"jobSpec,omitempty"`
	TFJobSpec      tfv1.TFJobSpec           `json:"tfJobSpec,omitempty"`
	PytorchJobSpec pytorchv1.PyTorchJobSpec `json:"pytorchJobSpec,omitempty"`
}

JobKind will be either Job, TFJob, or PyTorchJob, and we can run the corresponding job depending on the JobKind value.
The user can specify all required parameters in the job spec, and the Katib controller can add the rest, such as hyperparameters, metadata, container names, etc.
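For illustration, a minimal sketch of that dispatch, assuming the TrialTemplate type proposed above; the helper, error handling, and operator import paths are assumptions, not Katib's controller code.

package trial

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/apimachinery/pkg/runtime"

	// Operator APIs; exact import paths depend on the operator versions in use.
	pytorchv1 "github.com/kubeflow/pytorch-operator/pkg/apis/pytorch/v1"
	tfv1 "github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1"
)

// buildTrialJob dispatches on JobKind from the TrialTemplate proposed above
// and returns the typed object the controller would submit.
func buildTrialJob(t *TrialTemplate) (runtime.Object, error) {
	switch t.JobKind {
	case "Job":
		return &batchv1.Job{Spec: t.JobSpec}, nil
	case "TFJob":
		return &tfv1.TFJob{Spec: t.TFJobSpec}, nil
	case "PytorchJob":
		return &pytorchv1.PyTorchJob{Spec: t.PytorchJobSpec}, nil
	default:
		return nil, fmt.Errorf("unsupported jobKind %q", t.JobKind)
	}
}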
What do you think @gaocegege @johnugeorge @hougangliu ?

@gaocegege
Member Author

LGTM but prefer a pointer here.

type TrialTemplate struct {
	Retain         bool                     `json:"retain,omitempty"`
	JobKind        string                   `json:"jobKind,omitempty"`
	JobSpec        *v1.JobSpec               `json:"jobSpec,omitempty"`
	TFJobSpec      *tfv1.TFJobSpec           `json:"tfJobSpec,omitempty"`
	PytorchJobSpec *pytorchv1.PyTorchJobSpec `json:"pytorchJobSpec,omitempty"`
}

@johnugeorge
Member

One problem I see is that we would need a Katib API upgrade to support any new spec.

@gaocegege
Member Author

The API stays compatible since we would only add new kinds. I think the shortcoming is acceptable.

@andreyvelich
Member

@gaocegege @johnugeorge Yes, we can update TrialTemplate with the new fields and remove or keep GoTemplate in a future version. That will not break current Katib jobs.
The API can be something like this:

type TrialTemplate struct {
	Retain         bool                      `json:"retain,omitempty"`
	GoTemplate     *GoTemplate               `json:"goTemplate,omitempty"`
	JobKind        string                    `json:"jobKind,omitempty"`
	JobSpec        *v1.JobSpec               `json:"jobSpec,omitempty"`
	TFJobSpec      *tfv1.TFJobSpec           `json:"tfJobSpec,omitempty"`
	PytorchJobSpec *pytorchv1.PyTorchJobSpec `json:"pytorchJobSpec,omitempty"`
}

Personally, I think it would be better to rename TrialTemplate to TrialSpec in the future, since it would then be a specification rather than a template.
I am not sure whether that change is acceptable for us; it would require more changes in the controller.

@gaocegege
Member Author

I am not sure we should make it a spec, since the command in the spec is still a template, like:

                      command:
                        - "python"
                        - "/katib-ps-worker.py"
                        - "--train_steps"
                        - "5000"
                        {{- with .HyperParameters}}
                        {{- range .}}
                        - "{{.Name}}={{.Value}}"
                        {{- end}}
                        {{- end}}

@andreyvelich
Member

@gaocegege After we implement JobSpec, do you think we still need the Go template in the command?
Or could we add those parameters by default in the controller, so the user specifies only the basic command in the YAML file:

command:
  - "python"
  - "/katib-ps-worker.py"
  - "--train_steps"
  - "5000"

@hougangliu
Member

We'd better not restrict hyperparameters to being passed to the job only via args. In real use cases, users often read them from the environment.

@andreyvelich
Member

We'd better not restrict hyperparameters to being passed to the job only via args. In real use cases, users often read them from the environment.

We can add an additional field, ParametersKind, similar to JobKind, where the user can define whether parameters should be passed to the job via args or from env.

@gaocegege
Member Author

TODO: Share the design of Tekton

/assign @gaocegege

@jlewi
Contributor

jlewi commented Apr 5, 2020

No, I think we would not use Tekton itself; we would just learn how to design the CRD from it.

Shouldn't the question be whether a Tekton task is the right solution? If the API isn't right then I think it makes sense to design a Katib specific resource but I don't see why we would take on the cost of building another CR if one already exists.

Backing up a step. What is the problem this issue is trying to solve? You are trying to define a template resource that would allow Katib to stamp out new resources with specific inputs and obtain the outputs.

One of the key design questions appears to be how to handle special resources (e.g. TFJob, PyTorch, etc...).

All the proposals in this thread appear to go with a design where TrialTemplate is a union of supported types. This doesn't seem very scalable.

Furthermore, how does this solve the problem of reporting outputs back to Katib?

An alternative approach would be to support a single resource (like a Tekton Task) that just runs a container that must conform to a specific API; e.g. all outputs must be written to "/outputs/results.yaml".

You can now support any resource (e.g. PyTorch, TFJob, Notebook, etc.) just by writing a Task/container that acts as a bridge: submit the resource, wait for it to complete, then extract the metrics and report them in the format required by Katib.
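As an illustration of that contract, a hypothetical bridge container sketched in Go; the /outputs/results.yaml path and the results format are assumptions taken from the comment above, not an existing Katib or Tekton API.

package main

import (
	"fmt"
	"os"
)

func main() {
	// 1. Hyperparameters arrive as ordinary flags, e.g. --lr=0.02 --num-layers=4.
	params := os.Args[1:]

	// 2. Submit the real workload (TFJob, PyTorchJob, a notebook run, ...) and
	//    wait for it to complete. This part is specific to each bridge image.
	fmt.Println("submitting workload with params:", params)

	// 3. Write the extracted metrics in the agreed format and location so the
	//    orchestrator can pick them up without knowing the workload type.
	//    The payload below is a placeholder for illustration only.
	results := "metrics:\n- name: accuracy\n  value: \"0.97\"\n"
	if err := os.MkdirAll("/outputs", 0o755); err != nil {
		panic(err)
	}
	if err := os.WriteFile("/outputs/results.yaml", []byte(results), 0o644); err != nil {
		panic(err)
	}
}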

This is extensible. Anyone can write a Task and publish it without requiring any changes to Katib. This allows Katib to focus on orchestrating the hyperparameter tuning and not worrying about supporting individual resources.

Reusing an existing resource (e.g. Tekton Task) as opposed to introducing a new resource means that Katib can leverage and reuse Tekton tasks created for other purposes and not force people to write new Tekton tasks.

As an example, for our E2E tests we are creating Tekton task to run notebooks. So if Katib supports a Tekton task, then a user could reuse this Tekton task to easily add hyperparameter tuning for any notebook.

Similarly, if we define a library of Tekton tasks to run TFJob, PyTorch, etc... then those could be reused with Tekton.

Currently, from the Job we take only the first container as the training job:
Why? Isn't this brittle?

In that case, the user will have to install additional CRDs while deploying Katib.

Katib is already installing multiple CRDs and resources. What difference does it make whether some of these resources are created as part of Katib or come from other projects?

@gaocegege
Member Author

@jlewi Thanks for your comment. It is helpful. I will reply after a deep dive.

@czheng94

My thoughts after reading through @jlewi's comment:

Backing up a step. What is the problem this issue is trying to solve? You are trying to define a template resource that would allow Katib to stamp out new resources with specific inputs and obtain the outputs.

According to my understanding, we are designing a templated subresource that has these requirements:

  1. Input-controlled: Created by Trial with Hyperparameter assignments as input
  2. Output-retrieved: Managed by Trial to retrieve Metrics as output
  3. Status-monitored: Managed by Trial to determine Trial status (Running, Succeeded or Failed) from whatever the subresource status is

And our goals include:

  1. Make this subresource generic: In Katib, we should not need to change how we handle input, output and status of this subresource for PytorchJob, TFJob, batch Job, etc.
  2. Minimize the user effort to define this subresource
  3. Minimize possibility of user error when defining this subresource

From the discussion above, it seems like our options are:

  1. Use a goTemplate to define an arbitrary subresource.
  2. Enumerate subresource types we want to support, so that their definitions can be strongly-typed in the experiment CRD.
  3. Support a single generic resource that conforms to a generic API, and allow user to freely define how to inject input, report status and emit outputs.

Let’s look at these options one by one

Option 1: Use a goTemplate to define arbitrary subresource

This is what Katib is currently doing. Let’s see how it is addressing the requirements above:


  1. Input-controlled: It allows users to use goTemplates to define a completely arbitrary subresource and use placeholders to inject inputs.
  2. Output-retrieved: Metrics collectors are injected directly into the subresource pods to retrieve arbitrary outputs from the subresource pods.
  3. Status-monitored: It assumes all subresources' status.conditions conform to those of batch Jobs (https://bbgithub.dev.bloomberg.com/czheng94/katib/blob/2f29e65ad07aff76f7f971b20c8b5a6ab9065eb7/pkg/job/v1alpha3/job.go#L41-L60). Personally, I think this is bad and should be made more extensible.



And in terms of the goals:

  1. This subresource definition is generic, but handling of the status is hard to make generic, because you will have different status types for different kinds.
  2. Users only need to copy an existing "Job" and modify the argument list to use template placeholders.
  3. A lot of user errors can happen in the arbitrary job template because it is not typed at all.

Option 2: Enumerate subresource types we want to support, so that their definitions can be strongly-typed in the experiment CRD

For subresource requirements:

  1. Input-controlled: Since we want strong typing of these subresource definitions here, you can't use template placeholders to inject inputs anymore. Ideas proposed above include:
    • Mutate the subresource's env list
    • Mutate the subresource's args list
      I would argue that these methods are not explicit enough in terms of how the parameter assignments are applied, i.e. by inspecting the Experiment CR it is hard to figure out how my parameters are injected. As a user, I need to learn that "Oh, my hyperparameters will be appended to my argument list", and then "but what if I have existing parameters in the template arg list that collide with one of my hyperparameters?"
  2. Output-retrieved: Can work in the same way as option 1 using metrics collectors
  3. Status-monitored: Since all the permitted subresource types are enumerated, you can implement status monitoring for each of them.

And in terms of the goals:

  1. This subresource definition is not generic. You won't be able to support a user's in-house job types.
  2. User effort to define this subresource is minimized. Users can copy an existing subresource, and even starting from scratch is not too hard since it is strongly typed.
  3. User error is also minimized since the subresource is strongly-typed now.

Option 3: Support a single generic resource that conforms to a generic API, and allow user to freely define how to inject input, report status and emit outputs

Here, this subresource (which can be a Tekton Task) is a wrapper around the actual "Job" CRD that will be created and monitored. This subresource can be strongly typed to define the input hyperparameters, output metrics, and status. Users define how to handle input, output, and monitoring in the subresource's custom image.

For subresource requirements:

  1. Input-controlled: handled in the user-defined subresource image
  2. Output-retrieved: handled in the user-defined subresource image
  3. Status-monitored: handled in the user-defined subresource image

In terms of the goals:

  1. This subresource is generic. Users' in-house job types are supported by allowing them to custom-define the subresource image.
  2. There is a considerable amount of user effort needed to define this subresource. You need to write the control and monitoring logic for your "Job" and package it up in an image.
  3. Once the subresource is established, user error is minimal. Most user errors are moved out of Katib.

Some more thoughts in support of option 3

Actually, I have always believed hypertuning should act on an established resource (e.g. a registered Tekton Task), instead of defining one from scratch when starting an Experiment. Users will always implement and test the child job (e.g. a PyTorchJob) and make sure they are satisfied with its behavior before they tell Katib "OK, I'm ready to hypertune this".

Moreover, this established resource can be reused by other hypertuning Experiments; we could even factor in a tighter definition of its I/O model, using ParameterSpec to declare the input parameter type, range, scale, etc., and use those to construct default ParameterSpecs in Experiments.

With all of the above, I think we should support both option 2, for ease of hypertuning the known Kubeflow jobs, and option 3, for extensibility. The trial template would look like below:

type TrialTemplate struct {
	PytorchJob *pytorchjobv1.PyTorchJobSpec `json:"pytorchJob,omitempty"`
	TfJob      *tfjobv1.TFJobSpec           `json:"tfJob,omitempty"`
	Job        *batchv1.JobSpec             `json:"job,omitempty"`
	// HyperTunable is a generic abstract job that controls and monitors arbitrary subresources.
	HyperTunable *HyperTunableSpec `json:"hyperTunable,omitempty"`
}

type HyperTunableSpec struct {
	// Users are expected to declare all hyperparameters here.
	// Each spec can be as simple as only containing a hyperparameter name, or it can fully define a hyperparameter type.
	ParameterSpecs []ParameterSpec `json:"parameterSpecs"`
	// The Trial controller will append all parameter assignments here.
	// Users are expected to inject these hyperparameters into their own jobs.
	// ParameterAssignments should match ParameterSpecs.
	ParameterAssignments []common.ParameterAssignment `json:"parameterAssignments"`
	// Users are expected to retrieve metrics from their own jobs and write them to files.
	Metrics []MetricFile `json:"metrics,omitempty"`
	// Users are expected to implement how to create, control and monitor their own jobs in this image.
	Image string `json:"image,omitempty"`
}

type MetricFile struct {
	Name string `json:"name,omitempty"`
	Path string `json:"path,omitempty"`
}

type HyperTunableStatus struct {
	StartTime         *metav1.Time            `json:"startTime,omitempty"`
	CompletionTime    *metav1.Time            `json:"completionTime,omitempty"`
	LastReconcileTime *metav1.Time            `json:"lastReconcileTime,omitempty"`
	Conditions        []HyperTunableCondition `json:"conditions,omitempty"`
}

type HyperTunableCondition struct {
	Type               HyperTunableConditionType `json:"type"`
	Status             v1.ConditionStatus        `json:"status"`
	Reason             string                    `json:"reason,omitempty"`
	Message            string                    `json:"message,omitempty"`
	LastUpdateTime     metav1.Time               `json:"lastUpdateTime,omitempty"`
	LastTransitionTime metav1.Time               `json:"lastTransitionTime,omitempty"`
}

type HyperTunableConditionType string

const (
	HyperTunableRunning   HyperTunableConditionType = "Running"
	HyperTunableFailed    HyperTunableConditionType = "Failed"
	HyperTunableSucceeded HyperTunableConditionType = "Succeeded"
)

type HyperTunable struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   HyperTunableSpec   `json:"spec,omitempty"`
	Status HyperTunableStatus `json:"status,omitempty"`
}

@gaocegege
Member Author

@czheng94 Thanks for your comment.

I have one question about your proposal: Where does the user define the trial job? Is it in TrialTemplate.Tfjob?

@czheng94

czheng94 commented Apr 27, 2020

@gaocegege yes.
The advantage of that is that you get strongly-typed job specs, which makes users' lives easier.
And if we are doing that, we should also make those template choices mutually exclusive (between TrialTemplate.TfJob and TrialTemplate.PytorchJob, etc.), maybe with a CR validation webhook.

TrialTemplate.HyperTunable for arbitrary subresource management also aligns with this goal in the 2020 roadmap: https://github.com/kubeflow/katib/blame/master/ROADMAP.md#L25

@andreyvelich
Member

andreyvelich commented May 6, 2020

After discussing the Trial template, we have a few suggestions.

Using Tekton might be a problem in Katib. For Tekton tasks we would still need to define specifications for the different jobs (TFJob, MPIJob, PyTorchJob), since they have different API structures. Also, using Tekton tasks as the Trial template would require huge Katib controller changes. Another point: Tekton is not part of the Kubeflow project, which makes it harder to follow Tekton API changes, etc.

We can implement each job independently, as mentioned above. KFServing also follows this approach: https://github.com/kubeflow/kfserving/blob/master/pkg/apis/serving/v1alpha2/inferenceservice_types.go#L93. This makes the user's life much easier: the user just needs to know how to define a TF/PyTorch job and to say which parameters need to be tuned.

Define ParametersInjectPolicy to make it possible to add parameters as env variables or as args. I am not sure whether we want to add parameters to command; currently, we add all commands to args: https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1alpha3/pod/inject_webhook.go#L322.

Define ParametersFormat, where the user can write a regex describing how the parameters should be passed, for example --lr=0.01 or --lr 0.01. This can be done only for ParametersInjectPolicy=args, and it is close to the current metrics collector metricsFormat.
We will also keep the ability to write the Trial template using a GoTemplate.

The structure can be something like this:

type TrialTemplate struct {
	// Retain saves Trial job in completed state (e.g read logs)
	Retain                 bool                       `json:"retain,omitempty"`
	GoTemplate             *GoTemplate                `json:"goTemplate,omitempty"`
	ParametersInjectPolicy ParametersInjectPolicyType `json:"parametersInjectPolicy,omitempty"`
	ParametersFormat       string                     `json:"parametersFormat,omitempty"`
	JobSpec                *v1.JobSpec                `json:"jobSpec,omitempty"`
	TFJobSpec              *tfv1.TFJobSpec            `json:"tfJobSpec,omitempty"`
	PytorchJobSpec         *pytorchv1.PyTorchJobSpec  `json:"pytorchJobSpec,omitempty"`
	MPIJobSpec             *mpiv1.MPIJobSpec          `json:"mpiJobSpec,omitempty"`
}

type ParametersInjectPolicyType string

const (
	ArgInjectPolicy ParametersInjectPolicyType = "args"
	EnvInjectPolicy ParametersInjectPolicyType = "env"
)
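For illustration, one way ParametersFormat could be applied for the args policy, sketched with a printf-style format rather than a regex for simplicity; the default and helper below are assumptions, not a settled design.

package inject

import "fmt"

// formatParameter renders one assignment according to a user-supplied
// ParametersFormat, e.g. "%s=%s" -> "--lr=0.01" and "%s %s" -> "--lr 0.01".
func formatParameter(format, name, value string) string {
	if format == "" {
		format = "%s=%s" // hypothetical default
	}
	return fmt.Sprintf(format, name, value)
}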

Also, do we want to support reading the template from a ConfigMap?

@johnugeorge @gaocegege Please add if I missed something.
What do you think @jlewi ?

@issue-label-bot

Issue Label Bot is not confident enough to auto-label this issue.
See dashboard for more details.

@gaocegege
Member Author

LGTM. I think we can come to an agreement.

@jlewi
Contributor

jlewi commented May 7, 2020

Thanks. Some questions (not blocking; please consider me just a bystander).

  • Why have a single TrialTemplate resource with a union of types as opposed to multiple CR's all adhering to the same model? e.g.

    • type TFJobTemplate {}
    • type PyTorchTemplate{}
  • Why do we need ParametersInjectPolicy? Can we follow what tekton does and then just look for strings $(parameter.name) in various places in the spec (environment variables, arguments, image, etc...) and do substitution?

  • Should we make the definition of the parameters part of the template e.g.

    TfJobTrialTemplate  {
           spec:
              params:
                 - name: num_layers
                    type: int
                    description: The number of layers
                    default: 5
              tfjobspec:
                  - args:
                    - --num_layers=$(params.num_layers)
     }
    
  • In other words have you thought about using duck typing to make it more extensible? i.e. does Katib need to know about each type of resource or could it be made to use any resource that adheres to some resource model (e.g. has fields like params)

@andreyvelich
Member

Thanks. Some questions (not blocking; please consider me just a bystander).

  • Why have a single TrialTemplate resource with a union of types as opposed to multiple CR's all adhering to the same model? e.g.

    • type TFJobTemplate {}
    • type PyTorchTemplate{}

That is because the operators' specifications have different structures. For example, compare the TF Operator (https://github.com/kubeflow/tf-operator/blob/master/pkg/apis/tensorflow/v1/types.go#L47) and the MPI Operator (https://github.com/kubeflow/mpi-operator/blob/master/pkg/apis/kubeflow/v1/types.go#L40). I think we want to support all operator functionality in the Trial template.

  • Why do we need ParametersInjectPolicy? Can we follow what tekton does and then just look for strings $(parameter.name) in various places in the spec (environment variables, arguments, image, etc...) and do substitution?
  • Should we make the definition of the parameters part of the template e.g.
    TfJobTrialTemplate  {
           spec:
              params:
                 - name: num_layers
                    type: int
                    description: The number of layers
                    default: 5
              tfjobspec:
                  - args:
                    - --num_layers=$(params.num_layers)
     }
    

In that case, wouldn't we again have to convert tfjobspec to a GoTemplate and make the appropriate substitutions? What do you think about it @johnugeorge @gaocegege?

@jlewi
Contributor

jlewi commented May 7, 2020

That because operators specification have different structure. For example, TF Operator

So there are two ways to support this

  1. Have a single Trial resource that is a union of all operator types

  2. Have multiple Trial resource types; 1 for each operator

    • Your experiment could then be a reference to the specific resource
    • The experiment controller could then use duck typing so it doesn't need to have logic to support every specific type of trial; instead it can work with any resource that conforms to certain patterns
    • You could use a single controller or multiple controllers to handle the initial set of CRDs (I think using a single go binary to support multiple CRD's is common).

I think the main advantage of having multiple Trial resources is that it's more scalable in terms of code ownership and decision making.

For example, suppose I think Katib should have first-class support for Tekton Tasks or Argo Workflows, but the Katib team thinks it's not right, or doesn't have the cycles to implement it or even review the code contributions.

In the multiple-CR world I can go and implement TektonTrialTemplate and ArgoTrialTemplate, and as long as they conform to the Katib Trial template resource model (e.g. have spec.params and respect certain conditions), ideally they should just work with Katib.
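As a sketch of that duck-typing idea: the TrialDuck shape below is hypothetical, mirroring "has spec.params and respects certain conditions", and only assumes k8s.io/api and k8s.io/apimachinery.

package duck

import (
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// TrialDuck captures only the fields an experiment controller would care
// about; any CR that carries them (TektonTrialRun, ArgoTrialRun, ...) could
// then be driven without Katib knowing its full schema.
type TrialDuck struct {
	Spec struct {
		Params []struct {
			Name  string `json:"name"`
			Value string `json:"value,omitempty"`
		} `json:"params,omitempty"`
	} `json:"spec,omitempty"`
	Status struct {
		Conditions []struct {
			Type               string                 `json:"type"`
			Status             corev1.ConditionStatus `json:"status"`
			LastTransitionTime metav1.Time            `json:"lastTransitionTime,omitempty"`
		} `json:"conditions,omitempty"`
	} `json:"status,omitempty"`
}

// FromUnstructured reads any resource into the duck shape, ignoring every
// field the controller does not understand.
func FromUnstructured(u *unstructured.Unstructured) (*TrialDuck, error) {
	raw, err := json.Marshal(u.Object)
	if err != nil {
		return nil, err
	}
	d := &TrialDuck{}
	if err := json.Unmarshal(raw, d); err != nil {
		return nil, err
	}
	return d, nil
}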

It looks like the issue might be that today you are inlining trialTemplate into the Experiment (see the example).

How about adding support for references? e.g.

apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
trialTemplateRef:
   - apiVersion: tekton.dev/v1alpha1
     kind: Pipeline
     name: my-pipeline

You would need an extensible mechanism(s) to convert templates to actual instances. You could bake in support for the predefined types you want to support (TFJob, MPI, etc...).

To make it extensible, you could do something like use an annotation to point at a lambda func(params, template) that returns a hydrated K8s object.

@andreyvelich
Member

You would need an extensible mechanism(s) to convert templates to actual instances. You could bake in support for the predefined types you want to support (TFJob, MPI, etc...).

To make it extensible you do something like use an annotation to point at a lambda func(params, template) that returns a hydrated K8s object.

Is it common in Kubernetes to use an API resource as a lambda function to support any kind of operator/job that follows the Katib Trial template model?
Can you give an example of how the API would look in that case?

It looks like the issue might be that today you are inlining trialTemplate into the Experiment.

Yes, because the Trial template is part of the Experiment API.

@gaocegege
Member Author

How about adding support for references? e.g.

apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
trialTemplateRef:
   - apiVersion: tekton.dev/v1alpha1
     kind: Pipeline
     name: my-pipeline

To make it extensible, you could do something like use an annotation to point at a lambda func(params, template) that returns a hydrated K8s object.

In this case, I am not sure how to define the spec of the TFJob.

@jlewi
Contributor

jlewi commented May 8, 2020

Is it common in Kubernetes to use an API resource as a lambda function to support any kind of operator/job that follows the Katib Trial template model?

I think duck typing is a common pattern. Here's a doc from knative. I think the problem statement is a really good explanation of why it might be valuable for Katib.

In particular, I would like to promote loose coupling between Katib and other bits of Kubeflow (e.g. the Job operators).

I'm not aware of anyone using lambdas. The model Tekton follows is to have 2 distinct resources for each template:

  1. A Template (Task, Pipeline)
  2. A Run (TaskRun, PipelineRun) - These are instantiations of the template

We could follow the same pattern; e.g.

  • TfJobTrialTemplate
  • TFJobTrialRun

So Katib would create TFJobTrialRuns from TFJobTrialTemplate. This CR would in turn create TFJobs to execute the actual jobs and collect the status and outputs.

In this case, I am not sure how to define the spec of the TFJob
The TFJobSpec would be embedded in the spec of TFJobTrialTemplate. This would be the usual spec and would be typed but could support using the $(params.) notation to allow substitution of template parameters into various locations in the spec.

apiversion: kubeflow.org
kind: tfjobtrialtemplate
spec:
   params:
       - name: numlayers
          type: int
   tfjobspec:
       metadata: ...
       spec:
           tfReplicaSpecs: 
              ps: ...
              worker: ....

@andreyvelich
Member

We could follow the same pattern; e.g.

  • TfJobTrialTemplate
  • TFJobTrialRun

So Katib would create TFJobTrialRuns from TFJobTrialTemplate. This CR would in turn create TFJobs to execute the actual jobs and collect the status and outputs.

What are the advantages for Katib of creating a separate resource for each job, e.g. TFJobTrialRun for the TF Operator and PyTorchJobTrialRun for the PyTorch Operator?

A Trial in Katib has a single structure, and the difference between the various operators is only in the RunSpec: https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/trials/v1alpha3/trial_types.go#L35.

@jlewi
Contributor

jlewi commented May 8, 2020

What are the advantages for Katib of creating a separate resource for each job, e.g. TFJobTrialRun for the TF Operator and PyTorchJobTrialRun for the PyTorch Operator?

The advantage is to make Katib loosely coupled with the individual operators. Right now it is not loosely coupled; i.e., Katib needs to be explicitly updated to add support for each type of resource.

A Trial in Katib has a single structure, and the difference between the various operators is only in the RunSpec:

I'm not familiar enough with the specifics of your API to know where you separate things out, so maybe it's not Trial but RunSpec. I'm a little confused because I don't see Experiment in trial_types.go, and I thought it was the experiment, not the trial, that specified the trial template and the search space.

@jlewi
Contributor

jlewi commented May 8, 2020

So looking at the current example:
https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha3/random-example.yaml

It looks like the custom resource is Experiment, which embeds TrialTemplate. If we follow that, I would suggest changing it to something like:

apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  labels:
    controller-tools.k8s.io: "1.0"
  name: random-example
spec:
  objective: ...
  algorithm: ...
  trialTemplateRef:
    - apiVersion: kubeflow.org/v1alpha1
      kind: TfJobTrialTemplate
      name: mnist-tfjob

The experiment controller would then create TfJobTrialRuns from the TFJobTrialTemplate for each trial.

I think this is more extensible because the Experiment controller can use duck typing to support whatever Template/Runs a user has installed on their cluster.

So for example if someone wants to use Argo Workflows. They could just implement ArgoTrialTemplate and ArgoTrialRun and install it on their cluster.

@andreyvelich
Member

I'm a little confused because I don't see Experiment in trial_types.go and I thought it was the experiment and not the trial that specified the trial and the search space.

It is because each Experiment CR owns the Trial resources it creates.
In the Experiment, the user defines the search space, the algorithm, and the specification of the training job (the Trial template).

The experiment controller would then create TfJobTrialRuns from the TFJobTrialTemplate for each trial.

In that case, would these custom resources have to be installed in the cluster in advance, e.g. TfJobTrialRuns for TFJobs? Or am I misunderstanding something?
Also, I still can't see how the API would look if we want to support any user resource (Argo Workflows, Pipelines, TFJobs, etc.) and we don't want to use a string template (RawTemplate string).

@jlewi
Contributor

jlewi commented May 12, 2020

@andreyvelich Do you want to add this to the agenda for an upcoming community meeting so we can discuss it in person?

It is because each Experiment CR owns the Trial resources it creates.
In the Experiment, the user defines the search space, the algorithm, and the specification of the training job (the Trial template).

Ok so TrialTemplate is a field in Experiment

In that case, would these custom resources have to be installed in the cluster in advance?

Yes

Also, I still can't see how the API would look if we want to support any user resource (Argo Workflows, Pipelines, TFJobs, etc.) and we don't want to use a string template (RawTemplate string).

The API for a TrialTemplate for TFJob looks like the following

apiversion: kubeflow.org
kind: tfjobtrialtemplate
spec:
   params:
       - name: numlayers
          type: int
   tfjobspec:
       metadata: ...
       spec:
           tfReplicaSpecs: 
              ps: ...
              worker: ....

The Go type is roughly:

   type TFJobTrialTemplateSpec struct {
         // params, etc.
         Spec tfv1.TFJobSpec
   }

So the Spec is typed.

So I think what I'm proposing isn't that different from what you have today. Basically I'm suggesting the following

  1. Instead of making TrialTemplate a field in Experiment, have Experiment contain a reference to a TrialTemplateResource

  2. Instead of Having 1 TrialTemplate Resource that is a union of N types (GoTemplate, TFJob, etc...) have N TrialTemplate Resources each of which contains a single type of Resource.

  3. Instead of having 1 Trial resource that handles N different types have N trial resources each of which handles 1 type of resource.

@andreyvelich
Member

andreyvelich commented May 12, 2020

@jlewi Sure, let's discuss it on Kubeflow meeting.
/cc @gaocegege @johnugeorge

@andreyvelich
Member

I was checking the Knative duck-typing implementation, and I think there can be one more problem with it.
We can't specify a single JSON path in a JSON patch that works across the various Kubeflow resources (http://jsonpatch.com/#the-patch).

For example, to reach the container for a TFJob we need to go to tfReplicaSpecs.Worker.template.spec.containers[0], but for a PyTorchJob it is
pytorchReplicaSpecs.Worker.template.spec.containers[0].

If you take a look at Knative (https://github.com/knative/pkg/blob/master/apis/duck/v1/implements_test.go#L53), they use VerifyType only where changes can be applied directly to the field.

It is the same story in Kubernetes (https://github.com/kubernetes/api/blob/e771f807/core/v1/types.go#L3179-L3190): PodTemplateSpec is the common sub-resource for each Kubernetes workload resource (ReplicaSet, Job, Deployment).
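A small illustration of that path problem (assuming apimachinery's unstructured helpers; the field paths follow the operator APIs and may differ by version):

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// containersPath returns where the training containers live for a given kind;
// there is no single JSON path that covers both operators.
func containersPath(kind string) []string {
	switch kind {
	case "TFJob":
		return []string{"spec", "tfReplicaSpecs", "Worker", "template", "spec", "containers"}
	case "PyTorchJob":
		return []string{"spec", "pytorchReplicaSpecs", "Worker", "template", "spec", "containers"}
	default:
		return nil
	}
}

func main() {
	trial := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "kubeflow.org/v1",
		"kind":       "TFJob",
	}}
	// NestedSlice simply reports "not found" here because the spec is empty.
	containers, found, err := unstructured.NestedSlice(trial.Object, containersPath(trial.GetKind())...)
	fmt.Println(containers, found, err)
}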

I was thinking about another way of supporting various resources.

Maybe in Experiment API define:

type TrialTemplate struct {
	TrialParams []TrialParamsSpec          `json:"trialParams,omitempty"`
	TrialSpec   *unstructured.Unstructured `json:"trialSpec,omitempty"`
}

type TrialParamsSpec struct {
	Name        string `json:"name,omitempty"`
	Description string `json:"description,omitempty"`
}

In the Trial API we could just have ObjectMeta instead of RunSpec (https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/trials/v1alpha3/trial_types.go#L35), similar to how it is done in the k8s API: https://github.com/kubernetes/api/blob/master/core/v1/types.go#L3590. Or we could keep the whole TrialSpec as *unstructured.Unstructured to show the entire created resource.

In k8s, an Unstructured object must contain valid YAML and a Kind; we can also validate apiVersion.
So we can submit any job from the Trial template. We can add validation for the common resources (Job, TFJob, PyTorchJob, etc.) to make sure the user defines these resources correctly.

Also, we will add TrialParamsSpec to the API, which @jlewi proposed here: #906 (comment).

So we can:

  • Convert TrialSpec Unstructured to JSON
  • Replace all required parameters from TrialParams
  • Convert JSON to unstructured
  • Submit unstructured Job with appropriate Version and Kind
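
For illustration, a rough sketch of this convert, replace, convert, submit flow (assuming sigs.k8s.io/yaml and apimachinery's unstructured package; the helper name is hypothetical, and the placeholder syntax mirrors the example below):

package trial

import (
	"fmt"
	"strings"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/yaml"
)

// substituteTrialParams hydrates a TrialSpec by replacing ${trialParams.<name>}
// placeholders with the suggested values for one trial.
func substituteTrialParams(trialSpec *unstructured.Unstructured, assignments map[string]string) (*unstructured.Unstructured, error) {
	// 1. Convert the TrialSpec Unstructured to a textual form.
	raw, err := yaml.Marshal(trialSpec.Object)
	if err != nil {
		return nil, err
	}
	// 2. Replace all required parameters from TrialParams.
	rendered := string(raw)
	for name, value := range assignments {
		rendered = strings.ReplaceAll(rendered, fmt.Sprintf("${trialParams.%s}", name), value)
	}
	// 3. Convert back to Unstructured; apiVersion and kind are preserved, so the
	//    controller can submit it with a dynamic client.
	out := &unstructured.Unstructured{}
	if err := yaml.Unmarshal([]byte(rendered), &out.Object); err != nil {
		return nil, err
	}
	return out, nil
}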

I was testing Unstructured as an API field in a Kubernetes CRD; it works and we can parse it.

For the metrics collector I believe it will also work, since we mutate the created Trial pod, not the Trial resource.

This approach can help us with supporting any type of resource in Trial Template.

This is an example of an Experiment:

apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: tfjob-example
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy_1
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /train
        kind: Directory
    collector:
      kind: TensorFlowEvent
  parameters:
    - name: --learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: --batch_size
      parameterType: int
      feasibleSpace:
        min: "100"
        max: "200"
  trialTemplate:
    trialParams:
      - name: --learning_rate
        description: Learning Rate
      - name: --batch_size
        description: Batch Size
      - name: trial_name
        description: name of the trial
    trialSpec:
      apiVersion: "kubeflow.org/v1"
      kind: TFJob
      metadata:
        name: ${trialParams.trial_name}
        namespace: kubeflow
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                  - name: tensorflow
                    image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
                    imagePullPolicy: Always
                    command:
                      - "python"
                      - "/var/tf_mnist/mnist_with_summaries.py"
                      - "--log_dir=/train/metrics"
                      - "--learning_rate=${trialParams.--learning_rate}"
                      - "--batch_size=${trialParams.--batch_size}"

What do you think @jlewi @johnugeorge @gaocegege ?

@gaocegege
Member Author

LGTM. I prefer this approach. 🎉

@jlewi
Contributor

jlewi commented May 24, 2020

Looks good

@czheng94

Great to see the PR merged 🎉!! This is so exciting.
So what's next?
Extending TrialTemplate with custom settings to get the state of the job, so that any Kubernetes CRD can be supported?

@andreyvelich
Member

@czheng94 Thank you! Yes, I just created an issue: #1214.

@andreyvelich
Member

We can close this; we are using the new Trial template now: #1208.
