[Feature] Modify Job provider to support any kind of Kubernetes CRDs #1214

andreyvelich · 2020-06-10T17:31:30Z

/kind feature

After migrating to the new Trial Template design (#1202), we want to extend current Job provider to support any kind of Kubernetes CRDs that follow Trial job patterns (e.g Argo template: #1081).

Currently, Job provider supports only batch Jobs and Kubeflow Jobs.

We can extend Trial Template API with the custom settings to define:

How to get succeeded state of the Job.
How to properly mutate metrics collector on training pod.
How to specify sidecar.istio.io/inject: false annotation. (That can be done by user in advance).

Maybe we should define something more, this needs to be investigated.

Let's discuss about all required changes in this issue.

/cc @gaocegege @johnugeorge @czheng94 @nielsmeima

The text was updated successfully, but these errors were encountered:

issue-label-bot · 2020-06-10T17:31:39Z

Issue-Label Bot is automatically applying the labels:

Label	Probability
area/katib	0.80

Please mark this comment with 👍 or 👎 to give our bot feedback!
Links: app homepage, dashboard and code for this bot.

issue-label-bot · 2020-06-10T17:31:39Z

Issue Label Bot is not confident enough to auto-label this issue.
See dashboard for more details.

nielsmeima · 2020-06-10T17:59:15Z

I would like to contribute a bit to point 2 you mentioned @andreyvelich. I implemented support for the Argo Workflow CRD using the v1alpha3 Provider interface and discovered some issues which are also relevant to the new proposed API (great work btw, looks very exciting).

When running an Argo Workflow with a DAG template (https://argoproj.github.io/docs/argo/examples/readme.html#dag) each pod is started with two containers: wait and main. The wait container is backed by the argoexec image and can be considered a side-car to the main container, which is the "training" container. In the current implementation we can mark a single container in a pod the "training" container using the IsTrainingContainer func on the Provider interface. This leads to an issue in pns.go, where the PID of the wait / argoexec container is watched for completion. However, since this container is never marked to be watched for completion in inject_webhook (getMarkCompletedCommand), the watch loop crashes when trying to read the /var/log/katib/<PID>.pid file.

I currently added a simple addition to the filter statement in pns.go (pid == 1 || pid == thisPID || proc.PPid() != 0 || proc.Executable() == "argoexec") as a temporary solution. However, it is clear that we should introduce a mechanism to account for this.

I have thought of the following potential solutions:

Simply wrap all processes as we currently do with containers marked as training containers. This approach would be compatible with other tool that are integrating too: they might control their own sidecars. We simply watch till every sidecar completes before deleting the CRD in question.
Filter all containers (similar to my temporary solution above, but more general) except for the training container. This approach is not preferred, because the status of other (potentially important) sidecars is ignored.
Introduce a field in the new Trial spec API which allows a list of containers to be specified which must be watched for completion.

Furthermore, since we now want to be able to watch for multiple PIDs to run to completion in pns.go we want to set waitAll to true. However, there is a bug in the loop which watches for completion of all PIDs. I can make a PR for a fix, since I had to already fix it for my own Argo implementation as well. The bug can be seen by comparing pns.py and pns.go. In the case of the golang version, once a single PID has completed, the error condition at the bottom is reached instead of also watching completion for the other containers. The solution would be to add an else statement:

if len(finishedPids) == len(pids) {
    return nil
} else {
    break
}

and a statement at the beginning of the loop to filter out PIDs which are already completed:

var skip = false
for _, finishedPid := range finishedPids {
    if finishedPid == pid {
        skip = true
        break
    }
}

if skip {
    continue
}

If you have any questions or if I am being unclear, please let me know.

andreyvelich · 2020-06-11T00:26:03Z

I would like to contribute a bit to point 2 you mentioned @andreyvelich. I implemented support for the Argo Workflow CRD using the v1alpha3 Provider interface and discovered some issues which are also relevant to the new proposed API (great work btw, looks very exciting).

Thank you! You are more than welcome to contribute. This is great that Argo Workflow works with Katib.
I think any new changes that we will make should be done in v1beta1. We will release v1alpha3 soon.

I currently added a simple addition to the filter statement in pns.go (pid == 1 || pid == thisPID || proc.PPid() != 0 || proc.Executable() == "argoexec") as a temporary solution. However, it is clear that we should introduce a mechanism to account for this.

Did you check what was the list in pids. Why do you need to add proc.Executable() == "argoexec"?

I have thought of the following potential solutions:

Simply wrap all processes as we currently do with containers marked as training containers. This approach would be compatible with other tool that are integrating too: they might control their own sidecars. We simply watch till every sidecar completes before deleting the CRD in question.

I'd like this approach and I have the same suggestion in my mind. I think metrics collector should wait until all processes will be succeeded. Only few concerns that I see here:

What about Init containers. Will we see the init containers processes in the proc list, or they will be succeeded when we are injecting metrics collector container?
Is there any case, when any other processes will be running on Pod ones main training job is finished? How we should indicate them and delete from proc list?

Filter all containers (similar to my temporary solution above, but more general) except for the training container. This approach is not preferred, because the status of other (potentially important) sidecars is ignored.

I agree. We should think about more general solution here. Maybe we can define few rules for the user's CRD.

Introduce a field in the new Trial spec API which allows a list of containers to be specified which must be watched for completion.

I think we need to add additional field that represents training container. For example, user can have Argo workflow as Trial Template and only first step must be training. I can see that container in Argo is just a simple K8s container: https://github.com/argoproj/argo/blob/master/pkg/apis/workflow/v1alpha1/workflow_types.go#L412. Can you specify name there, if you don't want to name your container main as you mentioned ?

Furthermore, since we now want to be able to watch for multiple PIDs to run to completion in pns.go we want to set waitAll to true. However, there is a bug in the loop which watches for completion of all PIDs. I can make a PR for a fix, since I had to already fix it for my own Argo implementation as well. The bug can be seen by comparing pns.py and pns.go. In the case of the golang version, once a single PID has completed, the error condition at the bottom is reached instead of also watching completion for the other containers. The solution would be to add an else statement:

Sure, feel free to submit the PR.

gaocegege · 2020-07-15T01:38:43Z

We will have a meeting at https://www.timeanddate.com/worldclock/meetingdetails.html?year=2020&month=6&day=30&hour=13&min=0&sec=0&p1=136&p2=438&p3=237&p4=179

Feel free to join if you are interested in the feature.

andreyvelich · 2020-07-15T11:38:35Z

Meeting will be on 16th of July at the above time (1pm UTC).
Join #katib Slack channel for more information.

czheng94 · 2020-07-16T12:59:34Z

Some thoughts I have:

How to get succeeded state of the Job

The original logic is here
and here

To support arbitrary CRD in TrialTemplate, we need to make this "succeeded" condition
configurable:

type TrialTemplate struct {
	// Retain indicates that Trial resources must be not cleanup
	Retain bool `json:"retain,omitempty"`

	// Source for Trial template (unstructured structure or config map)
	TrialSource `json:",inline"`

	// List of parameres that are used in Trial template
	TrialParameters []TrialParameterSpec `json:"trialParameters,omitempty"`
	
	// List of Custom CRD condition names that represents succeeded status 
	SucceededConditions []string
}

How to properly mutate metrics collector on training pod?

The crucial step seems to be determining whether or not a pod is master role,
The original logic is here

We can also provide a MasterRoleLabel option in the TrialTemplate:

type TrialTemplate struct {
	// Retain indicates that Trial resources must be not cleanup
	Retain bool `json:"retain,omitempty"`

	// Source for Trial template (unstructured structure or config map)
	TrialSource `json:",inline"`

	// List of parameres that are used in Trial template
	TrialParameters []TrialParameterSpec `json:"trialParameters,omitempty"`
	
	// List of Custom CRD condition names that represents succeeded status 
	SucceededConditions []string
	
	// Labels that determines whether or not pods are
	MasterRoleLabels map[string]string
}

How to specify sidecar.istio.io/inject: false annotation.

We can let users take care of this by themselves. I believe it's less "invasive" to their custom CRD.

andreyvelich · 2020-07-16T22:41:19Z

@czheng94 @gaocegege @johnugeorge Thanks for attending Katib meeting today and driving this.

From my understanding, these steps should cover new feature:

- Trial controller must Watch for the each CRs that user wants to use in Template, like now it watches for TFJob and PytorchJob.

For that, we add additional flag to Katib controller, which represents resources that can be used in Trial template. Trial controller iterates over these params and create watchers.

For example, if Trial can run TFJob, Argo Workflow and k8s Batch Jobs, Katib controller flags must be:

args:
  - "-webhook-port=8443"
  - "-trial-resource=TFJob.v1.kubeflow.org".
  - "-trial-resource=Workflow.v1alpha1.argoproj.io"
  - "-trial-resource=Job.v1.batch"

- Find the main Pod that needs to be injected from the distributive CR run (e.g. TFJob). In the other words, refactor isMasterRole.

As @czheng94 proposed above, we add additional parameter to TrialTemplate where we define label name and label value for the master Pod. We need to rename "master" to another name that will be not only specific for training Kubeflow operators.

@gaocegege @johnugeorge For the API do want to define this parameter as map[string]string or having similar to metricsStrategies with Name, Value ?

- Do we want to support mutating sidecar container on more than one Pod simultaneously? (Can be done later).
- Specify the training container name in TrialTemplate.

Sidecar container needs to know which container is the training job. We can send this information using API.

API can looks like this:

type TrialTemplate struct {
         ...
	containerName string `json:"containerName,omitempty"`
         ...
}

- Metrics collector should wait until all other processes on the Pod will be finished before starting parsing.

As @nielsmeima mentioned above using this approach can help to avoid other logic that can be implemented inside the CR. Metrics collector should start only after all processes are finished.

We need to validate that distributive training with activating more than one process (e.g https://docs.fast.ai/distributed.html#launch-your-training) also works.

- Get succeeded state for the custom CR.

As @czheng94 mentioned above, we update API with:

type TrialTemplate struct {
         ...
	SucceededCondition string `json:"succeededCondition,omitempty"`
         ...
}

We verifies that condition with type=SucceededCondition and status=True exists in .status.conditions of custom CR.

- Add sidecar.istio.io/inject: false label, that I mentioned before, to all Katib examples.

For new Experiments users can do it manually.

I think every step can be in separate PR. We need to make sure that current Katib controller logic works and implement new feature above of it. Later we can clean-up redundant code.

What do you think @gaocegege @johnugeorge @czheng94 ?
Please, let me know if I missed something.

/cc @jlewi @sperlingxx @nielsmeima

gaocegege · 2020-07-17T00:36:05Z

Can we have a proposal for the feature? Ppl like Jeremy and @sperlingxx may be interested in it too.

andreyvelich · 2020-07-17T22:17:05Z

@gaocegege Ok, I will submit proposal.

andreyvelich · 2020-09-09T16:08:51Z

@nielsmeima Can you give more insights how you were using Argo workflows with Katib, please?

The problem that I can see that wait container deletes all sidecar once main is finised (ref: https://github.com/argoproj/argo/blob/master/cmd/argoexec/commands/wait.go#L33).

We injected metrics collector as a sidecar and it will be deleted once main is finished, whether we Wait for all processes or wait only for the main process.

andreyvelich · 2020-10-16T23:29:41Z

This feature is implemented in upstream.

k8s-ci-robot added the kind/feature label Jun 10, 2020

issue-label-bot bot added the area/katib label Jun 10, 2020

andreyvelich mentioned this issue Jun 10, 2020

[feature] Reconsider the design of Trial Template #906

Closed

andreyvelich mentioned this issue Jul 18, 2020

Proposal: Support custom CRD in Trial Job #1273

Merged

This was referenced Aug 14, 2020

Custom CRD: Set dynamic watch from controller flags #1302

Merged

Custom CRD: Add primary pod labels #1305

Merged

Custom CRD: Support dynamic Trial's jobs conditions #1307

Merged

Custom CRD: Add primary container name #1308

Merged

This was referenced Aug 21, 2020

[Release 1.2] Feature Planning / Roadmap kubeflow/kubeflow#5224

Closed

Custom CRD: Wait for all processes before running metrics collector #1313

Merged

andreyvelich mentioned this issue Sep 2, 2020

Refactoring Supported Job List #1320

Closed

This was referenced Sep 15, 2020

Remove istio sidecar annotation from all examples #1338

Merged

Add Tekton Pipeline example #1339

Merged

andreyvelich closed this as completed Oct 16, 2020

andreyvelich mentioned this issue Oct 17, 2020

[Release] Katib 0.10 release for Kubeflow 1.2 #1360

Closed

12 tasks

jbottum mentioned this issue Nov 19, 2020

Kubeflow 1.2 Blog post kubeflow/community#455

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature] Modify Job provider to support any kind of Kubernetes CRDs #1214

[Feature] Modify Job provider to support any kind of Kubernetes CRDs #1214

andreyvelich commented Jun 10, 2020

issue-label-bot bot commented Jun 10, 2020

issue-label-bot bot commented Jun 10, 2020

nielsmeima commented Jun 10, 2020 •

edited

Loading

andreyvelich commented Jun 11, 2020

gaocegege commented Jul 15, 2020

andreyvelich commented Jul 15, 2020

czheng94 commented Jul 16, 2020

andreyvelich commented Jul 16, 2020

gaocegege commented Jul 17, 2020

andreyvelich commented Jul 17, 2020

andreyvelich commented Sep 9, 2020

andreyvelich commented Oct 16, 2020

[Feature] Modify Job provider to support any kind of Kubernetes CRDs #1214

[Feature] Modify Job provider to support any kind of Kubernetes CRDs #1214

Comments

andreyvelich commented Jun 10, 2020

issue-label-bot bot commented Jun 10, 2020

issue-label-bot bot commented Jun 10, 2020

nielsmeima commented Jun 10, 2020 • edited Loading

andreyvelich commented Jun 11, 2020

gaocegege commented Jul 15, 2020

andreyvelich commented Jul 15, 2020

czheng94 commented Jul 16, 2020

andreyvelich commented Jul 16, 2020

gaocegege commented Jul 17, 2020

andreyvelich commented Jul 17, 2020

andreyvelich commented Sep 9, 2020

andreyvelich commented Oct 16, 2020

nielsmeima commented Jun 10, 2020 •

edited

Loading