
[Feature] Modify Job provider to support any kind of Kubernetes CRDs #1214

Closed
andreyvelich opened this issue Jun 10, 2020 · 12 comments

@andreyvelich
Member

/kind feature

After migrating to the new Trial Template design (#1202), we want to extend the current Job provider to support any kind of Kubernetes CRD that follows the Trial job pattern (e.g. Argo templates: #1081).

Currently, the Job provider supports only batch Jobs and Kubeflow Jobs.

We can extend the Trial Template API with custom settings that define:

  1. How to get the succeeded state of the Job.
  2. How to properly mutate the metrics collector on the training pod.
  3. How to specify the sidecar.istio.io/inject: false annotation (this can be done by the user in advance).

Maybe we should define something more; this needs to be investigated.
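
To make the discussion concrete, the extended template could look roughly like this (every new field name below is a hypothetical placeholder, not a final API):

trialTemplate:
  trialSpec:
    # arbitrary CRD, e.g. an Argo Workflow, goes here
  succeededCondition: Succeeded        # 1. how to detect completion (hypothetical)
  trainingContainerName: training      # 2. where to mutate the metrics collector (hypothetical)
  # 3. the sidecar.istio.io/inject: "false" annotation stays in the user's CRD itself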

Let's discuss all required changes in this issue.

/cc @gaocegege @johnugeorge @czheng94 @nielsmeima


@nielsmeima

nielsmeima commented Jun 10, 2020

I would like to contribute a bit to point 2 you mentioned @andreyvelich. I implemented support for the Argo Workflow CRD using the v1alpha3 Provider interface and discovered some issues which are also relevant to the new proposed API (great work btw, looks very exciting).

When running an Argo Workflow with a DAG template (https://argoproj.github.io/docs/argo/examples/readme.html#dag), each pod is started with two containers: wait and main. The wait container is backed by the argoexec image and can be considered a sidecar to the main container, which is the "training" container. In the current implementation we can mark a single container in a pod as the "training" container using the IsTrainingContainer func on the Provider interface. This leads to an issue in pns.go, where the PID of the wait / argoexec container is watched for completion. However, since this container never gets the mark-completed command from inject_webhook (getMarkCompletedCommand), the watch loop crashes when trying to read the /var/log/katib/<PID>.pid file.

I currently added a simple addition to the filter statement in pns.go (pid == 1 || pid == thisPID || proc.PPid() != 0 || proc.Executable() == "argoexec") as a temporary solution. However, it is clear that we should introduce a mechanism to account for this.
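
For reference, that condition sits in the loop that builds the list of PIDs to watch; roughly like this, assuming the github.com/mitchellh/go-ps API that the proc.PPid() / proc.Executable() calls come from (the actual pns.go differs in detail):

import ps "github.com/mitchellh/go-ps"

// collectWatchedPIDs returns the PIDs the metrics collector should wait on,
// skipping init, the collector itself, processes that are not direct
// children of init, and (as a temporary hack) the Argo executor sidecar.
func collectWatchedPIDs(thisPID int) ([]int, error) {
	procs, err := ps.Processes()
	if err != nil {
		return nil, err
	}
	pids := []int{}
	for _, proc := range procs {
		pid := proc.Pid()
		if pid == 1 || pid == thisPID || proc.PPid() != 0 || proc.Executable() == "argoexec" {
			continue
		}
		pids = append(pids, pid)
	}
	return pids, nil
}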

I have thought of the following potential solutions:

  1. Simply wrap all processes as we currently do with containers marked as training containers. This approach would be compatible with other tools that are integrating too: they might control their own sidecars. We simply wait until every sidecar completes before deleting the CRD in question.
  2. Filter all containers (similar to my temporary solution above, but more general) except for the training container. This approach is not preferred, because the status of other (potentially important) sidecars is ignored.
  3. Introduce a field in the new Trial spec API that allows specifying a list of containers that must be watched for completion.

Furthermore, since we now want to be able to watch multiple PIDs run to completion in pns.go, we want to set waitAll to true. However, there is a bug in the loop which watches for completion of all PIDs. I can make a PR for a fix, since I already had to fix it for my own Argo implementation. The bug can be seen by comparing pns.py and pns.go. In the golang version, once a single PID has completed, the error condition at the bottom is reached instead of continuing to watch the remaining PIDs. The solution would be to add an else statement:

if len(finishedPids) == len(pids) {
    return nil // every watched process has completed
} else {
    break // keep scanning for the remaining PIDs
}

and a statement at the beginning of the loop to filter out PIDs which are already completed:

// skip PIDs that have already been recorded as finished
var skip = false
for _, finishedPid := range finishedPids {
    if finishedPid == pid {
        skip = true
        break
    }
}

if skip {
    continue
}
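
Putting the two pieces together, the fixed waitAll loop would look roughly like this (a sketch only: the marker-file check and the pollInterval parameter are inferred from the behavior described above, not copied from pns.go):

import (
	"fmt"
	"io/ioutil"
	"os"
	"strings"
	"time"
)

func waitAllPIDs(pids []int, pollInterval time.Duration) error {
	finishedPids := []int{}
	for {
		for _, pid := range pids {
			// skip PIDs that have already been recorded as finished
			skip := false
			for _, finishedPid := range finishedPids {
				if finishedPid == pid {
					skip = true
					break
				}
			}
			if skip {
				continue
			}
			if _, err := os.Stat(fmt.Sprintf("/proc/%d", pid)); os.IsNotExist(err) {
				// the process is gone; verify it was marked completed
				b, err := ioutil.ReadFile(fmt.Sprintf("/var/log/katib/%d.pid", pid))
				if err != nil || strings.TrimSpace(string(b)) != "completed" {
					return fmt.Errorf("process %d exited abnormally", pid)
				}
				finishedPids = append(finishedPids, pid)
				if len(finishedPids) == len(pids) {
					return nil // every watched process completed
				} else {
					break // keep waiting for the remaining PIDs
				}
			}
		}
		time.Sleep(pollInterval)
	}
}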

If you have any questions or if I am being unclear, please let me know.

@andreyvelich
Member Author

I would like to contribute a bit to point 2 you mentioned @andreyvelich. I implemented support for the Argo Workflow CRD using the v1alpha3 Provider interface and discovered some issues which are also relevant to the new proposed API (great work btw, looks very exciting).

Thank you! You are more than welcome to contribute. It's great that Argo Workflow works with Katib.
I think any new changes we make should be done in v1beta1. We will release v1alpha3 soon.

I currently added a simple addition to the filter statement in pns.go (pid == 1 || pid == thisPID || proc.PPid() != 0 || proc.Executable() == "argoexec") as a temporary solution. However, it is clear that we should introduce a mechanism to account for this.

Did you check what the list in pids was? Why do you need to add proc.Executable() == "argoexec"?

I have thought of the following potential solutions:

  1. Simply wrap all processes as we currently do with containers marked as training containers. This approach would be compatible with other tools that are integrating too: they might control their own sidecars. We simply wait until every sidecar completes before deleting the CRD in question.

I like this approach and had the same suggestion in mind. I think the metrics collector should wait until all processes have succeeded. The only concerns I see here:

  1. What about init containers? Will we see the init containers' processes in the proc list, or will they already have succeeded by the time we inject the metrics collector container?

  2. Is there any case where other processes are still running on the Pod once the main training job is finished? How should we identify them and remove them from the proc list?

  2. Filter all containers (similar to my temporary solution above, but more general) except for the training container. This approach is not preferred, because the status of other (potentially important) sidecars is ignored.

I agree. We should think about a more general solution here. Maybe we can define a few rules for the user's CRD.

  3. Introduce a field in the new Trial spec API that allows specifying a list of containers that must be watched for completion.

I think we need to add an additional field that represents the training container. For example, a user can have an Argo Workflow as the Trial template where only the first step is the training step. I can see that a container in Argo is just a simple K8s container: https://github.com/argoproj/argo/blob/master/pkg/apis/workflow/v1alpha1/workflow_types.go#L412. Can you specify a name there if you don't want to name your container main, as you mentioned?

Furthermore, since we now want to be able to watch multiple PIDs run to completion in pns.go, we want to set waitAll to true. However, there is a bug in the loop which watches for completion of all PIDs. I can make a PR for a fix, since I already had to fix it for my own Argo implementation. The bug can be seen by comparing pns.py and pns.go. In the golang version, once a single PID has completed, the error condition at the bottom is reached instead of continuing to watch the remaining PIDs. The solution would be to add an else statement:

Sure, feel free to submit the PR.

@gaocegege
Member

We will have a meeting at https://www.timeanddate.com/worldclock/meetingdetails.html?year=2020&month=6&day=30&hour=13&min=0&sec=0&p1=136&p2=438&p3=237&p4=179

Feel free to join if you are interested in the feature.

@andreyvelich
Member Author

The meeting will be on the 16th of July at the above time (1pm UTC).
Join the #katib Slack channel for more information.

@czheng94

Some thoughts I have:

  1. How to get the succeeded state of the Job

The original logic is here and here.

To support arbitrary CRDs in TrialTemplate, we need to make this "succeeded" condition configurable:

type TrialTemplate struct {
	// Retain indicates that Trial resources must not be cleaned up
	Retain bool `json:"retain,omitempty"`

	// Source for Trial template (unstructured structure or config map)
	TrialSource `json:",inline"`

	// List of parameters that are used in Trial template
	TrialParameters []TrialParameterSpec `json:"trialParameters,omitempty"`

	// List of custom CRD condition names that represent succeeded status
	SucceededConditions []string `json:"succeededConditions,omitempty"`
}
  2. How to properly mutate the metrics collector on the training pod?

The crucial step seems to be determining whether or not a Pod has the master role. The original logic is here.

We can also provide a MasterRoleLabels option in the TrialTemplate:

type TrialTemplate struct {
	// Retain indicates that Trial resources must not be cleaned up
	Retain bool `json:"retain,omitempty"`

	// Source for Trial template (unstructured structure or config map)
	TrialSource `json:",inline"`

	// List of parameters that are used in Trial template
	TrialParameters []TrialParameterSpec `json:"trialParameters,omitempty"`

	// List of custom CRD condition names that represent succeeded status
	SucceededConditions []string `json:"succeededConditions,omitempty"`

	// Labels that determine whether or not a Pod has the master role
	MasterRoleLabels map[string]string `json:"masterRoleLabels,omitempty"`
}
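
For illustration, an Experiment could then set these fields like this (values are only examples; the job-role label follows the tf-operator convention for master Pods, other operators would use their own labels):

trialTemplate:
  ...
  succeededConditions:
    - Succeeded
  masterRoleLabels:
    job-role: master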
  3. How to specify the sidecar.istio.io/inject: false annotation.

We can let users take care of this by themselves. I believe it's less "invasive" to their custom CRD.

@andreyvelich
Member Author

@czheng94 @gaocegege @johnugeorge Thanks for attending the Katib meeting today and driving this.

From my understanding, these steps should cover the new feature:

    • The Trial controller must watch each of the CRs that the user wants to use in the template, like it now watches TFJob and PyTorchJob.

For that, we add an additional flag to the Katib controller that represents the resources that can be used in the Trial template. The Trial controller iterates over these params and creates watchers (see the sketch after the flag example below).

For example, if a Trial can run TFJob, Argo Workflow and k8s batch Jobs, the Katib controller flags would be:

args:
  - "-webhook-port=8443"
  - "-trial-resource=TFJob.v1.kubeflow.org"
  - "-trial-resource=Workflow.v1alpha1.argoproj.io"
  - "-trial-resource=Job.v1.batch"
    • Find the main Pod that needs to be injected for the distributed CR run (e.g. TFJob). In other words, refactor isMasterRole.

As @czheng94 proposed above, we add an additional parameter to TrialTemplate where we define the label name and label value for the master Pod. We need to rename "master" to a name that is not specific only to the Kubeflow training operators.

@gaocegege @johnugeorge For the API, do we want to define this parameter as map[string]string or something similar to metricsStrategies with Name and Value?

    • Do we want to support mutating the sidecar container on more than one Pod simultaneously? (Can be done later.)
    • Specify the training container name in TrialTemplate.

The sidecar container needs to know which container runs the training job. We can send this information via the API.

The API can look like this:

type TrialTemplate struct {
	...
	ContainerName string `json:"containerName,omitempty"`
	...
}
    • The metrics collector should wait until all other processes on the Pod have finished before it starts parsing.

As @nielsmeima mentioned above, using this approach helps avoid CR-specific logic. The metrics collector should start parsing only after all processes are finished.

We need to validate that distributed training that launches more than one process (e.g. https://docs.fast.ai/distributed.html#launch-your-training) also works.

    • Get the succeeded state of the custom CR.

As @czheng94 mentioned above, we update the API with:

type TrialTemplate struct {
	...
	SucceededCondition string `json:"succeededCondition,omitempty"`
	...
}

We verify that a condition with type=SucceededCondition and status=True exists in .status.conditions of the custom CR.
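
For example, with SucceededCondition set to "Succeeded", a finished TFJob would match on a status like this (the shape follows the usual Kubernetes condition conventions; exact reason and message vary by operator):

status:
  conditions:
    - type: Succeeded
      status: "True"
      reason: TFJobSucceeded
      message: TFJob has successfully completed.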

    • Add the sidecar.istio.io/inject: false annotation, which I mentioned before, to all Katib examples.

For new Experiments, users can do it manually.

I think every step can be a separate PR. We need to make sure the current Katib controller logic keeps working and implement the new feature on top of it. Later we can clean up redundant code.

What do you think @gaocegege @johnugeorge @czheng94 ?
Please, let me know if I missed something.

/cc @jlewi @sperlingxx @nielsmeima

@gaocegege
Member

Can we have a proposal for the feature? People like Jeremy and @sperlingxx may be interested in it too.

@andreyvelich
Member Author

@gaocegege Ok, I will submit a proposal.

@andreyvelich
Member Author

@nielsmeima Can you give more insight into how you were using Argo Workflows with Katib, please?

The problem that I can see is that the wait container deletes all sidecars once main is finished (ref: https://github.com/argoproj/argo/blob/master/cmd/argoexec/commands/wait.go#L33).

We inject the metrics collector as a sidecar, and it will be deleted once main is finished, whether we wait for all processes or only for the main one.

@andreyvelich
Member Author

This feature is implemented upstream.
