[Feature] Modify Job provider to support any kind of Kubernetes CRDs #1214
Comments
I would like to contribute a bit to point 2 you mentioned @andreyvelich. I implemented support for the Argo Workflow CRD using the v1alpha3 When running an Argo Workflow with a DAG template (https://argoproj.github.io/docs/argo/examples/readme.html#dag) each pod is started with two containers: I currently added a simple addition to the filter statement in I have thought of the following potential solutions:
Furthermore, since we now want to be able to watch for multiple PIDs to run to completion in
and a statement at the beginning of the loop to filter out PIDs which are already completed:
If you have any questions or if I am being unclear, please let me know. |
Thank you! You are more than welcome to contribute. It is great that Argo Workflow works with Katib.
Did you check what was the list in
I like this approach, and I had the same suggestion in mind. I think the metrics collector should wait until all processes have succeeded. There are only a few concerns that I see here:
I agree. We should think about a more general solution here. Maybe we can define a few rules for the user's CRD.
I think we need to add an additional field that represents the training container. For example, a user can have an Argo Workflow as the Trial Template where only the first step must be training. I can see that a container in Argo is just a simple K8s container: https://github.com/argoproj/argo/blob/master/pkg/apis/workflow/v1alpha1/workflow_types.go#L412. Can you specify the name there, if you don't want to name your container
Sure, feel free to submit the PR. |
We will have a meeting at https://www.timeanddate.com/worldclock/meetingdetails.html?year=2020&month=6&day=30&hour=13&min=0&sec=0&p1=136&p2=438&p3=237&p4=179 Feel free to join if you are interested in the feature. |
Meeting will be on 16th of July at the above time (1pm UTC). |
Some thoughts I have:
The original logic is here. To support arbitrary CRDs in TrialTemplate, we need to make this "succeeded" condition configurable:
type TrialTemplate struct {
    // Retain indicates that Trial resources must not be cleaned up
    Retain bool `json:"retain,omitempty"`
    // Source for Trial template (unstructured structure or config map)
    TrialSource `json:",inline"`
    // List of parameters that are used in Trial template
    TrialParameters []TrialParameterSpec `json:"trialParameters,omitempty"`
    // List of custom CRD condition names that represent succeeded status
    SucceededConditions []string
}
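To make the idea concrete, here is a minimal sketch of how a list like SucceededConditions could be checked against a custom resource. It assumes the resource has been decoded into a generic map and follows the usual Kubernetes convention of a `status.conditions` list whose entries carry `type` and `status` fields. The function name `hasSucceededCondition` is hypothetical and not part of Katib.

```go
package main

import "fmt"

// hasSucceededCondition reports whether obj (a custom resource decoded into a
// generic map) has a status.conditions entry whose type is listed in
// succeeded and whose status is "True".
// Assumption: the CRD follows the common status.conditions convention.
func hasSucceededCondition(obj map[string]interface{}, succeeded []string) bool {
	status, ok := obj["status"].(map[string]interface{})
	if !ok {
		return false
	}
	conditions, ok := status["conditions"].([]interface{})
	if !ok {
		return false
	}
	// Build a set of the condition names the user declared as "succeeded".
	wanted := make(map[string]bool, len(succeeded))
	for _, name := range succeeded {
		wanted[name] = true
	}
	for _, c := range conditions {
		cond, ok := c.(map[string]interface{})
		if !ok {
			continue
		}
		condType, _ := cond["type"].(string)
		condStatus, _ := cond["status"].(string)
		if wanted[condType] && condStatus == "True" {
			return true
		}
	}
	return false
}

func main() {
	// Illustrative object only; the condition names are examples.
	job := map[string]interface{}{
		"status": map[string]interface{}{
			"conditions": []interface{}{
				map[string]interface{}{"type": "Running", "status": "False"},
				map[string]interface{}{"type": "Complete", "status": "True"},
			},
		},
	}
	fmt.Println(hasSucceededCondition(job, []string{"Complete", "Succeeded"}))
}
```

Working on a generic map keeps the check independent of any particular CRD's Go types, which seems to be the point of supporting arbitrary resources.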
The crucial step seems to be determining whether or not a pod has the master role. We can also provide a MasterRoleLabels option in the TrialTemplate:
type TrialTemplate struct {
    // Retain indicates that Trial resources must not be cleaned up
    Retain bool `json:"retain,omitempty"`
    // Source for Trial template (unstructured structure or config map)
    TrialSource `json:",inline"`
    // List of parameters that are used in Trial template
    TrialParameters []TrialParameterSpec `json:"trialParameters,omitempty"`
    // List of custom CRD condition names that represent succeeded status
    SucceededConditions []string
    // Labels that determine whether or not a pod has the master role
    MasterRoleLabels map[string]string
}
We can let users take care of this by themselves. I believe it's less "invasive" to their custom CRD. |
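A rough sketch of the label check described above, assuming MasterRoleLabels is matched as "every listed key/value pair must be present on the pod". The function name `isMasterPod` and the label values in the example are hypothetical. Treating an empty selector as "no match" is a design choice of this sketch, not something decided in this thread.

```go
package main

import "fmt"

// isMasterPod reports whether podLabels contains every key/value pair in
// masterRoleLabels. An empty selector matches nothing in this sketch, so a
// template that does not set the field never marks a pod as master.
func isMasterPod(podLabels, masterRoleLabels map[string]string) bool {
	if len(masterRoleLabels) == 0 {
		return false
	}
	for key, want := range masterRoleLabels {
		if got, ok := podLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical labels, for illustration only.
	selector := map[string]string{"job-role": "master"}
	master := map[string]string{"job-role": "master", "app": "trial"}
	worker := map[string]string{"job-role": "worker", "app": "trial"}

	fmt.Println(isMasterPod(master, selector))
	fmt.Println(isMasterPod(worker, selector))
}
```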
@czheng94 @gaocegege @johnugeorge Thanks for attending the Katib meeting today and driving this. From my understanding, these steps should cover the new feature:
For that, we add an additional flag to the Katib controller, which represents the resources that can be used in the Trial template. The Trial controller iterates over these parameters and creates watchers. For example, if a Trial can run TFJob, Argo Workflow and k8s Batch Jobs, the Katib controller flags must be:
args:
  - "-webhook-port=8443"
  - "-trial-resource=TFJob.v1.kubeflow.org"
  - "-trial-resource=Workflow.v1alpha1.argoproj.io"
  - "-trial-resource=Job.v1.batch"
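As an illustration of how a repeatable flag in the Kind.version.group form shown above could be parsed, here is a minimal sketch using only the Go standard library. The type `groupVersionKind`, the function `parseTrialResource` and the `trialResources` flag value are hypothetical names for this example; the real controller may well use the Kubernetes apimachinery types instead.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// groupVersionKind is a hypothetical stand-in for the Kubernetes GVK type.
type groupVersionKind struct {
	Group, Version, Kind string
}

// parseTrialResource splits "Kind.version.group" into its three parts.
// The group may itself contain dots (for example "kubeflow.org").
func parseTrialResource(s string) (groupVersionKind, error) {
	parts := strings.SplitN(s, ".", 3)
	if len(parts) != 3 || parts[0] == "" || parts[1] == "" || parts[2] == "" {
		return groupVersionKind{}, fmt.Errorf("invalid trial resource %q, want Kind.version.group", s)
	}
	return groupVersionKind{Kind: parts[0], Version: parts[1], Group: parts[2]}, nil
}

// trialResources implements flag.Value so the flag can be given many times.
type trialResources []groupVersionKind

func (t *trialResources) String() string {
	if t == nil {
		return ""
	}
	return fmt.Sprint([]groupVersionKind(*t))
}

func (t *trialResources) Set(value string) error {
	gvk, err := parseTrialResource(value)
	if err != nil {
		return err
	}
	*t = append(*t, gvk)
	return nil
}

func main() {
	var resources trialResources
	fs := flag.NewFlagSet("katib-controller", flag.ContinueOnError)
	fs.Var(&resources, "trial-resource", "resource allowed in Trial templates, as Kind.version.group (repeatable)")

	args := []string{
		"-trial-resource=TFJob.v1.kubeflow.org",
		"-trial-resource=Workflow.v1alpha1.argoproj.io",
		"-trial-resource=Job.v1.batch",
	}
	if err := fs.Parse(args); err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// The controller would create one watcher per entry; here we only print.
	for _, r := range resources {
		fmt.Printf("kind=%s version=%s group=%s\n", r.Kind, r.Version, r.Group)
	}
}
```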
As @czheng94 proposed above, we add additional parameter to @gaocegege @johnugeorge For the API do want to define this parameter as
The sidecar container needs to know which container is the training job. We can send this information using the API. The API can look like this:
type TrialTemplate struct {
    ...
    ContainerName string `json:"containerName,omitempty"`
    ...
}
As @nielsmeima mentioned above, using this approach can help to avoid other logic that can be implemented inside the CR. The metrics collector should start only after all processes are finished. We need to validate that distributed training that launches more than one process (e.g. https://docs.fast.ai/distributed.html#launch-your-training) also works.
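The "wait until all processes are finished" behaviour could look roughly like the sketch below. It polls a set of PIDs, filters out the ones that have already completed at the start of each loop iteration, and returns once none are left. The name `waitForAll` and the injected `isRunning` callback are assumptions of this example; on Linux, a real check might look for `/proc/<pid>`, but that part is deliberately left out.

```go
package main

import (
	"fmt"
	"time"
)

// waitForAll polls until isRunning reports false for every PID in pids, or
// returns an error once timeout has elapsed with PIDs still running.
// isRunning is injected so the liveness check can be swapped or faked.
func waitForAll(pids []int, isRunning func(pid int) bool, interval, timeout time.Duration) error {
	pending := append([]int(nil), pids...) // copy so the caller's slice is untouched
	deadline := time.Now().Add(timeout)
	for len(pending) > 0 {
		// Filter out PIDs that have already completed.
		still := pending[:0]
		for _, pid := range pending {
			if isRunning(pid) {
				still = append(still, pid)
			}
		}
		pending = still
		if len(pending) == 0 {
			break
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out waiting for PIDs %v", pending)
		}
		time.Sleep(interval)
	}
	return nil
}

func main() {
	// Fake processes: each PID stays "running" for the given number of polls.
	remaining := map[int]int{101: 3, 102: 1}
	isRunning := func(pid int) bool {
		if remaining[pid] > 0 {
			remaining[pid]--
			return true
		}
		return false
	}
	err := waitForAll([]int{101, 102}, isRunning, 10*time.Millisecond, time.Second)
	fmt.Println("all processes finished, err =", err)
}
```

Only after `waitForAll` returns without error would the collector start parsing metrics.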
As @czheng94 mentioned above, we update the API with:
type TrialTemplate struct {
    ...
    SucceededCondition string `json:"succeededCondition,omitempty"`
    ...
}
We verify that condition with
For new Experiments, users can do it manually. I think every step can be in a separate PR. We need to make sure that the current Katib controller logic works and implement the new feature on top of it. Later we can clean up redundant code. What do you think @gaocegege @johnugeorge @czheng94 ? |
Can we have a proposal for the feature? Ppl like Jeremy and @sperlingxx may be interested in it too. |
@gaocegege Ok, I will submit proposal. |
@nielsmeima Can you give more insight into how you were using Argo Workflows with Katib, please? The problem that I can see is that we inject the metrics collector as a sidecar and it will be deleted once main is finished, whether we wait for all processes or wait only for the main process. |
This feature is implemented in upstream. |
/kind feature
After migrating to the new Trial Template design (#1202), we want to extend the current Job provider to support any kind of Kubernetes CRD that follows Trial job patterns (e.g. Argo template: #1081).
Currently, the Job provider supports only batch Jobs and Kubeflow Jobs.
We can extend the Trial Template API with custom settings to define:
sidecar.istio.io/inject: false annotation. (That can be done by the user in advance.)
Maybe we should define something more; this needs to be investigated.
Let's discuss all required changes in this issue.
/cc @gaocegege @johnugeorge @czheng94 @nielsmeima