Support argo Workflow CRD as new trial kind #1081

Closed
terrykong opened this issue Mar 10, 2020 · 15 comments · Fixed by #1605

Comments

@terrykong

/kind feature

Describe the solution you'd like
For more complicated jobs whose steps run in separate containers, I have noticed that the only job types a trial supports are Jobs, TFJobs, and PyTorchJobs. Would it be possible to also support Argo Workflows?

Anything else you would like to add:

I see that https://github.com/kubeflow/katib/blob/master/docs/new-trial-kind.md is outdated, but it sounds like, for a vanilla workflow, it should be possible to inject a metrics collector sidecar into each workflow container, right?

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
feature 0.99


@gaocegege
Member

@terrykong It is possible. It is extensible. We have a doc for it https://github.com/kubeflow/katib/blob/master/docs/new-trial-kind.md although it may be outdated.

The question is whether it is an actual use case.

@terrykong
Author

@gaocegege glad to hear it's possible. I don't know about others, but there are certainly convenient use cases covered by workflows. In particular, optimizing a training -> benchmarking pipeline where the benchmarking doesn't need GPUs, so the GPUs used by the training pod can be freed up while benchmarking runs. To my knowledge this is not possible with Jobs, and I'm guessing not with TFJobs or PyTorchJobs either.
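
As a rough illustration of the training -> benchmarking case described above, here is a minimal sketch of a two-step Argo Workflow written as a plain Python dict. The image names, template names, and GPU count are placeholders, not taken from this thread. It relies on each container template in an Argo Workflow running in its own pod, so the GPU requested by the training step is held only for the lifetime of that pod.

```python
# Hypothetical two-step Argo Workflow expressed as a plain Python dict.
# Image names, template names, and the GPU count are placeholders.
workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "train-then-benchmark-"},
    "spec": {
        "entrypoint": "main",
        "templates": [
            {
                "name": "main",
                # In Argo's `steps` syntax, the outer list items run sequentially.
                "steps": [
                    [{"name": "train", "template": "train"}],
                    [{"name": "benchmark", "template": "benchmark"}],
                ],
            },
            {
                "name": "train",
                "container": {
                    "image": "example.com/train:latest",
                    # Only the training pod requests a GPU.
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                },
            },
            {
                "name": "benchmark",
                # No GPU limit: the benchmark pod can run on a CPU-only node.
                "container": {"image": "example.com/benchmark:latest"},
            },
        ],
    },
}


def gpu_limit(template):
    """Return the GPU limit requested by a template's container, or 0."""
    container = template.get("container", {})
    limits = container.get("resources", {}).get("limits", {})
    return int(limits.get("nvidia.com/gpu", 0))


# Map each template name to the number of GPUs it requests.
gpu_by_template = {t["name"]: gpu_limit(t) for t in workflow["spec"]["templates"]}
print(gpu_by_template)
```

Serializing `workflow` to YAML or JSON would give a manifest that could be submitted to a cluster with Argo installed; how Katib would collect metrics from such a workflow is not covered by this sketch.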

@jlewi removed the feature label Mar 20, 2020
@nielsmeima

nielsmeima commented May 28, 2020

@gaocegege for us this is a use case as well. We would like to be able to tune parameters of arbitrary chains of Docker containers, e.g. executed as an Argo workflow. In such a workflow we could easily mix various languages to achieve our needs, instead of relying on a single container or a specific language. I will try to come up with an implementation, and try to commit back / propose a design if it works out. Any pointers on which parts of the doc are outdated?

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
area/docs 0.59


@gaocegege
Member

@nielsmeima Thanks for your interest! We now have a new abstraction, Provider, in pkg/job/v1beta1/provider.go. You just need to implement the provider for Argo; then it should work.

@nielsmeima

@gaocegege Thanks, I will give it a go.

@andreyvelich
Member

@nielsmeima We are also working on the new Trial Template implementation for the new version: #906 (comment).
After that, you should be able to specify any type of resource as a Trial Template.
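
To make the "any type of resource" idea concrete, here is a minimal sketch of how a kind-agnostic trial template could be rendered. It is an illustration only, not Katib's implementation: the `render` helper, the `${trialParameters.<name>}` placeholder syntax, and the parameter name `learningRate` are all assumptions made for this example. The point is that the substitution only walks strings in a nested structure, so it works the same for a Job, a TFJob, or an Argo Workflow.

```python
import re

# Assumed placeholder syntax for this sketch: ${trialParameters.<name>}
PLACEHOLDER = re.compile(r"\$\{trialParameters\.([A-Za-z0-9_]+)\}")


def render(template, params):
    """Return a copy of `template` with placeholders replaced by `params` values.

    Works on any nesting of dicts, lists, and strings, so it does not need
    to know the resource kind. Raises KeyError for an unknown parameter.
    """
    if isinstance(template, str):
        return PLACEHOLDER.sub(lambda m: str(params[m.group(1)]), template)
    if isinstance(template, list):
        return [render(item, params) for item in template]
    if isinstance(template, dict):
        return {key: render(value, params) for key, value in template.items()}
    return template  # numbers, booleans, None are returned unchanged


# Hypothetical trial spec; the image and argument names are placeholders.
trial_spec = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "spec": {
        "entrypoint": "train",
        "templates": [
            {
                "name": "train",
                "container": {
                    "image": "example.com/train:latest",
                    "args": ["--lr=${trialParameters.learningRate}"],
                },
            }
        ],
    },
}

rendered = render(trial_spec, {"learningRate": 0.01})
print(rendered["spec"]["templates"][0]["container"]["args"])
```

Because `render` builds new containers rather than mutating its input, the same `trial_spec` can be reused for every trial with a different parameter assignment.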

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
area/katib 0.57


@nielsmeima

nielsmeima commented May 28, 2020

@andreyvelich Thanks for letting me know, that looks like a much better approach. Just FYI: implementing the provider interface for Argo is very straightforward, but using the Workflow type from Argo as a dependency raises some issues. Argo depends on a different (incompatible) version of k8s.io/apimachinery than Katib, which relies on a more recent version, so the code does not compile.

The only short-term solution would be to fork k8s.io/apimachinery and depend on that fork in Katib, which is suboptimal to say the least. I will still do this, since I require Argo trials right now. However, it shows that your proposed solution will probably be much better than adding extra providers this way. It also shows that it is probably better to rely on a generated OpenAPI spec for the CRD types (e.g. for the proposed duck-typed validation) than on the Go types.
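
The duck-typed validation mentioned above can be sketched without importing any typed definitions of the resource. The following is a minimal illustration, not the validation proposed for Katib: the helper names and the choice of required fields (`apiVersion`, `kind`, `spec`) are assumptions for this example. The object is treated as untyped nested data, which is what avoids a compile-time dependency on another project's types.

```python
# Field paths every trial resource is assumed to need in this sketch.
REQUIRED_PATHS = [("apiVersion",), ("kind",), ("spec",)]


def get_path(obj, path):
    """Follow `path` through nested dicts; return None if any key is missing."""
    for key in path:
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def missing_fields(resource):
    """Return the dotted names of required fields that are absent or empty strings."""
    return [
        ".".join(path)
        for path in REQUIRED_PATHS
        if get_path(resource, path) in (None, "")
    ]


# A Workflow and a Job pass the same check; no Argo or Kubernetes types are imported.
workflow = {"apiVersion": "argoproj.io/v1alpha1", "kind": "Workflow", "spec": {}}
job = {"apiVersion": "batch/v1", "kind": "Job", "spec": {"template": {}}}
broken = {"kind": "Workflow"}

print(missing_fields(workflow), missing_fields(job), missing_fields(broken))
```

In Go, the closest analogue would probably be working with unstructured objects from k8s.io/apimachinery instead of Argo's generated types, though that choice is outside what this thread specifies.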

@stale

stale bot commented Nov 24, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@andreyvelich
Member

/lifecycle frozen
Related issue: argoproj/argo-workflows#4545.

@andreyvelich
Member

We should investigate this comment: argoproj/argo-workflows#4545 (comment) to try to support Argo workflows.
/help

@google-oss-robot

@andreyvelich:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.


@google-oss-robot added the help wanted label Jun 21, 2021
@andreyvelich
Member

/area release
/priority p1
