feature: add support for mpijob in katib #1183
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with "@googlebot I signed it!".
Hi @YuxiJin-tobeyjin. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. Approvers can indicate their approval by writing /approve in a comment. The full list of commands accepted by this bot can be found here.
@YuxiJin-tobeyjin: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.
@googlebot I signed it!
CLAs look good, thanks!
/assign @andreyvelich @johnugeorge
@YuxiJin-tobeyjin Thank you for doing this!
Can you add e2e test for MPI Job, please?
    // built-in JobRoles
    JobRole        = "job-role"
    JobRoleTF      = "tf-job-role"
    JobRolePyTorch = "pytorch-job-role"
    JobRoleMpi     = "mpi_role_type"
Are job roles in MPI named with "_" rather than "-"?
@terrytangyuan Hi, will the label be changed in the future, maybe in v1beta1 or v1?
Agree with @gaocegege. I think MPI should follow the same patterns as TF Job and Pytorch Job.
Thanks. I changed it in kubeflow/mpi-operator#252 for the v1 candidate of MPI Operator. Perhaps this PR can add support for the v1 candidate directly? The API should be relatively stable even though there isn't an official release yet.
Agree with @gaocegege.
@YuxiJin-tobeyjin can you change this PR to support the v1 version of MPI Operator?
OK, but I need to run some tests on the latest mpi-operator master branch to make sure the changes work as expected. I will do it ASAP.
Thanks for your contribution! 🎉 👍
@terrytangyuan I tried the mpi-operator master branch on our cluster with Kubernetes 1.14 and ran into some problems.
First, creating the new MPIJob CRD fails with:
unknown field "additionalPrinterColumns" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1beta1.CustomResourceValidation
After working around that, the CRD was created successfully, but creating the mpi-operator failed with:
the server could not find the requested resource (get mpijobs.kubeflow.org)
Maybe the problem is here: does it still use KubeflowV1alpha2 to get the v1 CRD?
https://github.com/kubeflow/mpi-operator/blob/acddf3028ce922e24ac3e735a2928ff4487be28f/cmd/mpi-operator.v1/app/server.go#L288
After working around that as well, mpi-operator reports:
Failed to list *v1beta1.PodGroup: podgroups.scheduling.volcano.sh
So with mpi-operator v1, must we now use Volcano as the batch scheduler? The latest kube-batch release does not support v1beta1.PodGroup, and we have already been using that release for some time...
Thanks! I fixed the CRD validation issue in kubeflow/mpi-operator#257 (comment). Yes, Volcano has better support and maintenance. You may want to update your MPI Operator deployment YAML to the version that includes permissions for Volcano resources: https://github.com/kubeflow/mpi-operator/blob/master/deploy/v1/mpi-operator.yaml
@@ -30,7 +30,7 @@ const (

 TrainingCompleted = "completed"

-DefaultFilter = `([\w|-]+)\s*=\s*((-?\d+)(\.\d+)?)`
+DefaultFilter = `([\w|-]+)\s*[:=]\s*((-?\d+)(\.\d+)?)`
Why do you want to change DefaultFilter?
In most cases the printed logs look like "filtername : *%" or "filtername = *%", so I would prefer the default filter to support both ":" and "=". Would that be better? @andreyvelich WDYT?
I'm happy with this modification.
If we change this, I think we should update the documentation about the default filter for the Metrics Collector. For example, here: https://www.kubeflow.org/docs/components/hyperparameter-tuning/experiment/#metrics-collector.
Do you think that specifying a filter for the ":" format explicitly would be inconvenient for Katib users?
E.g., as we did for the DARTS example: https://github.com/kubeflow/katib/blob/master/examples/v1alpha3/nas/darts-example-gpu.yaml#L19.
What are your thoughts, @gaocegege @johnugeorge?
I think we can keep it to enhance the default filter. But we need to open an issue to update the doc.
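To make the effect of the proposed change concrete, here is a minimal Go sketch that runs both patterns from the diff above against the two log styles mentioned in this thread. The helper name parseMetric and the sample log lines are assumptions made for illustration only; this is not Katib's actual metrics collector code.

```go
package main

import (
	"fmt"
	"regexp"
)

// Patterns copied from the diff in this PR.
const (
	oldFilter = `([\w|-]+)\s*=\s*((-?\d+)(\.\d+)?)`
	newFilter = `([\w|-]+)\s*[:=]\s*((-?\d+)(\.\d+)?)`
)

// parseMetric is a hypothetical helper (not part of Katib). It returns the
// metric name and value from the first match of filter in line.
func parseMetric(filter, line string) (name, value string, ok bool) {
	m := regexp.MustCompile(filter).FindStringSubmatch(line)
	if m == nil {
		return "", "", false
	}
	// m[1] is the metric name, m[2] is the full numeric value.
	return m[1], m[2], true
}

func main() {
	// Sample log lines are assumptions chosen to mirror the two styles.
	for _, line := range []string{"accuracy = 0.95", "accuracy: 0.95"} {
		_, _, oldOK := parseMetric(oldFilter, line)
		name, value, newOK := parseMetric(newFilter, line)
		fmt.Printf("%q old=%v new=%v name=%s value=%s\n", line, oldOK, newOK, name, value)
	}
}
```

One consequence that may be worth checking before adopting the wider pattern: any colon-separated pair of a word and a number, such as the "12:30" inside a timestamp, also satisfies the new filter and would be reported as a metric.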
    OwnerReferences: []metav1.OwnerReference{
        {
            APIVersion: "batch/v1",
            Kind:       "Job",
Not sure if it's relevant here, but FYI: we removed Job and StatefulSet for the v1 candidate, so MPI Operator now only creates pods: kubeflow/mpi-operator#203
(⊙o⊙)… we developed some features based on worker's sts property...
@terrytangyuan Is there a planned date for releasing a new mpi-operator version?
No concrete timeline yet, but it should be relatively stable now as we are getting close to graduation.
@andreyvelich The proposal was merged. Could we move this PR forward, or should we submit a new one?
Closing because #1273 will give a better solution.
What this PR does / why we need it:
Add support for mpijob in katib
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1181
Release note:
cc @gaocegege @johnugeorge @andreyvelich