Fix Never Resume Policy for Experiment #1184
Add e2e test for never resume
/lgtm
/retest
Sorry for introducing so many bugs; I am working on this ^ ^.
@@ -193,17 +193,24 @@ func (r *ReconcileExperiment) Reconcile(request reconcile.Request) (reconcile.Re
	if instance.IsCompleted() {
		// Check if completed instance is restartable
		// Experiment is restartable only if it is in succeeded state by reaching max trials
		// And Resume Policy is LongRunning
		if util.IsCompletedExperimentRestartable(instance) {
I think this block could be changed to something like the following:
	if instance.IsCompleted() {
		needRestart := false
		// Check if completed instance is restartable
		// Experiment is restartable only if it is in succeeded state by reaching max trials
		if util.IsCompletedExperimentRestartable(instance) {
			// Check if max trials is reconfigured
			if (instance.Spec.MaxTrialCount != nil &&
				*instance.Spec.MaxTrialCount != instance.Status.Trials) ||
				(instance.Spec.MaxTrialCount == nil && instance.Status.Trials != 0) {
				msg := "Experiment is restarted"
				instance.MarkExperimentStatusRestarting(util.ExperimentRestartingReason, msg)
				needRestart = true
			}
		}
		// If experiment who doesn't need to restart is completed without running trials, stop reconcile
		if !needRestart && !instance.HasRunningTrials() {
			if instance.Spec.ResumePolicy != experimentsv1alpha3.LongRunning {
				return r.terminateSuggestion(instance)
			}
			return reconcile.Result{}, nil
		}
	}
Otherwise, terminateSuggestion won't run when the corresponding experiment has completed by reaching max trials.
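As a rough, standalone illustration of the restart condition in the snippet above (not code from the PR), the check for a reconfigured max trial count can be pulled out into a small predicate. The `spec` and `status` types and the name `maxTrialsReconfigured` are hypothetical simplifications; the real Katib types carry many more fields and may use different integer types.

```go
package main

import "fmt"

// spec and status are hypothetical, trimmed-down stand-ins for the
// Experiment spec and status; only the fields the condition reads are modeled.
type spec struct {
	MaxTrialCount *int // nil means no limit is configured
}

type status struct {
	Trials int // number of trials recorded so far
}

// maxTrialsReconfigured mirrors the condition in the suggested snippet:
// a restart is warranted when the configured limit differs from the number
// of recorded trials, or when the limit is unset although trials exist.
func maxTrialsReconfigured(s spec, st status) bool {
	if s.MaxTrialCount != nil {
		return *s.MaxTrialCount != st.Trials
	}
	return st.Trials != 0
}

func main() {
	limit := 12
	fmt.Println(maxTrialsReconfigured(spec{MaxTrialCount: &limit}, status{Trials: 12})) // false
	fmt.Println(maxTrialsReconfigured(spec{MaxTrialCount: &limit}, status{Trials: 10})) // true
	fmt.Println(maxTrialsReconfigured(spec{}, status{Trials: 3}))                       // true
}
```

Splitting the nil and non-nil cases into separate branches is equivalent to the compound boolean expression in the snippet, and may be a little easier to read and test.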
@sperlingxx Why won't terminateSuggestion run when the experiment finishes by reaching max trials?
I believe the Reconcile loop runs with the Succeeded experiment state after reaching this step: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1alpha3/experiment/util/status_util.go#L182.
I am wondering, do we actually need to check whether the Experiment has running trials in the controller @gaocegege @johnugeorge: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1alpha3/experiment/experiment_controller.go#L209 ?
Or can we just return reconcile.Result{}, nil without this check if the Experiment is Completed?
@andreyvelich I think util.IsCompletedExperimentRestartable will be true when the experiment finishes by reaching max trials. But, for now, terminateSuggestion will only be called when util.IsCompletedExperimentRestartable is false.
Yeah, but util.IsCompletedExperimentRestartable returns true only if ResumePolicy: LongRunning.
https://github.com/kubeflow/katib/pull/1184/files#diff-f0faa0b63a35acff95fe5e9d93d594ffR201
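The point settled in this thread can be sketched in a few lines. This is a simplified model under stated assumptions, not the Katib implementation: the `experiment` struct, its `SucceededByMaxTrials` field, and `actionWhenCompleted` are hypothetical names, and only the two resume policies mentioned in the discussion are modeled.

```go
package main

import "fmt"

// ResumePolicy models the two policies named in the discussion.
type ResumePolicy string

const (
	NeverResume ResumePolicy = "Never"
	LongRunning ResumePolicy = "LongRunning"
)

// experiment is a hypothetical stand-in for a completed Experiment.
type experiment struct {
	ResumePolicy         ResumePolicy
	SucceededByMaxTrials bool // assumed flag: succeeded by reaching max trials
}

// isCompletedExperimentRestartable is true only when the experiment succeeded
// by reaching max trials AND its resume policy is LongRunning.
func isCompletedExperimentRestartable(e experiment) bool {
	return e.SucceededByMaxTrials && e.ResumePolicy == LongRunning
}

// actionWhenCompleted names the branch a completed experiment without
// running trials would take in this model.
func actionWhenCompleted(e experiment) string {
	switch {
	case isCompletedExperimentRestartable(e):
		return "check whether max trials was reconfigured"
	case e.ResumePolicy == NeverResume:
		return "terminate suggestion"
	default:
		return "stop reconcile"
	}
}

func main() {
	fmt.Println(actionWhenCompleted(experiment{NeverResume, true})) // terminate suggestion
	fmt.Println(actionWhenCompleted(experiment{LongRunning, true})) // check whether max trials was reconfigured
}
```

Because the predicate folds the policy check in, an experiment with `Never` that reached max trials is not restartable in this model, so it falls through to the branch that terminates the suggestion.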
@andreyvelich Oh, I got it!
@sperlingxx Everyone creates bugs. No need to worry about it.
@sperlingxx No worries! Thank you for the review.
/retest
4456981 to 863a80a
I think it was a problem with the new version of scikit-learn, 0.23.
/approve
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: andreyvelich, sperlingxx
The full list of commands accepted by this bot can be found here.
The pull request process is described here
* Fix Never Resume Suggestion
* Add e2e test for never resume
* Fix name for never resume in e2e
* Add permission on run never resume
I fixed some problems in the Never resume policy.
* IsCompletedExperimentRestartable should return true only if ResumePolicy: LongRunning.
* terminateSuggestion should run only for ResumePolicy: Never.
* terminateSuggestion could be called twice, so I added a check for whether the Suggestion has already succeeded.
* Use g.Status().Update() instead of g.Update() to update the status of the suggestion CR.
* When terminateSuggestion just returns an error, I don't think we need to requeue Reconcile with reconcile.Result{Requeue: true}.

/assign @johnugeorge @gaocegege
/cc @sperlingxx
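The "already succeeded" guard and the status-only update from the list above can be illustrated with a minimal, self-contained sketch. Everything here is hypothetical: `suggestion`, `statusWriter`, and `fakeWriter` are stand-ins invented for the example, and `UpdateStatus` only plays the role that `g.Status().Update()` plays in the PR; it is not the controller-runtime API.

```go
package main

import (
	"errors"
	"fmt"
)

// suggestion is a hypothetical stand-in for the Suggestion CR.
type suggestion struct {
	Name      string
	Succeeded bool
}

// statusWriter is an assumed interface standing in for a status-only update.
type statusWriter interface {
	UpdateStatus(s *suggestion) error
}

// fakeWriter counts calls so the example can show the guard working.
type fakeWriter struct {
	calls int
	fail  bool
}

func (w *fakeWriter) UpdateStatus(s *suggestion) error {
	w.calls++
	if w.fail {
		return errors.New("simulated status update failure")
	}
	return nil
}

// terminateSuggestion marks the suggestion as succeeded exactly once.
// A second call is a no-op, and a failed write leaves the object untouched.
func terminateSuggestion(s *suggestion, w statusWriter) error {
	if s.Succeeded {
		return nil // already terminated, nothing to write
	}
	updated := *s
	updated.Succeeded = true
	if err := w.UpdateStatus(&updated); err != nil {
		return err // the caller decides how to handle the error
	}
	*s = updated
	return nil
}

func main() {
	w := &fakeWriter{}
	s := &suggestion{Name: "example"}
	fmt.Println(terminateSuggestion(s, w), w.calls) // <nil> 1
	fmt.Println(terminateSuggestion(s, w), w.calls) // <nil> 1
}
```

In this sketch the function simply returns the error and leaves retry behavior to the caller, which matches the idea in the description of not pairing the error with an explicit `Requeue: true`.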