Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Never Resume Policy for Experiment #1184

Merged
merged 4 commits into from
May 14, 2020

Conversation

andreyvelich
Copy link
Member

I fixed some problems in Never resume policy.

  1. IsCompletedExperimentRestartable should return true only if ResumePolicy: LongRunning.
  2. Call terminateSuggestion only for ResumePolicy: Never.
  3. Reconcile can call terminateSuggestion twice. I added check if Suggestion is already succeeded.
  4. I believe we should call g.Status().Update() instead of g.Update() to update status of suggestion CR.
  5. terminateSuggestion just returns error, I don't think we need to requeue Reconcile: reconcile.Result{Requeue: true}.
  6. Fix some problems with Logging.
  7. Add e2e test for Never Resume Experiment.

/assign @johnugeorge @gaocegege
/cc @sperlingxx

Add e2e test for never resume
@kubeflow-bot
Copy link

This change is Reviewable

Copy link
Member

@gaocegege gaocegege left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@gaocegege
Copy link
Member

/retest

Copy link
Member

@sperlingxx sperlingxx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sry for making so many bugs, and I am so working on this ^ ^.

@@ -193,17 +193,24 @@ func (r *ReconcileExperiment) Reconcile(request reconcile.Request) (reconcile.Re
if instance.IsCompleted() {
// Check if completed instance is restartable
// Experiment is restartable only if it is in succeeded state by reaching max trials
// And Resume Policy is LongRunning
if util.IsCompletedExperimentRestartable(instance) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this paragraph could be modified to something below

	if instance.IsCompleted() {
		needRestart := false
		// Check if completed instance is restartable
		// Experiment is restartable only if it is in succeeded state by reaching max trials
		if util.IsCompletedExperimentRestartable(instance) {
			// Check if max trials is reconfigured
			if (instance.Spec.MaxTrialCount != nil &&
				*instance.Spec.MaxTrialCount != instance.Status.Trials) ||
				(instance.Spec.MaxTrialCount == nil && instance.Status.Trials != 0) {
				msg := "Experiment is restarted"
				instance.MarkExperimentStatusRestarting(util.ExperimentRestartingReason, msg)
				needRestart = true
			}
		}
		// If experiment who doesn't need to restart is completed without running trials, stop reconcile
		if !needRestart && !instance.HasRunningTrials() {
			if instance.Spec.ResumePolicy != experimentsv1alpha3.LongRunning {
				return r.terminateSuggestion(instance)
			}
			return reconcile.Result{}, nil
		}
	}

Otherwise, terminateSuggestion won't work when corresponding experiment has been completed for reaching max trials.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sperlingxx Why terminateSuggestion won't run when experiment finishes with reaching max trials?

I believe, Reconcile loop runs with Succeeded experiment state, after reaching this step: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1alpha3/experiment/util/status_util.go#L182.

I am wondering, do we actually need to check if Experiment has running trials in the controller @gaocegege @johnugeorge :
https://github.com/kubeflow/katib/blob/master/pkg/controller.v1alpha3/experiment/experiment_controller.go#L209 ?
Or we can just return reconcile.Result{}, nil without this check if Experiment is Completed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich I think util.IsCompletedExperimentRestartable will be true when experiment finishes with reaching max trials. But, for now, terminateSuggestion will only be called when util.IsCompletedExperimentRestartable is false.

Copy link
Member Author

@andreyvelich andreyvelich May 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but util.IsCompletedExperimentRestartable returns True only if ResumePolicy: LongRunning.

https://github.com/kubeflow/katib/pull/1184/files#diff-f0faa0b63a35acff95fe5e9d93d594ffR201

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich Oh, I got it!

@gaocegege
Copy link
Member

@sperlingxx Everyone creates bugs. Do not need to worry about it.

Copy link
Member Author

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sperlingxx No worries! Thank you for the review.

@@ -193,17 +193,24 @@ func (r *ReconcileExperiment) Reconcile(request reconcile.Request) (reconcile.Re
if instance.IsCompleted() {
// Check if completed instance is restartable
// Experiment is restartable only if it is in succeeded state by reaching max trials
// And Resume Policy is LongRunning
if util.IsCompletedExperimentRestartable(instance) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sperlingxx Why terminateSuggestion won't run when experiment finishes with reaching max trials?

I believe, Reconcile loop runs with Succeeded experiment state, after reaching this step: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1alpha3/experiment/util/status_util.go#L182.

I am wondering, do we actually need to check if Experiment has running trials in the controller @gaocegege @johnugeorge :
https://github.com/kubeflow/katib/blob/master/pkg/controller.v1alpha3/experiment/experiment_controller.go#L209 ?
Or we can just return reconcile.Result{}, nil without this check if Experiment is Completed?

@andreyvelich
Copy link
Member Author

/retest

1 similar comment
@andreyvelich
Copy link
Member Author

/retest

@k8s-ci-robot k8s-ci-robot removed the lgtm label May 13, 2020
@andreyvelich andreyvelich force-pushed the e2e-test-delete-suggestion branch from 4456981 to 863a80a Compare May 13, 2020 02:15
@andreyvelich
Copy link
Member Author

I think it was a problem with new version of Scikit Learn 0.23.
I created PR to change version to 0.22 in this PR: #1187.
I reverted this commit from that PR.

@sperlingxx
Copy link
Member

/approve

@sperlingxx
Copy link
Member

/lgtm

Copy link
Member Author

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, sperlingxx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 8aaf864 into kubeflow:master May 14, 2020
sperlingxx pushed a commit to sperlingxx/katib that referenced this pull request Jul 9, 2020
* Fix Never Resume Suggestion
Add e2e test for never resume

* Fix name for never resume in e2e

* Add permission on run never resume
@andreyvelich andreyvelich deleted the e2e-test-delete-suggestion branch October 6, 2021 00:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants