Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resume Experiment from Volume #1275

Merged

Conversation

andreyvelich
Copy link
Member

@andreyvelich andreyvelich commented Jul 21, 2020

Fixes: #1250.

This is implementation for resuming experiment from the Volume.

Steps:

  1. I added new ResumePolicy=FromVolume which represents resuming experiment from the volume.
    @gaocegege @johnugeorge @sperlingxx what do you think about the name? Should we rename it to ResumePolicy=VolumeSource or think about better name, maybe rename LongRunning also?


  2. I added ResumePolicy to Suggestion API to understand if controller needs to deploy volume for the Suggestion.
    Do we need to add some volume specification to Suggestion API also.
    For example: StorageClassName, PVC name ?

  3. If ResumePolicy=VolumeSource, PVC is created. If StorageClassName=DefaultSuggestionStorageClass, PV with local cluster path is created also.

  4. I didn’t attach Experiment labels to PVC and PV, like we did here: https://github.com/andreyvelich/katib/blob/issue-1250-resume-from-volume/pkg/controller.v1beta1/suggestion/composer/composer.go#L74 (it was done to connect service with deployment).
    Do we want to see these labels in the PVC/PV?

  5. Currently, for the PV default cluster local volume path is: prefix + <suggestion name>-<suggestion namespace>. We need to add name and namespace to not conflict with other submitted Experiments in various namespaces.
    Prefix can be changed in katib-config later.

  6. Currently, for suggestion container default path is /opt/katib/data. Also can be changed in katib-config.

  7. PVC name = <suggestion name>

  8. PV name = <suggestion name>-<suggestion namespace>

  9. I added new validation for the webhook. That is needed to prevent submitting Experiment with the same name, if corresponding PV was not deleted after deleting Experiment.
    We can't Watch for the PV, like for the PVC. And suggestion controller can't be triggered and create PV again, once PV is deleted from the cluster.

  10. I had to change suggestion status workflow to allow resuming experiment from the volume.
    This change doesn't affect current workflow for LongRunning Suggestion.
    For the ResumePolicy=VolumeSource:
    When User submits Experiment, Suggestion status:

Created (status = True) -> DeploymentReady (status = False) -> DeploymentReady (status = True) -> Running (status = True) -> (experiment is finished) -> Running (status = False) -> DeploymentReady (status = False) -> Succeeded (status = True) .

When User restarts Experiment, Suggestion status:

Running (status = True, reason="Experiment is restarting") -> (remove succeeded condition) -> DeploymentReady (status = True) -> Running (status = True, reason="SuggestionRunning")) -> (experiment is finished) -> Running (status = False) -> DeploymentReady (status = False) -> Succeeded (status = True).

This approach can avoid adding additional condition for Suggestion: Restarting.

What do you think @gaocegege @johnugeorge @sperlingxx ?

I still need to make some testing, but this PR is ready for review.
I will add unit test and e2e test in the future PR.

/assign @gaocegege @johnugeorge @sperlingxx

/cc @jlewi

@kubeflow-bot
Copy link

This change is Reviewable

@andreyvelich
Copy link
Member Author

I tested example, everything was working.
I will remove WIP status.

Since we are using PV with local host path after deletion Suggestion PV and PVC, directory on cluster will be not deleted.
I am not sure that we can/should clean-up directory in host node's filesystem.

@andreyvelich andreyvelich changed the title [WIP] Resume Experiment from Volume Resume Experiment from Volume Jul 22, 2020
@sperlingxx
Copy link
Member

sperlingxx commented Jul 23, 2020

Hi @andreyvelich , after a careful review, I am still confused about how we persist/retrieve suggestion instance to/from PV. Since in restartSuggestion function, we fetch suggestion instance via kubernetes client directly.

@andreyvelich
Copy link
Member Author

@sperlingxx Thank you for the review.

When user submits Experiment, we deploy Suggestion service and deployment with attached PVC on it.
When Experiment is succeeded, we mark Suggestion succeeded and delete deployment and service:

if instance.IsSucceeded() {
err = r.deleteDeployment(instance)
if err != nil {
return reconcile.Result{}, err
}
err = r.deleteService(instance)
if err != nil {
return reconcile.Result{}, err
}
return reconcile.Result{}, nil
}
.
We don't delete PVC and it saves information from the Suggestion deployment.
Ones user wants to restart Experiment, we change Suggestion status:
suggestion.MarkSuggestionStatusRunning(corev1.ConditionFalse, reason, msg)
.
Then it automatically runs ReconcileSuggestion
func (r *ReconcileSuggestion) ReconcileSuggestion(instance *suggestionsv1beta1.Suggestion) error {
and it creates Deployment and Service that we have deleted before.
PV/PVC is not created again because we didn't delete it.

If user deletes Experiment, it automatically clean-up all Suggestion resources (deployment/service/pvc/pv) since we have owner reference for each object.

reason := "Experiment is succeeded"
// If ResumePolicy = Never, mark suggestion status succeeded, can't be restarted
if instance.Spec.ResumePolicy == experimentsv1beta1.NeverResume {
msg := "Suggestion is succeeded, can't be restarted"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we add restart message when suggestion is terminated?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for additional information after Suggestion is succeeded for the user.
@johnugeorge You think it is not necessary?

@johnugeorge
Copy link
Member

  1. Why not?

@sperlingxx
Copy link
Member

@andreyvelich Thanks for detailed explanation. So, we assume suggestion programs can recover from disk (or maybe they are stateless) in FromVolume mode ?

@andreyvelich
Copy link
Member Author

andreyvelich commented Jul 24, 2020

@andreyvelich Thanks for detailed explanation. So, we assume suggestion programs can recover from disk (or maybe they are stateless) in FromVolume mode ?

Yes, later add functionality to set StorageClassName from the katib-config. To allow user set provisioner for the Suggestion volume.

@andreyvelich
Copy link
Member Author

  1. Why not?

@johnugeorge Not sure if it is needed for the Volume, since we have Owner Reference for the volume: https://github.com/andreyvelich/katib/blob/issue-1250-resume-from-volume/pkg/controller.v1beta1/suggestion/composer/composer.go#L288.

@andreyvelich
Copy link
Member Author

andreyvelich commented Jul 24, 2020

I renamed PVC to -, PV to -- to be consistent with Suggestion's deployment and service.
I think user can easily understand which k8s resources related to Suggestion.
/cc @johnugeorge @gaocegege

@sperlingxx
Copy link
Member

StorageClassName from

Cool!
/lgtm

@johnugeorge
Copy link
Member

/lgtm
/approve

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot removed the lgtm label Jul 27, 2020
@johnugeorge
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot merged commit 9a7d43c into kubeflow:master Jul 27, 2020
@andreyvelich andreyvelich deleted the issue-1250-resume-from-volume branch October 6, 2021 00:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Save Suggestion state in persistent volume
6 participants