Velero Restore Hooks Product Requirements Document #2679

Closed
wants to merge 2 commits into from

Conversation

stephbman
Contributor

Velero Restore Hooks - PRD (Product Requirements Document)

Relates to: #2116

Change tracking

This is a live document. You can reach me on the following channels for more information or with any questions:

Relates to Git Issues:

Background

Velero supports restore operations, but the process has gaps. Some gaps require users to manually carry out steps to start, clean up, and end a restore. Others can degrade the performance of applications running in a pod while a restore operation is carried out.

On a restore, Velero currently does not include hooks to execute a pre- or post-restore script. As a result, users are required to perform additional actions following a Velero restore operation. Some gaps that currently exist in the Velero restore process are:

  • Users can create a restore operation but have no option to customize or automate commands at the start of the restore operation
  • Users can perform post-restore operations but have no option to customize or automate commands at the end of the restore operation

Strategic Fit

Adding a restore hook action would allow Velero to unpack backed-up data in an automated way by executing commands in containers during a restore. It would also improve restore operations on a container and mitigate negative performance impacts on apps running in the container during a restore.

Purpose / Goal

The purpose of this feature is to improve the extensibility and user experience of pre- and post-restore operations for Velero users.

Goals for this feature include:

  • Provide pre-restore hooks for customizing start restore operations
  • Provide actions such as retries and event logging during restore operations
  • Provide observability/status of restore commands run in restored pods
  • Extend restore logs to include status, error codes and necessary metadata for restore commands run during restore operations for enhanced troubleshooting capabilities
  • Provide post-restore hooks for customizing end of restore operations

Non-goals

Feature Description

This feature will automate restore operations in Velero and will provide restore hook actions that allow users to execute restore commands on a container. Restore hooks include pre-hook actions and post-hook actions.

This feature will mirror the actions of a Velero backup by allowing Velero to check for any restore hooks specified for a pod.

Assumptions

Use Cases

The following use cases must be included as part of the Velero restore hooks MVP (minimum viable product).

USE CASE 1
**Title:** Create restore pre-hook
**Description:** As a user, I would like to run a Velero pre-hook to perform restore operations on a container at the start of a restore operation.
**Functional Requirements:** The restore pre-hook should allow the user to run a command on the container where the pre-hook should be executed. Similar to the backup hooks, this hook should default to running on the first container in the pod.
**Note:** If the user does not want the hook to default to the first container in the pod, the user should be able to specify the container on which to run the restore hook.
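
The container-selection rule described above could be sketched as follows. This is only an illustration of the requirement, not Velero's implementation: the function name `select_hook_container` and its argument order are hypothetical, and the pod's containers are passed in as plain arguments rather than read from a cluster.

```shell
# select_hook_container REQUESTED CONTAINER...
# Hypothetical helper. Prints REQUESTED when it is set and present in the pod,
# prints the first container when REQUESTED is empty, and fails otherwise.
# The function body is a subshell, so its variables and `exit` stay local.
select_hook_container() (
  requested="$1"
  shift
  if [ "$#" -eq 0 ]; then
    echo "error: pod has no containers" >&2
    exit 1
  fi
  if [ -z "$requested" ]; then
    echo "$1"   # default: first container in the pod
    exit 0
  fi
  for name in "$@"; do
    if [ "$name" = "$requested" ]; then
      echo "$name"
      exit 0
    fi
  done
  echo "error: container '$requested' not found in pod" >&2
  exit 1
)
```

For example, `select_hook_container "" app sidecar` selects `app`, while `select_hook_container sidecar app sidecar` selects `sidecar`.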


USE CASE 2
**Title:** Automate setting backup storage location to read-only on restore start.
**Description:** As a user, at the start of a restore operation for a specified backup name, I would like to automatically set the backup storage location to ‘read-only’ mode before the ‘velero restore create --from-backup’ command executes.


USE CASE 3

**Title:** Default to the most recent backup snapshot on restore, as an optional setting.
**Description:** As a user, once a restore operation has started, I would like Velero to create the restore using the most recent Velero backup snapshot by default.


USE CASE 4
**Title:** Annotate a specific backup snapshot to use on restore with snapshots.
**Description:** As a user, I would like the option to specify a particular backup snapshot for Velero to use during restore create, instead of the default most recent backup snapshot.


USE CASE 5
**Title:** Restore all resources by default.
**Description:** As a user, I would like to include all resources in namespaces contained in a backup by default in my restore spec.
**Note:** If Velero is asked to restore something that already exists in a pod, the restore will not report success but will still work (Ashish needs to verify).


USE CASE 6
**Title:** Exclude resources from restore.
**Description:** As a user, in my restore spec, I would like to annotate specific namespaces to exclude from a restore.


USE CASE 7
**Title:** Restore of a stateful application (unquiescing data from a quiesced backup)
**Description:** As a user, I would like to unquiesce data during a restore to prevent the need to shut down the database and disrupt the application end user experience.


USE CASE 8
**Title:** Display/surface restore status
**Description:** As a user, I would like to see the status of my restore, surfaced from the pod volume restore status.


USE CASE 9
**Title:** Retry restore upon restore failure/error/timeout
**Description:** As a user, if I see that a restore has failed, I would like Velero to retry the restore operations using the restore specification.
Retry should support the following failure/error scenarios for a restore:

  • Restore timeout (set max timeout on retry restore operation, set max number of retry attempts)
    • Retry specified restore operations after a timeout of xx ms
    • Retry specified restore operations up to x retry attempts
  • **Question:** Restore error: could not access specified backup
    • Display backup access or backup error log
    • Allow users to say ‘yes’ to allowing Velero to pick up the next most recent backup
    • Allow user to select backup (if a specific backup was previously specified)
  • Restore error (could not access backup, could not access namespace, could not access resource)
    • **Question:** What does Velero do in these cases? Is this a permanent error, where the restore operation cannot be performed? (Potentially a beyond-MVP use case.)
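
The bounded-retry behavior listed above could look roughly like the sketch below. It is an assumption about how a retry wrapper might behave, not a description of Velero: `retry_restore` is a hypothetical name, the delay is in seconds rather than milliseconds for simplicity, and the wrapped command is considered failed whenever it exits non-zero.

```shell
# retry_restore MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper. Runs COMMAND until it succeeds or MAX_ATTEMPTS is
# reached, sleeping DELAY_SECONDS between attempts.
# The function body is a subshell, so its variables and `exit` stay local.
retry_restore() (
  max_attempts="$1"
  delay_seconds="$2"
  shift 2
  attempt=1
  while :; do
    if "$@"; then
      exit 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempt(s)" >&2
      exit 1
    fi
    echo "attempt $attempt failed; retrying in ${delay_seconds}s" >&2
    sleep "$delay_seconds"
    attempt=$((attempt + 1))
  done
)
```

A per-attempt timeout could be layered on with the coreutils `timeout` command, for example `retry_restore 3 10 timeout 600 velero restore create --from-backup <BACKUP NAME> --wait`. Whether the exit code of that command reflects a failed restore is an assumption that would need to be confirmed in the technical design.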

USE CASE 10
**Title:** Return backup storage location to read-write mode
**Description:** As a user, once the restore is complete, I would like Velero to automatically revert the backup storage location to read-write mode.


USE CASE 11
**Title:** Delete all restore objects by default.
**Description:** As a user, I would like to delete all CRs associated with the restore as part of clean-up operations following the restore.


User Experience

The following is representative of what the user experience could look like for Velero restore pre-hooks and post-hooks.

Note: These examples are representative and are not to be considered for use in pre- and post-restore hook operations until the technical design is complete.

Restore Pre-Hooks

Container Command

pre.hook.restore.velero.io/container

kubectl patch backupstoragelocation <STORAGE LOCATION NAME> \
    --namespace velero \
    --type merge \
    --patch '{"spec":{"accessMode":"ReadOnly"}}'

Command Execute

pre.hook.restore.velero.io/command

Includes commands for:

  • Create

    • Create from most recent backup
    • Create from specific backup - allow user to list backups
  • Set backup storage location to read only

    kubectl patch backupstoragelocation <STORAGE LOCATION NAME> \
        --namespace velero \
        --type merge \
        --patch '{"spec":{"accessMode":"ReadOnly"}}'
    
  • Set backup storage location to read-write

    kubectl patch backupstoragelocation <STORAGE LOCATION NAME> \
        --namespace velero \
        --type merge \
        --patch '{"spec":{"accessMode":"ReadWrite"}}'
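
Taken together, the two patch commands above bracket a restore. The following is a minimal sketch of that sequence, assuming the restore is started with `velero restore create --from-backup ... --wait`. The function names and the `KUBECTL`/`VELERO` override variables are hypothetical and exist only so the sketch can be tried without a cluster.

```shell
# Overridable command names (an assumption made for testability).
: "${KUBECTL:=kubectl}"
: "${VELERO:=velero}"

# set_bsl_access_mode LOCATION MODE
# MODE is ReadOnly or ReadWrite, matching the patches shown above.
set_bsl_access_mode() {
  "$KUBECTL" patch backupstoragelocation "$1" \
    --namespace velero \
    --type merge \
    --patch "{\"spec\":{\"accessMode\":\"$2\"}}"
}

# restore_with_read_only_bsl LOCATION BACKUP
# Sets the location read-only, runs the restore, then always reverts the
# location to read-write and returns the restore's exit status.
restore_with_read_only_bsl() (
  location="$1"
  backup="$2"
  set_bsl_access_mode "$location" ReadOnly || exit 1
  status=0
  "$VELERO" restore create --from-backup "$backup" --wait || status=$?
  set_bsl_access_mode "$location" ReadWrite || exit 1
  exit "$status"
)
```

Reverting to read-write even when the restore fails is a design choice in this sketch, so that a failed restore does not leave the location stuck in read-only mode.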
    
    

Error handling

pre.hook.restore.velero.io/on-error

Timeout

pre.hook.restore.velero.io/retry
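
As a rough illustration of how the four proposed annotation keys might be applied to a pod, the sketch below uses `kubectl annotate`. The keys come from this document. The function name, the `KUBECTL` override variable, and the values `Fail` and `3` are hypothetical placeholders, since the accepted values are not defined until the technical design is complete.

```shell
# Overridable command name (an assumption made for testability).
: "${KUBECTL:=kubectl}"

# annotate_restore_pre_hook POD NAMESPACE CONTAINER COMMAND_JSON
# Hypothetical helper that sets the proposed pre-restore hook annotations.
annotate_restore_pre_hook() {
  "$KUBECTL" annotate pod "$1" \
    --namespace "$2" \
    "pre.hook.restore.velero.io/container=$3" \
    "pre.hook.restore.velero.io/command=$4" \
    "pre.hook.restore.velero.io/on-error=Fail" \
    "pre.hook.restore.velero.io/retry=3"
}
```

A call might look like `annotate_restore_pre_hook db-0 prod postgres '["/bin/sh","-c","echo ready"]'`, where the pod, namespace, container, and command are made up for the example.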

Requirements

P0 = must not ship without

(absolute requirement for MVP; engineering requirements for long-term viability usually fall in here, for example. Incompletion near or by the deadline means delaying code freeze and the GA date)

P1 = should not ship without

(required for feature to achieve general adoption, should be scoped into feature but can be pushed back to later iterations if needed)

P2 = nice to have

(opportunistic, should not be scoped into the overall feature if it has dependencies and could cause delay)

P0. Running restore hook at the start of a restore operation.

P0. Automate setting backup storage location to read-only on restore start.

P0. Automate to default to most recent backup snapshot use on restore as optional setting.

P0. Include all resources in namespaces contained in a backup by default in my restore operation.

P0. Restore of a stateful application (Unquiescing data from a quiesced backup.)

P0. Retry restore upon restore failure/error/timeout.

P0. Return backup storage location to read-write mode.

P0. Delete all restore objects by default.

P1. Annotate specific backup snapshot use on restore with snapshots.

P1. Exclude resources from restore.

P1. Display/surface restore status.

Out of scope

Verifying the integrity of a backup, resource, or other artifact will not be included in the scope of this effort.

Questions

  1. For USE CASE 1: Init vs. app containers. If multiple containers are specified for a pod, the kubelet will run each init container sequentially. Does this have an impact on things like concurrent workload processing?
  2. Can Velero allow a user to specify a specific backup if the most recent backup is not desired in a restore?
  3. If a backup specified for a restore operation fails, can Velero retry and pick up the next most recent backup in the restore?
  4. Can Velero provide a delta between the two backups if a different backup needs to be picked up (other than the most recent, because the most recent backup cannot be accessed)?
  5. What types of errors can Velero surface about backups, namespaces, pods, and resources if a backup has an issue that prevents a restore from being done?

For questions, please contact michaelmi@vmware.com, bstephanie@vmware.com

@stephbman stephbman changed the title from "Volume Backup by Default when Restic Enabled" to "Velero Restore Hooks Product Requirements Document" Jun 30, 2020
@stephbman stephbman linked an issue Jun 30, 2020 that may be closed by this pull request
@stephbman stephbman marked this pull request as draft June 30, 2020 21:57
@stephbman stephbman self-assigned this Jun 30, 2020
@stephbman stephbman marked this pull request as ready for review June 30, 2020 22:03
@stephbman stephbman marked this pull request as draft July 1, 2020 19:58
@carlisia carlisia added the kind/changelog-not-required PR does not require a user changelog. Often for docs, website, or build changes label Jul 1, 2020
@carlisia
Contributor

carlisia commented Jul 7, 2020

USE CASE 2
**Title:** Automate setting backup storage location to read-only on restore start.
**Description:** As a user, at the start of a restore operation for a specified backup name, I would like to automatically set the backup storage location to ‘read-only’ mode before the ‘velero restore create --from-backup’ command executes.

USE CASE 10
**Title:** Return backup storage location to read-write mode
**Description:** As a user, once the restore is complete, I would like Velero to automatically revert the backup storage location to read-write mode.

We should take into account that any scheduled backup for this backup storage location will fail during a restore that would have this turned on. Also, could we please get some insight for why this is needed?

@stephbman stephbman closed this Jul 7, 2020
@stephbman
Contributor Author

closing to recreate as this one had errors. Pulling in comments to new PR.
