WIP: ETCD-295: Automated Backups of Etcd #1370
Conversation
@hasbro17: This pull request references ETCD-295 which is a valid jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
To enable automated backups, a new singleton CRD can be used to specify the backup schedule and other configuration, as well as to monitor the status of the backup operations.

A new controller in the cluster-etcd-operator would then reconcile this CRD to save backups locally per the specified configuration.
Super nitpick: I think we should make the existing UpgradeBackupController more general instead of writing an entirely new controller.
edit: that's also covered below - thanks!
// e.g. retain 500MB of backups, or discard backups older than a month
// If we want to extend this API in the future we should make retention
// a type so we can add more optional fields later
RetentionCount int `json:"retentionCount"`
if we already have good ideas on other types of retention, maybe add a struct for this? I know it's TP MVP, but let's not introduce something we have to completely change in v1alpha2
> if we already have good ideas on other types of retention, maybe add a struct for this? I know it's TP MVP, but let's not introduce something we have to completely change in v1alpha2
Agreed, you want to get close on your first attempt, and this looks ripe for a discriminated union.
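For illustration, a retention union could look roughly like the sketch below. This is only a sketch: the type and field names (`RetentionPolicy`, `RetentionNumber`, `RetentionSize`, and so on) are assumptions for discussion, not the final openshift/api shape.

```go
package v1alpha1

// RetentionType is the discriminator that selects how backups are retained.
// +kubebuilder:validation:Enum:=RetentionNumber;RetentionSize
type RetentionType string

const (
	// RetentionTypeNumber keeps at most a fixed number of backups.
	RetentionTypeNumber RetentionType = "RetentionNumber"
	// RetentionTypeSize keeps backups up to a maximum total size on disk.
	RetentionTypeSize RetentionType = "RetentionSize"
)

// RetentionPolicy is a discriminated union: exactly one member should be
// set, selected by RetentionType. New retention strategies (e.g. by age)
// can be added later as additional optional members.
// +union
type RetentionPolicy struct {
	// RetentionType selects which retention member applies.
	// +unionDiscriminator
	// +kubebuilder:validation:Required
	RetentionType RetentionType `json:"retentionType"`

	// RetentionNumber configures retention by a maximum count of backups.
	// +optional
	RetentionNumber *RetentionNumberConfig `json:"retentionNumber,omitempty"`

	// RetentionSize configures retention by a maximum total size of backups.
	// +optional
	RetentionSize *RetentionSizeConfig `json:"retentionSize,omitempty"`
}

// RetentionNumberConfig bounds retention by backup count.
type RetentionNumberConfig struct {
	// +kubebuilder:validation:Minimum=1
	MaxNumberOfBackups int `json:"maxNumberOfBackups"`
}

// RetentionSizeConfig bounds retention by total backup size in GiB.
type RetentionSizeConfig struct {
	// +kubebuilder:validation:Minimum=1
	MaxSizeOfBackupsGb int `json:"maxSizeOfBackupsGb"`
}
```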
// TODO: Validation on min and max durations.
// The max should be smaller than the scheduled interval
// for the next backup, else the alert will never be generated
MaxDurationToCompleteBackup string `json:"maxDurationWithoutBackup"`
> // The max should be smaller than the scheduled interval
do you think that's easy to get out of the cron syntax?
I would maybe leave this out for now and just degrade the operator like we're currently doing in UpgradeBackupController. Taking a snapshot shouldn't take longer than a minute.
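For reference, computing the effective interval out of the cron syntax is straightforward with a parser; a small sketch, assuming the `github.com/robfig/cron/v3` package were used (not necessarily what the operator would ship):

```go
package main

import (
	"fmt"
	"time"

	"github.com/robfig/cron/v3"
)

func main() {
	// Parse a standard 5-field cron expression, e.g. "every day at 3am".
	sched, err := cron.ParseStandard("0 3 * * *")
	if err != nil {
		panic(err)
	}

	// The gap between the next two activations approximates the schedule
	// interval; any alert timeout would have to stay below it.
	next := sched.Next(time.Now())
	following := sched.Next(next)
	fmt.Printf("next run: %s, interval: %s\n", next, following.Sub(next))
}
```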
I have no preference for a timeout as long as the operator reports degraded.
To enforce the specified schedule, we can use [CronJobs](https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/) to run backup pods.

**TODO:** If load balancing between control-plane nodes is a requirement, it needs to be determined how we can achieve that with the Job or Pod spec.
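Purely as an illustration of the CronJob approach (the names, image, script path, and directories below are placeholders, not the proposed implementation), the operator could stamp out something like:

```go
package backup

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newBackupCronJob sketches a CronJob whose pod runs a backup script on a
// control-plane node and saves snapshots under a hostPath directory.
// All names here are hypothetical placeholders.
func newBackupCronJob(schedule string) *batchv1.CronJob {
	hostPathType := corev1.HostPathDirectoryOrCreate
	privileged := true
	return &batchv1.CronJob{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "etcd-backup",
			Namespace: "openshift-etcd",
		},
		Spec: batchv1.CronJobSpec{
			Schedule:          schedule,
			ConcurrencyPolicy: batchv1.ForbidConcurrent,
			JobTemplate: batchv1.JobTemplateSpec{
				Spec: batchv1.JobSpec{
					Template: corev1.PodTemplateSpec{
						Spec: corev1.PodSpec{
							RestartPolicy: corev1.RestartPolicyNever,
							// Restrict scheduling to control-plane nodes.
							NodeSelector: map[string]string{
								"node-role.kubernetes.io/master": "",
							},
							Containers: []corev1.Container{{
								Name:    "backup",
								Image:   "backup-image", // placeholder
								Command: []string{"cluster-backup.sh", "/etc/kubernetes/cluster-backup"},
								SecurityContext: &corev1.SecurityContext{Privileged: &privileged},
								VolumeMounts: []corev1.VolumeMount{{
									Name:      "backup-dir",
									MountPath: "/etc/kubernetes/cluster-backup",
								}},
							}},
							Volumes: []corev1.Volume{{
								Name: "backup-dir",
								VolumeSource: corev1.VolumeSource{
									HostPath: &corev1.HostPathVolumeSource{
										Path: "/etc/kubernetes/cluster-backup",
										Type: &hostPathType,
									},
								},
							}},
						},
					},
				},
			},
		},
	}
}
```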
actually, that's an interesting issue we wouldn't have with a single mounted PVC.
Let's say I have RetentionCount = 3: how do we know on which CP node to find and delete the respective snapshot?
The procedure also requires gaining a root shell on a control plane node. Shell access to OpenShift control plane nodes is generally discouraged due to the potential for affecting the reliability of the …
Beyond discouraged, it is not possible for hosted control planes in a managed environment, where the hosted cluster admin does not administer the environment in which the hosted control plane (and etcd) is running.
- One-time cluster backup can be triggered without requiring a root shell
- Scheduled backups can be configured after cluster installation
- Backups are saved locally on the host filesystem of control-plane nodes
How does this work for hosted control planes?
- Can the admin in the hosted cluster, who doesn't manage the control plane, initiate this backup?
- If they can, what protections are in place to ensure backups do not impact the hosting node in a way that may negatively impact other control plane pods hosted on the node where the backup is saved?
I expect this enhancement to cover only self-hosted. Hosted control planes (Hypershift) took separate ownership of etcd by developing their own operator (see https://github.com/openshift/hypershift/blob/a0377ae793b20fb565ac8852ee4db1e5bb4780da/control-plane-operator/controllers/hostedcontrolplane/etcd/reconcile.go#L1). Hypershift will have to develop their own distinct backup and restore mechanism.
The problem space for a self-hosted cluster without external management is very different from that of a hosted cluster with external management, both in terms of observability of the situation and in terms of cluster state reconciliation.
"Separate ownership" today, doesn't mean it needs to be separate forever, does it? It seems like it would be possible to move towards a unified approach. For example, in a hosted control plane world, it's not clear to me which persona(s) would have which access to backups. Possibly the Cluster Service Consumer, if we lump backup restoration in as similar to "drive upgrades". Or possibly the Cluster Service Provider, if we lump backup restoration in under "Has cluster admin management". But either way, I'd expect etcd backup/restore tooling to live on the management cluster and not leak out to the Cluster Instance ...
roles. Which would argue for a namespaced CRD, so a management cluster could have these resources in each of its per-hosted-cluster namespaces.
Interesting take; I would rather see more separation instead of more unification. For example, we rebased etcd earlier this year and discovered it broke a customer cluster. Check out etcd-io/etcd#15062
We didn't even know that Hypershift was using openshift/etcd as their etcd. They are certainly not using CEO, which makes reusing the backup CRD moot as they would have to implement their own business logic for backups anyway.
CEO is also currently a singleton in the cluster (don't think we want to change that either), so namespacing that CRD would make little sense in OCP.
- Save cluster backups to PersistentVolumes or cloud storage
  - This would be a future enhancement or extension to the API
This may be required for adoption in managed OpenShift and hosted control planes to mitigate possible risks. For automated backups to be useful for managed clusters, they should be usable by the end customer. But the end customer is not managing the control plane and does not have control over the configuration of the host volume, so there is a risk we may not be able to accept, depending on how much impact those backups have.
- Have automated backups enabled by default with cluster installation
- Save cluster backups to PersistentVolumes or cloud storage
  - This would be a future enhancement or extension to the API
- Automate cluster restoration
Is this expected as a future enhancement? It is not stated here, unlike the prior bullet.
### User Stories

- As a cluster administrator I want to initiate a cluster backup without requiring a root shell on a control plane node so as to minimize the risk involved
- As a cluster administrator I want to schedule recurring cluster backups so that I have a recent cluster state to recover from in the event of quorum loss (i.e. losing a majority of control-plane nodes)
How many backups are retained when there is this schedule?
### API Extensions

A new CRD `EtcdBackup` will be introduced to the API group `config.openshift.io` with the version `v1alpha1`. The CRD would be feature-gated until the prerequisite e2e test has an acceptable pass rate.
I think `Backup`, with a region for etcd and possibly a future region for other items: Velero, PVs, something else.
// The default "" means no backup is triggered
// +kubebuilder:default:=""
// +optional
Reason string `json:"reason"`
In our existing operators, we have a forceRedeploymentReason or some such, right? Given your need to track backups, it may be more practical to have an object like the Insights collection that indicates a desire to run a backup "now".
// Default: ""
// TODO: Define how we do validation for the format (e.g. validation webhooks)
// and the limits on the frequency to disallow unrealistic schedules (e.g. */2 * * * * every 2 mins)
// TODO: What is the behavior if the backup doesn't complete in the specified interval
Having an object like the Insights collection helps here too, since it gives an "obvious" item to delete, versus finding a specific instance of a job in a managed namespace.
// every day at 3am: 0 3 * * *
// Setting to an empty string "" means disabling scheduled backups
// Default: ""
// TODO: Define how we do validation for the format (e.g. validation webhooks)
try CEL first
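For example, a CEL rule attached to the schedule field via kubebuilder markers could reject malformed expressions without a webhook. The rule below is deliberately simplified (it does not accept comma lists or month/day names) and is only meant to show the shape of such validation, not the final rule:

```go
// Schedule defines the recurring backup schedule in Cron format,
// e.g. every day at 3am: 0 3 * * *
// Setting to an empty string "" means disabling scheduled backups
// +kubebuilder:validation:MaxLength=64
// +kubebuilder:validation:XValidation:rule="self == '' || self.matches('^([0-9*/-]+ ){4}[0-9*/-]+$')",message="schedule must be empty or a 5-field cron expression"
// +optional
Schedule string `json:"schedule"`
```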
### API Extensions

A new CRD `EtcdBackup` will be introduced to the API group `config.openshift.io` with the version `v1alpha1`. The CRD would be feature-gated until the prerequisite e2e test has an acceptable pass rate.
glad to see alpha here.
**TODO:** How do we feature gate the entire type?
For a field within a non-gated type it seems to be the tag [`// +openshift:enable:FeatureSets=TechPreviewNoUpgrade`](https://github.com/openshift/api/blob/master/example/v1/types_stable.go#L33). No such tag is visible on the [whole-type gated example](https://github.com/openshift/api/blob/1957a8d7445bf2332f027f93a24d7573f77a0dc0/example/v1alpha1/types_notstable.go#L19).
Example of a gate on an entire type here: https://github.com/openshift/api/blob/master/example/v1alpha1/0000_50_notstabletype-techpreview.crd.yaml
// Conditions represents the observations of the EtcdBackup's current state.
// TODO: Identify different condition types/reasons that will be needed.
// +optional
in the actual PR, you'll need the SSA tags.
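That is, the conditions field would likely carry the usual server-side-apply list markers along these lines (a sketch following the common `metav1.Condition` pattern):

```go
// Conditions represents the observations of the EtcdBackup's current state.
// +patchMergeKey=type
// +patchStrategy=merge
// +listType=map
// +listMapKey=type
// +optional
Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`
```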
// EtcdBackup. It corresponds to the EtcdBackup's .metadata.generation that the condition was set based upon.
// This lets us know whether the latest condition is stale.
// +optional
ObservedGeneration int64 `json:"observedGeneration,omitempty"`
This hasn't been done in our config API before. I suggest not starting now.
#### Execution method

Create a backup pod that runs the backup script on a designated master node to save the backup file on the node's host filesystem:
the job construct seems like a good choice here. I don't object to a pod, but are you sure you don't want a job instead?
- If significant, should we avoid selecting the leader to reduce the risk of overloading the leader and triggering a leader election?
- We can find the leader via the [endpoint status](https://etcd.io/docs/v3.5/dev-guide/api_reference_v3/#message-statusresponse-apietcdserverpbrpcproto), but the leader can change during the course of the backup, or we may not have the option of filtering the leader out if the backup pod is unschedulable on non-leader nodes (e.g. low disk space).
- Other than the leader IO considerations, when picking where to run and save the backup, do we need to load balance or round-robin across all control-plane nodes?
- Leaving it to the scheduler to schedule the backup pod (or having the operator save on the node it's running on) is simpler, but we could end up with all/most backups being saved on a single node. In a disaster recovery scenario we could lose that node and the backups saved on it.
I think a force assignment policy of some kind is a good choice.
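A minimal sketch of what forcing the assignment could look like (the helper name and the policy for picking the node are assumptions; only the mechanism is shown):

```go
package backup

import corev1 "k8s.io/api/core/v1"

// pinToNode forces the backup pod onto a specific control-plane node by
// hostname, bypassing free scheduling so the operator knows exactly which
// node will hold the resulting snapshot. Illustrative only.
func pinToNode(pod *corev1.Pod, nodeName string) {
	if pod.Spec.NodeSelector == nil {
		pod.Spec.NodeSelector = map[string]string{}
	}
	pod.Spec.NodeSelector["kubernetes.io/hostname"] = nodeName

	// Backup pods must still tolerate the control-plane taint.
	pod.Spec.Tolerations = append(pod.Spec.Tolerations, corev1.Toleration{
		Key:      "node-role.kubernetes.io/master",
		Operator: corev1.TolerationOpExists,
		Effect:   corev1.TaintEffectNoSchedule,
	})
}
```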
### Test Plan

Prior to merging the backup controller implementation, an e2e test would be added to openshift/origin to test the validity of the backups generated from the automated backups feature. The pass rate on this test will feature gate the API until we pass an acceptable threshold to make it available by default.
Gating by techpreview means you can merge your impl at will.
### Test Plan

Prior to merging the backup controller implementation, an e2e test would be added to openshift/origin to test the validity of the backups generated from the automated backups feature. The pass rate on this test will feature gate the API until we pass an acceptable threshold to make it available by default.
promoting to "accessible-by-default" will be allowed once there is an acceptable history of passing restore
#### Tech Preview -> GA

- TBD
This is known: a track record of reliable restore. I propose 99% on all platforms as a starting point, since 99% of our tests currently pass at this threshold and the consequences of a failed restore are extremely high.
Some updates:
The most significant change, however, is splitting up the original EtcdBackup API and controller into two different APIs and controllers, so we can better track the status of individual backup requests without lumping them into conditions on the config API (#1370 (comment)).
For now, another look at the updated Workflow and API sections would be helpful to know whether the two APIs work better for configuration and status tracking. @deads2k I still have some other ToDos to follow up on, primarily #1370 (comment), to figure out how we would enforce the retention policy for backup files spread across all nodes, but I will update in another follow-up.
7e20250 to 5ccb1a1 (compare)
- Removed `MaxDurationToCompleteBackup` alert in favor of simply making the backup controller go degraded if it notices a backup job didn't complete in some default timeout.
- Replaced `RetentionCount int` with a discriminated union field `RetentionPolicy`, which supports retention by count or size. Leaves us open to adding more (by time) later.
- Added some CEL validation rules for the API spec.
- Clarified that non-self-hosted clusters like Hypershift are out of scope for the enhancement.
- Added a section on backup timeout failure and setting a degraded condition.
- Graduation criteria states the feature gate and 99% pass rate requirement.
- Expanded the Test Plan section a little bit with how we want to validate a successful restore.
- Split up the original EtcdBackup API and controller into two different APIs and controllers:
  - `EtcdBackupRequest` for handling one-time backup requests, so we can better track the status of multiple past backups
  - `EtcdBackupSchedule` for configuring the schedule for recurring backups, which outsources the actual backup requests via `EtcdBackupRequest`
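A rough sketch of how the two APIs could divide responsibilities, reusing the names from the summary above (field shapes are illustrative, not the merged openshift/api definition):

```go
package v1alpha1

// EtcdBackupScheduleSpec configures recurring backups. A controller
// reconciling it creates one-shot EtcdBackupRequest objects per the
// schedule and prunes old backups according to the retention policy.
type EtcdBackupScheduleSpec struct {
	// Schedule is the recurring backup schedule in Cron format.
	// An empty string disables scheduled backups.
	// +optional
	Schedule string `json:"schedule"`

	// RetentionPolicy bounds how many backups (or how much backup data)
	// is kept, as sketched in the retention discussion above.
	// +optional
	RetentionPolicy RetentionPolicy `json:"retentionPolicy"`
}

// EtcdBackupRequestSpec asks for a single backup. Each request carries its
// own status and conditions, so individual past backups can be tracked
// without lumping them into conditions on the config API.
type EtcdBackupRequestSpec struct {
	// Reason records why this backup was requested (manual or scheduled).
	// +optional
	Reason string `json:"reason"`
}
```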
5ccb1a1 to 4e5b439 (compare)
@openshift-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
Going to update with the changes from the API PR so far.
@hasbro17: Reopened this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@hasbro17: This pull request references ETCD-295 which is a valid jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@hasbro17: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@hasbro17 This design is linked to a jira ticket that is in progress. Is the design complete? Whose sign-off do you need to merge this PR?
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting If this proposal is safe to close now please do so with /lifecycle stale
Stale enhancement proposals rot after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting If this proposal is safe to close now please do so with /lifecycle rotten
@dhellmann My apologies, I didn't get a chance to update this. We've had some design changes over on the API PR that we've already merged (openshift/api#1482) and we're almost finishing up the implementation. So the design for 4.14 is complete, but I need to document the API changes back here and update the proposal for the follow-ups and future goals. Will ping the reviewers once I have that done. /remove-lifecycle rotten
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting If this proposal is safe to close now please do so with /lifecycle stale
Stale enhancement proposals rot after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting If this proposal is safe to close now please do so with /lifecycle rotten
Rotten enhancement proposals close after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Reopen the proposal by commenting /close
@openshift-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
(automated message) This pull request is closed with lifecycle/rotten. The associated Jira ticket, ETCD-81, has status "Testing". Should the PR be reopened, updated, and merged? If not, removing the lifecycle/rotten label will tell this bot to ignore it in the future.
Enhancement proposal for enabling automated backups of etcd in OpenShift.
See:
https://issues.redhat.com/browse/ETCD-295
https://issues.redhat.com/browse/ETCD-81
https://issues.redhat.com/browse/OCPBU-252
/hold
For further iteration on outstanding TODOs.