Possible race condition when removing a backup in New phase and backups overlapping #4738

Closed
adriananeci opened this issue Mar 14, 2022 · 7 comments · Fixed by #5283
Labels: Area/schedule, Needs info (Waiting for information)

@adriananeci

What steps did you take and what happened:
We noticed a lot of backups stuck in the New phase on some highly loaded clusters. We believe this is because we have a schedule configured to take a backup every 15m (*/15 * * * *), but the backup itself takes more than 15m to complete (based on the difference between the Started and Completed timestamps from velero backup describe), which causes backups to overlap.

❯ velero backup get | grep New | wc -l
     504

❯ velero backup get | grep New
......................
hourly-k8s-backup-20220307151512   New          0        0          <nil>                           5d ago                       <none>
hourly-k8s-backup-20220307150012   New          0        0          <nil>                           5d ago                       <none>
hourly-k8s-backup-20220307140012   New          0        0          <nil>                           5d ago                       <none>
hourly-k8s-backup-20220307124512   New          0        0          <nil>                           5d ago                       <none>
hourly-k8s-backup-20220307113012   New          0        0          <nil>                           6d ago                       <none>
hourly-k8s-backup-20220307110012   New          0        0          <nil>                           6d ago                       <none>
hourly-k8s-backup-20220307090011   New          0        0          <nil>                           6d ago                       <none>
hourly-k8s-backup-20220307084511   New          0        0          <nil>                           6d ago                       <none>
hourly-k8s-backup-20220307081511   New          0        0          <nil>                           6d ago                       <none>
hourly-k8s-backup-20220307080011   New          0        0          <nil>                           6d ago                       <none>
hourly-k8s-backup-20220307064511   New          0        0          <nil>                           6d ago                       <none>
.....................

We tried to clean up the backups that were in the New phase using

velero backup delete $(velero backup get | grep New | awk '{print $1}') --confirm

but no cleanup was actually performed, even though the command reported no errors, e.g.

..............
Request to delete backup "hourly-k8s-backup-20220308221528" submitted successfully.
The backup will be fully deleted after all associated data (disk snapshots, backup files, restores) are removed.
..............

Most probably this is because the backup-deletion plugin already removed the storage location, since the schedule's backup TTL is configured to 24h:

❯ velero schedule get
NAME                STATUS    CREATED                         SCHEDULE       BACKUP TTL   LAST BACKUP   SELECTOR
hourly-k8s-backup   Enabled   2021-02-11 10:26:36 +0200 EET   */15 * * * *   24h0m0s      7m ago        <none>

and velero backup describe hourly-k8s-backup-20220307064511 shows:

Deletion Attempts (1 failed):
  2022-03-14 13:21:13 +0200 EET: Processed
  Errors:
    backup storage location  not found

What did you expect to happen:

  • Few or no backups in the New phase.
  • No overlapping backups. If a backup hasn't finished by the time the next scheduled run is due, maybe skip that run.
  • The ability to remove backups in the New phase even if the storage location was already cleaned up by the backup-deletion plugin. Maybe a --force flag would help.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue; for more options, please refer to velero debug --help

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>
❯ velero backup describe hourly-k8s-backup-20220307064511
Name:         hourly-k8s-backup-20220307064511
Namespace:    velero
Labels:       velero.io/schedule-name=hourly-k8s-backup

Phase:  New

Errors:    0
Warnings:  0

Namespaces:
  Included:  *
  Excluded:  default, kube-public, kube-system, monitoring, kube-node-lease, ...

Resources:
  Included:        *
  Excluded:        events, events.events.k8s.io
  Cluster-scoped:  auto

Label selector:  <none>

Storage Location:

Velero-Native Snapshot PVs:  false

TTL:  24h0m0s

Hooks:  <none>

Backup Format Version:

Started:    <n/a>
Completed:  <n/a>

Expiration:  <nil>

Velero-Native Snapshots: <none included>

Deletion Attempts (1 failed):
  2022-03-14 13:21:13 +0200 EET: Processed
  Errors:
    backup storage location  not found

Anything else you would like to add:
Using the Azure plugin

Environment:

  • Velero version (use velero version):
❯ velero version
Client:
	Version: v1.8.0
	Git commit: -
Server:
	Version: v1.7.1
  • Velero features (use velero client config get features):
❯ velero client config get features
features: <NOT SET>
  • Kubernetes version (use kubectl version):
❯ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.4", GitCommit:"e6c093d87ea4cbb530a7b2ae91e54c0842d8308a", GitTreeState:"clean", BuildDate:"2022-02-16T12:30:48Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.11", GitCommit:"27522a29febbcc4badac257763044d0d90c11abd", GitTreeState:"clean", BuildDate:"2021-09-15T19:16:25Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes installer & version: v1.20
  • Cloud provider or hardware configuration: Azure
  • OS (e.g. from /etc/os-release): Flatcar

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@qiuming-best
Contributor

qiuming-best commented Mar 21, 2022

@adriananeci If the BSL has been deleted, the backup will not be deleted automatically on server versions below 1.8.0 (not including 1.8.0); Velero 1.8.0+ has a deletion mechanism for this.
However, a backup in the New phase may still not be deletable; I'll check the code later. (This means Velero 1.8.0 may also be unable to delete a backup in the New phase.)
No overlapping backups is a reasonable requirement; we will discuss it.

@reasonerjt
Contributor

There are two problems involved:

  1. The scheduler should drop the next backup if a backup created by the same schedule is still running, to avoid the overlap.
  2. The backup deletion controller should be able to handle the deletion correctly when the backup is in the New status.

@reasonerjt
Contributor

@adriananeci
Could you please provide the complete log bundle using velero debug? There may be some error in the controller handling backup deletion.

@reasonerjt added the "Needs info (Waiting for information)" label on Mar 21, 2022
@adriananeci
Author

As I noticed, Velero logs might contain sensitive information, so I'm not sure I'll be able to provide a full bundle, since we're only seeing this problem on big clusters, which are usually the production ones.
If there are any particular logs I should pay attention to, I can check them. I've already parsed the entire log stack and didn't find anything relevant.

On the other hand, I've manually removed the Kubernetes backup objects associated with the dangling backups (in the New phase) using kubectl instead of velero backup delete (since the latter isn't working), with something like:

kubectl delete backups -n velero $(velero backup get | grep New | awk '{print $1}')

@stale

stale bot commented Jul 6, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The stale bot added the staled label on Jul 6, 2022
@adriananeci
Author

/remove-lifecycle-stale

@blackpiglet removed the staled label on Jul 7, 2022
@blackpiglet
Contributor

There are two problems involved:

  1. The scheduler should drop the next backup if a backup created by the same schedule is still running, to avoid the overlap.
  2. The backup deletion controller should be able to handle the deletion correctly when the backup is in the New status.

After investigation, I think the first problem should be fixed: the Velero schedule controller should check whether there is a backup labeled with the schedule's name whose status.Phase is New or InProgress. If such a backup exists, the schedule should skip creating the next backup.
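
A minimal sketch of what that check could look like, assuming controller-runtime's client and Velero's velerov1api package (the helper name and wiring are hypothetical, not the actual fix):

import (
	"context"

	velerov1api "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// backupInFlight reports whether a backup created by the given schedule
// is still in the New or InProgress phase, in which case the next
// scheduled backup should be skipped.
func backupInFlight(ctx context.Context, c client.Client, schedule *velerov1api.Schedule) (bool, error) {
	backups := &velerov1api.BackupList{}
	// Backups created by a schedule carry the velero.io/schedule-name label.
	if err := c.List(ctx, backups,
		client.InNamespace(schedule.Namespace),
		client.MatchingLabels{"velero.io/schedule-name": schedule.Name},
	); err != nil {
		return false, err
	}
	for i := range backups.Items {
		switch backups.Items[i].Status.Phase {
		case velerov1api.BackupPhaseNew, velerov1api.BackupPhaseInProgress:
			return true, nil
		}
	}
	return false, nil
}

The schedule controller would call this before creating the next backup and simply skip the run when it returns true.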

For the second problem, the backup deletion controller can already handle a backup whose status.Phase is New; the deletion fails because the BSL is not found, and I think this is the intended behavior (see the snippet below). Fixing the first problem should resolve the issue.

// Don't allow deleting backups in read-only storage locations
location := &velerov1api.BackupStorageLocation{}
if err := r.Get(context.Background(), client.ObjectKey{
	Namespace: backup.Namespace,
	Name:      backup.Spec.StorageLocation,
}, location); err != nil {
	if apierrors.IsNotFound(err) {
		_, err := r.patchDeleteBackupRequest(ctx, dbr, func(r *velerov1api.DeleteBackupRequest) {
			r.Status.Phase = velerov1api.DeleteBackupRequestPhaseProcessed
			r.Status.Errors = append(r.Status.Errors, fmt.Sprintf("backup storage location %s not found", backup.Spec.StorageLocation))
		})
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, errors.Wrap(err, "error getting backup storage location")
}
