Possible race condition when removing a backup in New phase and backups overlapping #4738

Closed
adriananeci opened this issue Mar 14, 2022 · 7 comments · Fixed by #5283
Labels: Area/schedule, Needs info (Waiting for information)

@adriananeci

What steps did you take and what happened:
We noticed a lot of backups stuck in the New phase on some highly loaded clusters. We believe this is because we have a schedule configured to take a backup every 15m (*/15 * * * *), but the backup itself takes more than 15m to complete (based on the difference between the Started and Completed timestamps from velero backup describe), which causes backups to overlap.

❯ velero backup get | grep New | wc -l
     504

❯ velero backup get | grep New
......................
hourly-k8s-backup-20220307151512   New          0        0          <nil>                           5d ago                       <none>
hourly-k8s-backup-20220307150012   New          0        0          <nil>                           5d ago                       <none>
hourly-k8s-backup-20220307140012   New          0        0          <nil>                           5d ago                       <none>
hourly-k8s-backup-20220307124512   New          0        0          <nil>                           5d ago                       <none>
hourly-k8s-backup-20220307113012   New          0        0          <nil>                           6d ago                       <none>
hourly-k8s-backup-20220307110012   New          0        0          <nil>                           6d ago                       <none>
hourly-k8s-backup-20220307090011   New          0        0          <nil>                           6d ago                       <none>
hourly-k8s-backup-20220307084511   New          0        0          <nil>                           6d ago                       <none>
hourly-k8s-backup-20220307081511   New          0        0          <nil>                           6d ago                       <none>
hourly-k8s-backup-20220307080011   New          0        0          <nil>                           6d ago                       <none>
hourly-k8s-backup-20220307064511   New          0        0          <nil>                           6d ago                       <none>
.....................

We tried to clean up the backups that were in the New phase using

velero backup delete $(velero backup get | grep New | awk '{print $1}') --confirm

but no cleanup was actually performed, even though the command reported no errors, e.g.

..............
Request to delete backup "hourly-k8s-backup-20220308221528" submitted successfully.
The backup will be fully deleted after all associated data (disk snapshots, backup files, restores) are removed.
..............

Most probably this is because the backup-deletion plugin already removed the storage location, since the schedule's backup TTL is configured to 24h:

❯ velero schedule get
NAME                STATUS    CREATED                         SCHEDULE       BACKUP TTL   LAST BACKUP   SELECTOR
hourly-k8s-backup   Enabled   2021-02-11 10:26:36 +0200 EET   */15 * * * *   24h0m0s      7m ago        <none>

and velero backup describe hourly-k8s-backup-20220307064511 shows:

Deletion Attempts (1 failed):
  2022-03-14 13:21:13 +0200 EET: Processed
  Errors:
    backup storage location  not found

What did you expect to happen:

  • Few or no backups in the New phase.
  • No overlapping backups. If a backup hasn't finished by the time the next scheduled run is due, maybe skip that run.
  • The ability to remove backups in the New phase even if the storage location was already cleaned up by the backup-deletion plugin. Maybe a --force flag would help.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue; for more options, please refer to velero debug --help

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>
❯ velero backup describe hourly-k8s-backup-20220307064511
Name:         hourly-k8s-backup-20220307064511
Namespace:    velero
Labels:       velero.io/schedule-name=hourly-k8s-backup

Phase:  New

Errors:    0
Warnings:  0

Namespaces:
  Included:  *
  Excluded:  default, kube-public, kube-system, monitoring, kube-node-lease, ...

Resources:
  Included:        *
  Excluded:        events, events.events.k8s.io
  Cluster-scoped:  auto

Label selector:  <none>

Storage Location:

Velero-Native Snapshot PVs:  false

TTL:  24h0m0s

Hooks:  <none>

Backup Format Version:

Started:    <n/a>
Completed:  <n/a>

Expiration:  <nil>

Velero-Native Snapshots: <none included>

Deletion Attempts (1 failed):
  2022-03-14 13:21:13 +0200 EET: Processed
  Errors:
    backup storage location  not found

Anything else you would like to add:
Using the Azure plugin

Environment:

  • Velero version (use velero version):
❯ velero version
Client:
	Version: v1.8.0
	Git commit: -
Server:
	Version: v1.7.1
  • Velero features (use velero client config get features):
❯ velero client config get features
features: <NOT SET>
  • Kubernetes version (use kubectl version):
❯ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.4", GitCommit:"e6c093d87ea4cbb530a7b2ae91e54c0842d8308a", GitTreeState:"clean", BuildDate:"2022-02-16T12:30:48Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.11", GitCommit:"27522a29febbcc4badac257763044d0d90c11abd", GitTreeState:"clean", BuildDate:"2021-09-15T19:16:25Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes installer & version: v1.20
  • Cloud provider or hardware configuration: Azure
  • OS (e.g. from /etc/os-release): Flatcar

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@qiuming-best
Contributor

qiuming-best commented Mar 21, 2022

@adriananeci If the BSL has been deleted, the backup will not be deleted automatically on server versions below 1.8.0 (not including 1.8.0); Velero 1.8.0+ has a deletion mechanism for this.
However, a backup in the New phase may still not be deletable; I'll check the code later. (This means Velero 1.8.0 may also be unable to delete a backup in the New phase.)
No overlapping backups is a reasonable requirement; we will discuss it.

@reasonerjt
Contributor

There are two problems involved:

  1. The scheduler should drop the next backup if a backup created by the same schedule is still running, to avoid the overlap.
  2. The backup deletion controller should be able to handle the deletion correctly when the backup is in the New status.

@reasonerjt
Contributor

@adriananeci
Could you please provide the complete log bundle using velero debug? There may be some error in the controller handling backup deletion.

@reasonerjt added the "Needs info (Waiting for information)" label on Mar 21, 2022
@adriananeci
Author

As I noticed, Velero logs might contain sensitive information, so I'm not sure I'll be able to provide a full bundle, since we're only seeing this problem on big clusters, which are usually the production ones.
If there are any particular logs I should pay attention to, I can check them. I've already parsed the entire log stack and didn't find anything relevant.

On the other hand, I've manually removed the Kubernetes backup objects associated with the dangling backups (in the New phase) using kubectl instead of velero backup delete (since the latter isn't working), with something like:

kubectl delete backups -n velero $(velero backup get | grep New | awk '{print $1}')

@stale

stale bot commented Jul 6, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The stale bot added the staled label on Jul 6, 2022
@adriananeci
Author

/remove-lifecycle-stale

@blackpiglet removed the staled label on Jul 7, 2022
@blackpiglet
Contributor

There are two problems involved:

  1. The scheduler should drop the next backup if a backup created by the same schedule is still running, to avoid the overlap.
  2. The backup deletion controller should be able to handle the deletion correctly when the backup is in the New status.

After investigation, I think the first problem should be fixed: the Velero schedule controller should check whether there is a backup labeled with the schedule's name whose status.Phase is New or InProgress. If such a backup exists, the schedule should skip creating the next backup.
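
A minimal sketch of what that check could look like, assuming controller-runtime's client and Velero's velerov1api package (the helper name and wiring are hypothetical, not the actual fix):

import (
	"context"

	velerov1api "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// backupInFlight reports whether a backup created by the given schedule
// is still in the New or InProgress phase, in which case the next
// scheduled backup should be skipped.
func backupInFlight(ctx context.Context, c client.Client, schedule *velerov1api.Schedule) (bool, error) {
	backups := &velerov1api.BackupList{}
	// Backups created by a schedule carry the velero.io/schedule-name label.
	if err := c.List(ctx, backups,
		client.InNamespace(schedule.Namespace),
		client.MatchingLabels{"velero.io/schedule-name": schedule.Name},
	); err != nil {
		return false, err
	}
	for i := range backups.Items {
		switch backups.Items[i].Status.Phase {
		case velerov1api.BackupPhaseNew, velerov1api.BackupPhaseInProgress:
			return true, nil
		}
	}
	return false, nil
}

The schedule controller would call this before creating the next backup and simply skip the run when it returns true.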

For the second problem, the backup deletion controller can already handle a backup whose status.Phase is New; the deletion fails because the BSL is not found, and I think this is the intended behavior (see the snippet below). Fixing the first problem should resolve the issue.

// Don't allow deleting backups in read-only storage locations
location := &velerov1api.BackupStorageLocation{}
if err := r.Get(context.Background(), client.ObjectKey{
	Namespace: backup.Namespace,
	Name:      backup.Spec.StorageLocation,
}, location); err != nil {
	if apierrors.IsNotFound(err) {
		_, err := r.patchDeleteBackupRequest(ctx, dbr, func(r *velerov1api.DeleteBackupRequest) {
			r.Status.Phase = velerov1api.DeleteBackupRequestPhaseProcessed
			r.Status.Errors = append(r.Status.Errors, fmt.Sprintf("backup storage location %s not found", backup.Spec.StorageLocation))
		})
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, errors.Wrap(err, "error getting backup storage location")
}
