
problem removing protection finalizer (Related to issue #7068) #7433

Closed
gh-tek opened this issue Feb 16, 2024 · 6 comments
Labels: Area/CSI (Related to Container Storage Interface support), area/datamover
Milestone: v1.14

gh-tek commented Feb 16, 2024

What steps did you take and what happened:
I took a backup using a volume snapshot with data movement to S3 storage, and the backup ended "PartiallyFailed". The snapshot data is never moved (it fails already at the expose step). The VolumeSnapshotContent resources are marked for deletion but never removed, because removal of the protection finalizer fails.

The error message says the expose fails: the protection-finalizer removal actually tries to change an unrelated immutable field, spec.sourceVolumeMode, on the snapshot content. The log clip below points to the (fairly recent) code attempting that removal.

What did you expect to happen:
There should never be change attempts on immutable fields. I guess the field's value is lost somewhere along the way, or the code expects it to be unset (in my case it is 'Filesystem', not nil). Removal of the protection finalizer should succeed, the snapshot content should be removed, and the backup should end with success.
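
For what it's worth, here is a minimal sketch (my own illustration, not Velero's actual code) of removing the finalizer with a JSON merge patch instead of a full object update. A patch submits only the fields being changed, so a client built against an older CRD struct can never accidentally clear spec.sourceVolumeMode:

```go
// Hypothetical sketch, not Velero's implementation: drop finalizers from a
// VolumeSnapshotContent via a JSON merge patch. Only metadata.finalizers is
// sent to the API server, so immutable spec fields are never touched.
package vscutil

import (
	"context"

	snapshotclient "github.com/kubernetes-csi/external-snapshotter/client/v6/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

func removeVSCFinalizers(ctx context.Context, c snapshotclient.Interface, vscName string) error {
	// Clears all finalizers for brevity; a real implementation would fetch the
	// object and remove only velero.io/volume-snapshot-content-protect-finalizer.
	patch := []byte(`{"metadata":{"finalizers":null}}`)
	_, err := c.SnapshotV1().VolumeSnapshotContents().
		Patch(ctx, vscName, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```

The same patch works from the command line (`kubectl patch volumesnapshotcontent <name> --type merge -p '{"metadata":{"finalizers":null}}'`) if you need to unstick a VolumeSnapshotContent left in the state shown further below.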

The following information will help us better understand what's going on:

The debug bundle has too much information to post publicly, but here is the node-agent log containing the full error:

time="2024-02-15T07:57:05Z" level=info msg="Reconcile testbackup-kjfdv" controller=dataupload dataupload=velero/testbackup-kjfdv logSource="pkg/controller/data_upload_controller.go:113"
time="2024-02-15T07:57:05Z" level=warning msg="failed to add finalizer for velero/testbackup-kjfdv and will requeue later" controller=dataupload dataupload=velero/testbackup-kjfdv logSource="pkg/controller/data_upload_controller.go:144"
time="2024-02-15T07:57:05Z" level=info msg="Reconcile testbackup-kjfdv" controller=dataupload dataupload=velero/testbackup-kjfdv logSource="pkg/controller/data_upload_controller.go:113"
time="2024-02-15T07:57:05Z" level=info msg="Data upload starting" controller=dataupload dataupload=velero/testbackup-kjfdv logSource="pkg/controller/data_upload_controller.go:161"
time="2024-02-15T07:57:05Z" level=info msg="Accepting data upload testbackup-kjfdv" logSource="pkg/controller/data_upload_controller.go:706"
time="2024-02-15T07:57:05Z" level=info msg="This datauplod has been accepted by others" Dataupload=testbackup-kjfdv logSource="pkg/controller/data_upload_controller.go:735"
time="2024-02-15T07:57:05Z" level=info msg="Reconcile testbackup-kjfdv" controller=dataupload dataupload=velero/testbackup-kjfdv logSource="pkg/controller/data_upload_controller.go:113"
time="2024-02-15T07:57:08Z" level=info msg="Reconcile testbackup-kjfdv" controller=dataupload dataupload=velero/testbackup-kjfdv logSource="pkg/controller/data_upload_controller.go:113"
time="2024-02-15T07:57:08Z" level=info msg="Reconcile testbackup-kjfdv" controller=dataupload dataupload=velero/testbackup-kjfdv logSource="pkg/controller/data_upload_controller.go:113"
time="2024-02-15T07:57:10Z" level=info msg="Reconcile testbackup-wpltz" controller=dataupload dataupload=velero/testbackup-wpltz logSource="pkg/controller/data_upload_controller.go:113"
time="2024-02-15T07:57:10Z" level=info msg="Data upload starting" controller=dataupload dataupload=velero/testbackup-wpltz logSource="pkg/controller/data_upload_controller.go:161"
time="2024-02-15T07:57:10Z" level=info msg="Accepting data upload testbackup-wpltz" logSource="pkg/controller/data_upload_controller.go:706"
time="2024-02-15T07:57:10Z" level=info msg="This datauplod has been accepted by ibuki01" Dataupload=testbackup-wpltz logSource="pkg/controller/data_upload_controller.go:731"
time="2024-02-15T07:57:10Z" level=info msg="Data upload is accepted" controller=dataupload dataupload=velero/testbackup-wpltz logSource="pkg/controller/data_upload_controller.go:173"
time="2024-02-15T07:57:10Z" level=info msg="Exposing CSI snapshot" logSource="pkg/exposer/csi_snapshot.go:101" owner=testbackup-wpltz
time="2024-02-15T07:57:10Z" level=info msg="Volumesnapshot is ready" logSource="pkg/exposer/csi_snapshot.go:108" owner=testbackup-wpltz
time="2024-02-15T07:57:10Z" level=info msg="Got VSC from VS in namespace postgres" logSource="pkg/exposer/csi_snapshot.go:115" owner=testbackup-wpltz vs name=velero-pgdata-pg-passbolt-0-s2mn2 vsc name=snapcontent-496987f0-cd44-445e-a680-962ba2dd65f3
time="2024-02-15T07:57:10Z" level=info msg="Finished to retain VSC" logSource="pkg/exposer/csi_snapshot.go:122" owner=testbackup-wpltz retained=true vsc name=snapcontent-496987f0-cd44-445e-a680-962ba2dd65f3
time="2024-02-15T07:57:13Z" level=info msg="VS is deleted in namespace postgres" logSource="pkg/exposer/csi_snapshot.go:135" owner=testbackup-wpltz vs name=velero-pgdata-pg-passbolt-0-s2mn2
1.7079838330763552e+09	ERROR	Reconciler error	{"controller": "dataupload", "controllerGroup": "velero.io", "controllerKind": "DataUpload", "dataUpload": {"name":"testbackup-wpltz","namespace":"velero"}, "namespace": "velero", "name": "testbackup-wpltz", "reconcileID": "c7239a65-f7e3-43ee-9ca3-5eeaa410cc94", "error": "error to remove protect from volume snapshot content: error to update VolumeSnapshotContent snapcontent-496987f0-cd44-445e-a680-962ba2dd65f3: admission webhook \"snapshot-validation-webhook.snapshot.storage.k8s.io\" denied the request: Spec.SourceVolumeMode is immutable but was changed from Filesystem to nil"}
errorVerbose (unescaped for readability):
admission webhook "snapshot-validation-webhook.snapshot.storage.k8s.io" denied the request: Spec.SourceVolumeMode is immutable but was changed from Filesystem to nil
error to update VolumeSnapshotContent snapcontent-496987f0-cd44-445e-a680-962ba2dd65f3
github.com/vmware-tanzu/velero/pkg/util/csi.RemoveVSCProtect.func1
	/go/src/github.com/vmware-tanzu/velero/pkg/util/csi/volume_snapshot.go:170
k8s.io/apimachinery/pkg/util/wait.PollImmediate.ConditionFunc.WithContext.func1
	/go/pkg/mod/k8s.io/apimachinery@v0.25.6/pkg/util/wait/wait.go:222
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.25.6/pkg/util/wait/wait.go:235
k8s.io/apimachinery/pkg/util/wait.poll
	/go/pkg/mod/k8s.io/apimachinery@v0.25.6/pkg/util/wait/wait.go:582
k8s.io/apimachinery/pkg/util/wait.PollImmediateWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.25.6/pkg/util/wait/wait.go:528
k8s.io/apimachinery/pkg/util/wait.PollImmediate
	/go/pkg/mod/k8s.io/apimachinery@v0.25.6/pkg/util/wait/wait.go:514
github.com/vmware-tanzu/velero/pkg/util/csi.RemoveVSCProtect
	/go/src/github.com/vmware-tanzu/velero/pkg/util/csi/volume_snapshot.go:156
github.com/vmware-tanzu/velero/pkg/exposer.(*csiSnapshotExposer).Expose
	/go/src/github.com/vmware-tanzu/velero/pkg/exposer/csi_snapshot.go:137
github.com/vmware-tanzu/velero/pkg/controller.(*DataUploadReconciler).Reconcile
	/go/src/github.com/vmware-tanzu/velero/pkg/controller/data_upload_controller.go:188
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:121
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:320
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:273
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:234
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1650
error to remove protect from volume snapshot content
github.com/vmware-tanzu/velero/pkg/exposer.(*csiSnapshotExposer).Expose
	/go/src/github.com/vmware-tanzu/velero/pkg/exposer/csi_snapshot.go:139
github.com/vmware-tanzu/velero/pkg/controller.(*DataUploadReconciler).Reconcile
	/go/src/github.com/vmware-tanzu/velero/pkg/controller/data_upload_controller.go:188
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:121
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:320
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:273
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:234
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1650
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:273
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:234
time="2024-02-15T07:57:13Z" level=info msg="Reconcile testbackup-wpltz" controller=dataupload dataupload=velero/testbackup-wpltz logSource="pkg/controller/data_upload_controller.go:113"
time="2024-02-15T07:57:13Z" level=info msg="Reconcile testbackup-wpltz" controller=dataupload dataupload=velero/testbackup-wpltz logSource="pkg/controller/data_upload_controller.go:113"

Snapshot content is left in state:

Name:         snapcontent-496987f0-cd44-445e-a680-962ba2dd65f3
Namespace:    
Labels:       <none>
Annotations:  snapshot.storage.kubernetes.io/deletion-secret-name: rook-csi-cephfs-provisioner
              snapshot.storage.kubernetes.io/deletion-secret-namespace: rook-ceph
              snapshot.storage.kubernetes.io/volumesnapshot-being-deleted: yes
API Version:  snapshot.storage.k8s.io/v1
Kind:         VolumeSnapshotContent
Metadata:
  Creation Timestamp:             2024-02-15T07:57:05Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2024-02-15T07:57:13Z
  Finalizers:
    velero.io/volume-snapshot-content-protect-finalizer

...

Spec:
  Deletion Policy:  Retain
  Driver:           rook-ceph.cephfs.csi.ceph.com
  Source:
    Volume Handle:             0001-0009-rook-ceph-0000000000000001-9abde5e0-6841-49d3-a146-abc0dc3f7e07
  Source Volume Mode:          Filesystem
  Volume Snapshot Class Name:  rook-ceph-snapclass
  Volume Snapshot Ref:
    API Version:       snapshot.storage.k8s.io/v1
    Kind:              VolumeSnapshot
    Name:              velero-pgdata-pg-passbolt-0-s2mn2
    Namespace:         postgres
    Resource Version:  31363782
    UID:               496987f0-cd44-445e-a680-962ba2dd65f3
Status:
  Creation Time:    1707983826107097000
  Ready To Use:     true
  Restore Size:     10737418240
  Snapshot Handle:  0001-0009-rook-ceph-0000000000000001-d352618f-11fb-484f-9406-560161f3a123
Events:             <none>

Anything else you would like to add:

This could well be a configuration error on my side (which I haven't figured out yet), but I think trying to change an immutable field on the snapshot content resource is some kind of bug that should not happen. I'm also using the latest external snapshotter on my cluster, with the latest snapshot CRDs.

Also, I have verified that snapshotting works with --snapshot-move-data=false: the snapshot is saved and is restorable. But when trying to use the default snapshot data mover, it fails before moving anything (already at the expose step).

This is a new install and I have never used Velero before, so I have no experience with older versions. Snapshotting works without data movement, and file system backup (FSB) also works, but I would rather use snapshotting with data movement.

Environment:

  • Velero version (use velero version): 1.13.0, csi add-on 0.7.0
  • Velero features (use velero client config get features): EnableCSI
  • Kubernetes version (use kubectl version): 1.28
  • Kubernetes installer & version: K3S
  • Cloud provider or hardware configuration: on premise with rook-ceph CephFS storage
  • rook-ceph 1.13.4, snapshot-controller 7.0.1

gh-tek commented Feb 16, 2024

Shortly after writing this, I figured out that this is related to a fairly recent change in the external snapshotter. I have version 7.0.1: https://github.com/kubernetes-csi/external-snapshotter/tree/v7.0.1

I found out that there is a parameter --prevent-volume-mode-conversion whose default value recently changed from false to true as that feature progresses. That is what enforces validation of the aforementioned field.

I set --prevent-volume-mode-conversion=false and Velero started working; the backup went just fine.
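
For anyone else hitting this, the flag lives in the container args of the snapshot-controller (and the csi-snapshotter sidecar, if you run one). A sketch of the relevant Deployment fragment; the names, namespace, and image tag are illustrative assumptions, so check your own manifests:

```yaml
# Illustrative fragment only; Deployment name, namespace, and image tag are
# assumptions and may differ in your cluster.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: snapshot-controller
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: snapshot-controller
          image: registry.k8s.io/sig-storage/snapshot-controller:v7.0.1
          args:
            - --leader-election=true
            # Defaults to true as of external-snapshotter v7.0; setting it to
            # false restores the pre-v7.0 behavior as a temporary workaround.
            - --prevent-volume-mode-conversion=false
```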

So I guess this may not be a bug but more of a feature request: Velero probably needs changes because of that change in the snapshotter. I'd expect it to work with the default setting too.

Anyway, backup works for me right now, but if you need help with testing this, I can help with that.

Lyndon-Li (Contributor) commented:

This is due to the old SDK version Velero is using, in which the Spec.SourceVolumeMode field doesn't exist. So the problem only happens in environments that use external snapshotter v7.0 or higher.
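
To make the mechanism concrete, a standalone sketch (hypothetical structs, not the real API types): when the client's struct predates the field, a read-modify-write round trip silently drops it, so the resulting Update asks the API server to change sourceVolumeMode from Filesystem to nil:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// newSpec mimics the server-side object, which carries sourceVolumeMode.
type newSpec struct {
	Driver           string  `json:"driver"`
	SourceVolumeMode *string `json:"sourceVolumeMode,omitempty"`
}

// oldSpec mimics a client built against an SDK that predates the field.
type oldSpec struct {
	Driver string `json:"driver"`
}

func main() {
	mode := "Filesystem"
	serverJSON, _ := json.Marshal(newSpec{
		Driver:           "rook-ceph.cephfs.csi.ceph.com",
		SourceVolumeMode: &mode,
	})

	var decoded oldSpec
	_ = json.Unmarshal(serverJSON, &decoded) // sourceVolumeMode is silently dropped here

	roundTripped, _ := json.Marshal(decoded)
	fmt.Println(string(roundTripped)) // {"driver":"rook-ceph.cephfs.csi.ceph.com"}
	// An Update built from this round-tripped object asks the API server to
	// change sourceVolumeMode from "Filesystem" to nil, which the validation
	// webhook rejects once --prevent-volume-mode-conversion defaults to true.
}
```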

@Lyndon-Li Lyndon-Li added this to the v1.14 milestone Feb 19, 2024
@Lyndon-Li Lyndon-Li added the Area/CSI Related to Container Storage Interface support label Feb 19, 2024

ksyblast commented Feb 21, 2024

We have the same issue with the storage class driver linstor.csi.linbit.com:
Operation Error: error to expose snapshot: error to remove protect from volume snapshot content: error to update VolumeSnapshotContent snapcontent-2a34e711-1149-4dd5-a80a-2161bac0fca6: admission webhook "snapshot-validation-webhook.snapshot.storage.k8s.io" denied the request: Spec.SourceVolumeMode is immutable but was changed from Filesystem to nil

However, the issue is intermittent and sometimes the backup works. No idea why it still works from time to time.

velero version: v1.12.2, csi addon 0.7.0
K8s version: v1.27.5, bare metal
CSI driver: piraeusdatastore CSI driver v1.3.0, snapshot-controller 7.0.1

ywk253100 (Contributor) commented:

Hi @gh-tek @ksyblast, I upgraded several packages used by Velero to fix this issue, but we have no environment with the latest version of the snapshot-controller installed. Could you help us verify the fix with the images yinw/velero:lib-bump01 and yinw/velero-plugin-for-csi:lib-bump01?

gh-tek commented Mar 11, 2024

Hi @ywk253100,

Sorry for the delay, Gmail decided to throw my emails into Junk and I missed your message :)

I tested your images:

I have been on Velero 1.13, CSI plugin 0.7.0, and external snapshotter 6.3.3, which is a working setup.

I upgraded the external snapshotter to 7.0.1 and tested taking a backup; it failed as expected.
I then switched to the bump images you gave and retested the same backup. It now works flawlessly with the newest external snapshotter!

Based on this test, the fix should be OK to merge. I will now go back to the old version and wait for the actual release. Thanks!

ywk253100 (Contributor) commented:

Fixed by #7489
