
Container Volumes not showing in vCenter after CSI driver upgrade #2863

Open · NelSat opened this issue Apr 22, 2024 · 5 comments
Labels: kind/bug (Categorizes issue or PR as related to a bug.)

NelSat commented Apr 22, 2024

/kind bug

What happened:

We have five Kubernetes clusters running on Rancher/RKE with the vSphere CSI driver installed. We recently upgraded our clusters to Kubernetes 1.24, then upgraded the CPI to 1.24 and the vSphere CSI driver from 2.6.4 to 3.0.3. Most clusters seem to be running fine after the upgrade, but in one specific cluster we ran into trouble after a few days. Some statefulsets stopped working with errors like "Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[dshm data]: timed out waiting for the condition". While checking the Monitor tab in vCenter, we realized that none of the PVs from that cluster were being displayed anymore. The vsphere-syncer container shows the following errors in its log:

{"level":"error","time":"2024-04-22T22:25:14.807503295Z","caller":"migration/migration.go:225","msg":"failed to register volume for volumeSpec: &{/verdaccio/storage }, with err: failed to extract datastore name from in-tree volume path: \"/verdaccio/storage\"","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/apis/migration.(*volumeMigration).GetVolumeID\n\t/build/pkg/apis/migration/migration.go:225\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/syncer.CsiFullSync\n\t/build/pkg/syncer/fullsync.go:98\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/syncer.InitMetadataSyncer.func11\n\t/build/pkg/syncer/metadatasyncer.go:549"}
{"level":"error","time":"2024-04-22T22:25:14.807520119Z","caller":"syncer/fullsync.go:100","msg":"FullSync for VC pvcn01.sof.intra: Failed to get VolumeID from volumeMigrationService for spec: &{/verdaccio/storage }. Err: failed to extract datastore name from in-tree volume path: \"/verdaccio/storage\"","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/syncer.CsiFullSync\n\t/build/pkg/syncer/fullsync.go:100\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/syncer.InitMetadataSyncer.func11\n\t/build/pkg/syncer/metadatasyncer.go:549"}
{"level":"info","time":"2024-04-22T22:25:14.807542385Z","caller":"syncer/metadatasyncer.go:551","msg":"CSI full sync failed with error: failed to extract datastore name from in-tree volume path: \"/verdaccio/storage\""}
{"level":"info","time":"2024-04-22T22:55:14.797778219Z","caller":"syncer/metadatasyncer.go:536","msg":"fullSync is triggered"}
{"level":"info","time":"2024-04-22T22:55:14.79886185Z","caller":"syncer/fullsync.go:46","msg":"FullSync for VC pvcn01.sof.intra: start"}
{"level":"error","time":"2024-04-22T22:55:14.805566227Z","caller":"migration/migration.go:449","msg":"failed to extract datastore name from in-tree volume path: \"/verdaccio/storage\"","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/apis/migration.(*volumeMigration).registerVolume\n\t/build/pkg/apis/migration/migration.go:449\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/apis/migration.(*volumeMigration).GetVolumeID\n\t/build/pkg/apis/migration/migration.go:223\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/syncer.CsiFullSync\n\t/build/pkg/syncer/fullsync.go:98\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/syncer.InitMetadataSyncer.func11\n\t/build/pkg/syncer/metadatasyncer.go:549"}
{"level":"error","time":"2024-04-22T22:55:14.805718644Z","caller":"migration/migration.go:225","msg":"failed to register volume for volumeSpec: &{/verdaccio/storage }, with err: failed to extract datastore name from in-tree volume path: \"/verdaccio/storage\"","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/apis/migration.(*volumeMigration).GetVolumeID\n\t/build/pkg/apis/migration/migration.go:225\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/syncer.CsiFullSync\n\t/build/pkg/syncer/fullsync.go:98\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/syncer.InitMetadataSyncer.func11\n\t/build/pkg/syncer/metadatasyncer.go:549"}
{"level":"error","time":"2024-04-22T22:55:14.805769516Z","caller":"syncer/fullsync.go:100","msg":"FullSync for VC pvcn01.sof.intra: Failed to get VolumeID from volumeMigrationService for spec: &{/verdaccio/storage }. Err: failed to extract datastore name from in-tree volume path: \"/verdaccio/storage\"","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/syncer.CsiFullSync\n\t/build/pkg/syncer/fullsync.go:100\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/syncer.InitMetadataSyncer.func11\n\t/build/pkg/syncer/metadatasyncer.go:549"}
{"level":"info","time":"2024-04-22T22:55:14.805823601Z","caller":"syncer/metadatasyncer.go:551","msg":"CSI full sync failed with error: failed to extract datastore name from in-tree volume path: \"/verdaccio/storage\""}

Looking back at the vCenter events, right after the update we started getting lots of "Sync volume" and "Attach container volume" events from the nodes belonging to the problematic cluster.

I googled similar errors that occurred in much older versions of the CSI driver, but none of the workarounds, like reconciling the CNS catalog, worked for me. Also, from what I understand, all in-tree volumes should have been migrated to CSI long before this last update, since we were running version 2.6.4 with migration enabled.
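
For reference, a quick way to list any PVs still defined with the in-tree vSphere volume plugin, and to spot a volumePath missing its [datastore] prefix like the one in the syncer error above (a sketch; assumes kubectl and jq are available):

    # Show PVs that still use the in-tree vSphere plugin, together with their volumePath.
    # A healthy in-tree path looks like "[datastore1] kubevols/disk.vmdk"; a path
    # without the "[datastore]" prefix matches the "failed to extract datastore name"
    # error in the syncer log.
    kubectl get pv -o json \
      | jq -r '.items[]
          | select(.spec.vsphereVolume != null)
          | "\(.metadata.name)\t\(.spec.vsphereVolume.volumePath)"'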

I tried updating the CSI driver to 3.0.3rc1, but the only thing that changed was the frequency of the syncer's fullSync calls.

I'm able to create new volumes, and they all show up in vCenter normally.

What you expected to happen:

CSI migration to run smoothly.

How to reproduce it (as minimally and precisely as possible):

Upgraded the CPI to 1.24 via the Helm chart, then the CSI driver to 3.0.3 by applying the manifests according to the VMware documentation.
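
For context, the steps looked roughly like this (a sketch; the Helm repo, chart name, and manifest URL below are assumptions based on the upstream projects, not copied from our setup):

    # Upgrade the vSphere CPI via its Helm chart (repo and chart names assumed)
    helm repo add vsphere-cpi https://kubernetes.github.io/cloud-provider-vsphere
    helm repo update
    helm upgrade --install vsphere-cpi vsphere-cpi/vsphere-cpi --namespace kube-system

    # Upgrade the vSphere CSI driver by applying the upstream v3.0.3 manifest
    # (assumes the vmware-system-csi namespace from a previous install already exists)
    kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/vsphere-csi-driver/v3.0.3/manifests/vanilla/vsphere-csi-driver.yaml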

Anything else we need to know?:
Attacher logs:

I0422 23:25:51.516461       1 controller.go:167] Ignoring VolumeAttachment "csi-60dadb2af1c06faf14306ee6cbad41d4202178189ac514a2f776f5b66a9dd129" change
I0422 23:25:51.517738       1 csi_handler.go:607] Saving detach error to "csi-5e8922476ef04758f8a871589d4c12c18fb9dd999fcbfdcf9dff4e40477fe085"
I0422 23:25:51.519699       1 csi_handler.go:618] Saved detach error to "csi-bd4c20e084df667d96b57fc08842d7faa9d36d01894163805e4da34469e09e9f"
I0422 23:25:51.519743       1 csi_handler.go:234] Error processing "csi-bd4c20e084df667d96b57fc08842d7faa9d36d01894163805e4da34469e09e9f": failed to detach: rpc error: code = Internal desc = volumeID "ce02b2b4-425b-4eb9-903b-9356f3360c0f" not found in QueryVolume
I0422 23:25:51.520500       1 controller.go:167] Ignoring VolumeAttachment "csi-bd4c20e084df667d96b57fc08842d7faa9d36d01894163805e4da34469e09e9f" change
I0422 23:25:51.522020       1 csi_handler.go:607] Saving detach error to "csi-9972fa01b2c76c5dce6ca2a68202eea7f3c2a4ca99c6aad0dcbff9755db9cf79"
I0422 23:25:51.522652       1 controller.go:167] Ignoring VolumeAttachment "csi-47a42b20c1708c01cad92d6f99c3f2bbb0e7c281be6df195a8a90ca6cd528268" change
I0422 23:25:51.522784       1 csi_handler.go:618] Saved detach error to "csi-47a42b20c1708c01cad92d6f99c3f2bbb0e7c281be6df195a8a90ca6cd528268"
I0422 23:25:51.522827       1 csi_handler.go:234] Error processing "csi-47a42b20c1708c01cad92d6f99c3f2bbb0e7c281be6df195a8a90ca6cd528268": failed to detach: rpc error: code = Internal desc = volumeID "eb1c60de-5225-482d-a0c1-8081f708ed99" not found in QueryVolume
I0422 23:25:51.526304       1 csi_handler.go:607] Saving detach error to "csi-17c07cd0c45986e88f9d5b1648ca024f9dca183897845cdb4a591de3e17c60d4"
I0422 23:25:51.526340       1 csi_handler.go:618] Saved detach error to "csi-5e8922476ef04758f8a871589d4c12c18fb9dd999fcbfdcf9dff4e40477fe085"
I0422 23:25:51.526362       1 csi_handler.go:234] Error processing "csi-5e8922476ef04758f8a871589d4c12c18fb9dd999fcbfdcf9dff4e40477fe085": failed to detach: rpc error: code = Internal desc = volumeID "9bd79636-37f4-4022-90ef-56109d9aac1f" not found in QueryVolume
I0422 23:25:51.527388       1 controller.go:167] Ignoring VolumeAttachment "csi-5e8922476ef04758f8a871589d4c12c18fb9dd999fcbfdcf9dff4e40477fe085" change
I0422 23:25:51.529637       1 csi_handler.go:618] Saved detach error to "csi-9972fa01b2c76c5dce6ca2a68202eea7f3c2a4ca99c6aad0dcbff9755db9cf79"
I0422 23:25:51.529673       1 csi_handler.go:234] Error processing "csi-9972fa01b2c76c5dce6ca2a68202eea7f3c2a4ca99c6aad0dcbff9755db9cf79": failed to detach: rpc error: code = Internal desc = volumeID "0f472b9d-b6ac-4c48-9f37-4aa963b41154" not found in QueryVolume
I0422 23:25:51.530407       1 controller.go:167] Ignoring VolumeAttachment "csi-9972fa01b2c76c5dce6ca2a68202eea7f3c2a4ca99c6aad0dcbff9755db9cf79" change
I0422 23:25:51.530998       1 csi_handler.go:618] Saved detach error to "csi-60dadb2af1c06faf14306ee6cbad41d4202178189ac514a2f776f5b66a9dd129"
I0422 23:25:51.531043       1 csi_handler.go:234] Error processing "csi-60dadb2af1c06faf14306ee6cbad41d4202178189ac514a2f776f5b66a9dd129": failed to detach: rpc error: code = Internal desc = volumeID "d6184ece-e53a-4851-95dd-8c7a9605842b" not found in QueryVolume
I0422 23:25:51.532438       1 csi_handler.go:618] Saved detach error to "csi-17c07cd0c45986e88f9d5b1648ca024f9dca183897845cdb4a591de3e17c60d4"
I0422 23:25:51.532471       1 csi_handler.go:234] Error processing "csi-17c07cd0c45986e88f9d5b1648ca024f9dca183897845cdb4a591de3e17c60d4": failed to detach: rpc error: code = Internal desc = volumeID "b436a8f2-5c26-4243-8355-15699d91d5c6" not found in QueryVolume
I0422 23:25:51.533601       1 controller.go:167] Ignoring VolumeAttachment "csi-17c07cd0c45986e88f9d5b1648ca024f9dca183897845cdb4a591de3e17c60d4" change
I0422 23:25:54.565248       1 csi_handler.go:123] Reconciling VolumeAttachments with driver backend state
E0422 23:25:54.578944       1 controller.go:146] Failed to reconcile volume attachments: failed to ListVolumes: failed to list volumes: rpc error: code = FailedPrecondition desc = difference between number of K8s volumes and CNS volumes is greater than threshold.
I0422 23:26:44.549739       1 reflector.go:559] k8s.io/client-go/informers/factory.go:150: Watch close - *v1.CSINode total 4 items received
I0422 23:26:54.581024       1 csi_handler.go:123] Reconciling VolumeAttachments with driver backend state
E0422 23:26:54.594484       1 controller.go:146] Failed to reconcile volume attachments: failed to ListVolumes: failed to list volumes: rpc error: code = FailedPrecondition desc = difference between number of K8s volumes and CNS volumes is greater than threshold.
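
The FailedPrecondition above comes from the driver's ListVolumes sanity check: it compares the number of vSphere CSI volumes known to Kubernetes with the number of volumes CNS returns, and refuses to reconcile when the difference exceeds a threshold (54 Kubernetes volumes vs. 0 CNS volumes here, against a threshold of 50, per the controller log below). A rough way to reproduce the two counts (a sketch; assumes kubectl, jq, and govc pointed at the same cluster and vCenter):

    # Volumes provisioned by the vSphere CSI driver, as Kubernetes sees them
    kubectl get pv -o json \
      | jq '[.items[] | select(.spec.csi.driver == "csi.vsphere.vmware.com")] | length'

    # Container volumes as CNS on the vCenter side sees them
    govc volume.ls | wc -l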

csi-controller logs:

{"level":"error","time":"2024-04-22T23:25:51.521555326Z","caller":"vanilla/controller.go:2215","msg":"volumeID \"0f472b9d-b6ac-4c48-9f37-4aa963b41154\" not found in QueryVolume","TraceId":"512f0c70-c74c-4e41-bff7-ea2d6dc2be21","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume.func1\n\t/build/pkg/csi/service/vanilla/controller.go:2215\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume\n\t/build/pkg/csi/service/vanilla/controller.go:2286\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5725\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"error","time":"2024-04-22T23:25:51.521673344Z","caller":"vanilla/controller.go:2289","msg":"Operation failed, reporting failure status to Prometheus. Operation Type: \"detach-volume\", Volume Type: \"unknown\", Fault Type: \"csi.fault.Internal\"","TraceId":"512f0c70-c74c-4e41-bff7-ea2d6dc2be21","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume\n\t/build/pkg/csi/service/vanilla/controller.go:2289\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5725\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"error","time":"2024-04-22T23:25:51.524805416Z","caller":"vanilla/controller.go:2215","msg":"volumeID \"b436a8f2-5c26-4243-8355-15699d91d5c6\" not found in QueryVolume","TraceId":"b1d8b022-30c6-4d6a-8b93-3e58502fe953","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume.func1\n\t/build/pkg/csi/service/vanilla/controller.go:2215\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume\n\t/build/pkg/csi/service/vanilla/controller.go:2286\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5725\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"error","time":"2024-04-22T23:25:51.52607625Z","caller":"vanilla/controller.go:2289","msg":"Operation failed, reporting failure status to Prometheus. Operation Type: \"detach-volume\", Volume Type: \"unknown\", Fault Type: \"csi.fault.Internal\"","TraceId":"b1d8b022-30c6-4d6a-8b93-3e58502fe953","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume\n\t/build/pkg/csi/service/vanilla/controller.go:2289\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5725\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"info","time":"2024-04-22T23:25:54.568463081Z","caller":"config/config.go:404","msg":"No Net Permissions given in Config. Using default permissions.","TraceId":"cb048756-458a-40a8-8988-9f3c6cbf74dc"}
{"level":"error","time":"2024-04-22T23:25:54.578449047Z","caller":"vanilla/controller.go:2519","msg":"difference between number of K8s volumes: 54, and CNS volumes: 0, is greater than threshold: 50, and completely out of sync.","TraceId":"cb048756-458a-40a8-8988-9f3c6cbf74dc","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes.func1\n\t/build/pkg/csi/service/vanilla/controller.go:2519\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes\n\t/build/pkg/csi/service/vanilla/controller.go:2557\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ListVolumes_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5761\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"error","time":"2024-04-22T23:25:54.578566327Z","caller":"vanilla/controller.go:2522","msg":"difference between number of K8s volumes and CNS volumes is greater than threshold.","TraceId":"cb048756-458a-40a8-8988-9f3c6cbf74dc","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes.func1\n\t/build/pkg/csi/service/vanilla/controller.go:2522\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes\n\t/build/pkg/csi/service/vanilla/controller.go:2557\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ListVolumes_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5761\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"error","time":"2024-04-22T23:25:54.578630482Z","caller":"vanilla/controller.go:2560","msg":"Operation failed, reporting failure status to Prometheus. Operation Type: \"list-volume\", Volume Type: \"unknown\", Fault Type: \"csi.fault.Internal\"","TraceId":"cb048756-458a-40a8-8988-9f3c6cbf74dc","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes\n\t/build/pkg/csi/service/vanilla/controller.go:2560\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ListVolumes_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5761\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"info","time":"2024-04-22T23:26:54.582476526Z","caller":"config/config.go:404","msg":"No Net Permissions given in Config. Using default permissions.","TraceId":"bca4ef1a-7b86-4cd8-80e6-64c186d8c0da"}
{"level":"error","time":"2024-04-22T23:26:54.593711122Z","caller":"vanilla/controller.go:2519","msg":"difference between number of K8s volumes: 54, and CNS volumes: 0, is greater than threshold: 50, and completely out of sync.","TraceId":"bca4ef1a-7b86-4cd8-80e6-64c186d8c0da","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes.func1\n\t/build/pkg/csi/service/vanilla/controller.go:2519\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes\n\t/build/pkg/csi/service/vanilla/controller.go:2557\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ListVolumes_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5761\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"error","time":"2024-04-22T23:26:54.593825996Z","caller":"vanilla/controller.go:2522","msg":"difference between number of K8s volumes and CNS volumes is greater than threshold.","TraceId":"bca4ef1a-7b86-4cd8-80e6-64c186d8c0da","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes.func1\n\t/build/pkg/csi/service/vanilla/controller.go:2522\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes\n\t/build/pkg/csi/service/vanilla/controller.go:2557\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ListVolumes_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5761\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"error","time":"2024-04-22T23:26:54.593916529Z","caller":"vanilla/controller.go:2560","msg":"Operation failed, reporting failure status to Prometheus. Operation Type: \"list-volume\", Volume Type: \"unknown\", Fault Type: \"csi.fault.Internal\"","TraceId":"bca4ef1a-7b86-4cd8-80e6-64c186d8c0da","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes\n\t/build/pkg/csi/service/vanilla/controller.go:2560\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ListVolumes_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5761\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"info","time":"2024-04-22T23:27:54.598308233Z","caller":"config/config.go:404","msg":"No Net Permissions given in Config. Using default permissions.","TraceId":"8b8e00f8-1a4c-4c58-aa30-f20235996203"}
{"level":"error","time":"2024-04-22T23:27:54.609811978Z","caller":"vanilla/controller.go:2519","msg":"difference between number of K8s volumes: 54, and CNS volumes: 0, is greater than threshold: 50, and completely out of sync.","TraceId":"8b8e00f8-1a4c-4c58-aa30-f20235996203","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes.func1\n\t/build/pkg/csi/service/vanilla/controller.go:2519\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes\n\t/build/pkg/csi/service/vanilla/controller.go:2557\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ListVolumes_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5761\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"error","time":"2024-04-22T23:27:54.609907854Z","caller":"vanilla/controller.go:2522","msg":"difference between number of K8s volumes and CNS volumes is greater than threshold.","TraceId":"8b8e00f8-1a4c-4c58-aa30-f20235996203","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes.func1\n\t/build/pkg/csi/service/vanilla/controller.go:2522\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes\n\t/build/pkg/csi/service/vanilla/controller.go:2557\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ListVolumes_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5761\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"error","time":"2024-04-22T23:27:54.609957844Z","caller":"vanilla/controller.go:2560","msg":"Operation failed, reporting failure status to Prometheus. Operation Type: \"list-volume\", Volume Type: \"unknown\", Fault Type: \"csi.fault.Internal\"","TraceId":"8b8e00f8-1a4c-4c58-aa30-f20235996203","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes\n\t/build/pkg/csi/service/vanilla/controller.go:2560\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ListVolumes_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5761\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"info","time":"2024-04-22T23:28:01.648430291Z","caller":"common/authmanager.go:279","msg":"No vSAN datastores found for vCenter \"pvcn01.sof.intra\"","TraceId":"c3ceb22e-97e0-4f34-b1a0-5e6b695b80fb"}
{"level":"info","time":"2024-04-22T23:28:01.648484693Z","caller":"common/authmanager.go:183","msg":"auth manager: newFsEnabledClusterToDsMap is updated to map[] for vCenter \"pvcn01.sof.intra\"","TraceId":"c3ceb22e-97e0-4f34-b1a0-5e6b695b80fb"}
{"level":"info","time":"2024-04-22T23:28:01.661403831Z","caller":"common/authmanager.go:163","msg":"auth manager: datastoreMapForBlockVolumes is updated to map[ds:///vmfs/volumes/65e9caaa-47491870-6801-d06726c49670/:Datastore: Datastore:datastore-221625, datastore URL: ds:///vmfs/volumes/65e9caaa-47491870-6801-d06726c49670/ ds:///vmfs/volumes/65fce6e8-53512aaa-a749-d06726c49670/:Datastore: Datastore:datastore-223172, datastore URL: ds:///vmfs/volumes/65fce6e8-53512aaa-a749-d06726c49670/ ds:///vmfs/volumes/66220883-eb34d14e-f70f-d06726c49670/:Datastore: Datastore:datastore-229535, datastore URL: ds:///vmfs/volumes/66220883-eb34d14e-f70f-d06726c49670/ ds:///vmfs/volumes/6622d6ec-b0d08c4a-5469-d06726c49670/:Datastore: Datastore:datastore-229556, datastore URL: ds:///vmfs/volumes/6622d6ec-b0d08c4a-5469-d06726c49670/] for vCenter \"pvcn01.sof.intra\"","TraceId":"93b7ed52-ed2c-48fd-be65-a31b77396e7f"}
{"level":"info","time":"2024-04-22T23:28:01.662262621Z","caller":"k8sorchestrator/topology.go:239","msg":"Refreshing preferred datastores information...","TraceId":"349eb887-de9f-4863-aa8b-75c673449a3a"}
{"level":"info","time":"2024-04-22T23:28:01.662975948Z","caller":"config/config.go:404","msg":"No Net Permissions given in Config. Using default permissions.","TraceId":"349eb887-de9f-4863-aa8b-75c673449a3a"}
{"level":"info","time":"2024-04-22T23:28:01.663038928Z","caller":"vsphere/utils.go:265","msg":"Defaulting timeout for vCenter Client to 5 minutes","TraceId":"349eb887-de9f-4863-aa8b-75c673449a3a"}
{"level":"warn","time":"2024-04-22T23:28:02.11051213Z","caller":"common/topology.go:159","msg":"failed to retrieve tags for category \"cns.vmware.topology-preferred-datastores\" in vCenter \"pvcn01.sof.intra\". Reason: GET https://pvcn01.sof.intra:443/rest/com/vmware/cis/tagging/category/id:cns.vmware.topology-preferred-datastores: 404 Not Found","TraceId":"349eb887-de9f-4863-aa8b-75c673449a3a"}

Environment:

  • csi-vsphere version: 3.0.3rc1
  • vsphere-cloud-controller-manager version: 1.24
  • Kubernetes version: 1.24.17
  • vSphere version: 8.0.2
  • OS (e.g. from /etc/os-release): CentOS 7.9
  • Kernel (e.g. uname -a): 3.10.0-1160.11.1.el7.x86_64
  • Install tools: Helm chart + manifests
  • Others:
@divyenpatel divyenpatel added the kind/bug Categorizes issue or PR as related to a bug. label Apr 24, 2024

NelSat commented Apr 25, 2024

Update: after a day or so, newly created volumes also disappeared from the vCenter container volumes list.


sathieu commented Jul 3, 2024

Probably the same as #2802.

This may be related to csi-provisioner enabling topology support by default: kubernetes-csi/external-provisioner#1167
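
A quick way to check whether the provisioner sidecar is running with topology enabled (a sketch; vmware-system-csi and vsphere-csi-controller are the upstream default names and may differ in a Rancher install):

    # Print the csi-provisioner sidecar's args and look for the Topology feature gate,
    # e.g. --feature-gates=Topology=true
    kubectl -n vmware-system-csi get deployment vsphere-csi-controller \
      -o jsonpath='{.spec.template.spec.containers[?(@.name=="csi-provisioner")].args}'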


sathieu commented Jul 3, 2024

It seems that topology support is now enabled by default, but according to the docs:

If you already use vSphere Container Storage Plug-in to run applications, but haven't used the topology feature, you must re-create the entire cluster and delete any existing PVCs in the system to be able to use the topology feature.

So this looks like a breaking change. Can we disable topology with topology-preferential-datastores=false?

@divyenpatel any idea?


NelSat commented Jul 3, 2024

Update: after trying many different things to recover the volume catalog in vCenter, we gave up and manually migrated our data from the broken vSphere volumes to other kinds of storage. We then did a full purge of the vSphere CPI/CSI and all the vSphere volumes from that cluster, and re-installed everything using the Rancher Helm charts. So far it's working. We had no idea the topology settings were the root cause, since we never used the feature and none of the error messages we found pointed us that way.


agowa commented Sep 2, 2024

Hi, we hit this issue today after enabling topology to work around another issue.

The fingerprint of this problem is basically No vSAN datastores found for vCenter in the csi-controller logs even though it should work. After that, check your node labels using this command: kubectl get nodes -o yaml | grep -v "name: kubernetes.io/csi" - | grep -e " name: " -e "topology.csi.vmware.com" -
The output should be:

        topology.csi.vmware.com/cluster-category: yourClusterCategoryTag
        topology.csi.vmware.com/datacenter-category: yourDatacenterCategoryTag
    name: nodeName001

What we found was that these labels were not set for some nodes (NOTE: DO NOT SET THEM MANUALLY).

The issue can be fixed by the following (a combined sketch of steps 2 and 3 follows after the list):

  1. Ensure you've set the tags correctly and that the nodes are in fact on a host within the cluster and/or datacenter where the tag is defined.
  2. Restart all of the vsphere-csi-node pods, e.g. kubectl -n vmware-system-csi rollout restart daemonset vsphere-csi-node
  3. Check your node labels again. Now all of them should be set properly. If not, you may have permission issues in your cluster and the vsphere-csi-node pods were unable to update the labels, or they failed to query them from vSphere. (NOTE: don't worry about the https://vsphere.fqdn/rest/com/vmware/cis/tagging/category/id:cns.vmware.topology-preferred-datastores: 404 Not Found error; it's unrelated to this issue and things will work regardless.)
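
A compact version of steps 2 and 3 (a sketch; assumes the upstream vmware-system-csi namespace and jq installed):

    # Restart the node daemonset so each vsphere-csi-node pod re-registers its
    # topology labels, then wait for the rollout to finish
    kubectl -n vmware-system-csi rollout restart daemonset vsphere-csi-node
    kubectl -n vmware-system-csi rollout status daemonset vsphere-csi-node

    # Print each node together with its topology.csi.vmware.com/* labels
    kubectl get nodes -o json \
      | jq -r '.items[]
          | .metadata.name + "\t"
            + (.metadata.labels
               | with_entries(select(.key | startswith("topology.csi.vmware.com")))
               | tostring)'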
