[BUG]: Isolation Mechanism for CSI Driver in Shared Storage Pool environments #1606
Labels
needs-triage
type/bug
Bug Description
We encountered a critical issue with the CSI driver for PowerFlex storage. A faulty backup script caused the driver to fill a storage pool with persistent volume snapshots, which took the pool offline. This is a significant risk wherever a storage pool is shared with other workloads. For example, had this occurred on a 10 PB PowerMax system where the Kubernetes cluster was meant to use only 500 TB, it could have disrupted thousands of VMs relying on the remaining 9.5 PB, because nothing limits how much of the pool the driver can consume.
The CSI driver has direct control over the Storage Resource Pool (SRP), and there currently appears to be no way to restrict the driver to a specific part of the pool without creating a new SRP.
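To make the request concrete, here is a minimal sketch of the kind of boundary we have in mind. It is not an existing driver feature: `PoolQuota`, `limit_bytes`, and `try_reserve` are hypothetical names, and it counts provisioned size only, which simplifies how thin volumes and snapshots actually consume capacity.

```python
from dataclasses import dataclass


@dataclass
class PoolQuota:
    """Hypothetical per-cluster capacity budget inside a shared pool."""

    limit_bytes: int
    used_bytes: int = 0

    def try_reserve(self, request_bytes: int) -> bool:
        """Reserve capacity if the request fits in the budget, else refuse."""
        if request_bytes < 0:
            raise ValueError("request_bytes must be non-negative")
        if self.used_bytes + request_bytes > self.limit_bytes:
            return False  # refuse instead of eating into the shared pool
        self.used_bytes += request_bytes
        return True


TB = 10**12  # decimal terabyte, matching the 10 PB / 500 TB example
quota = PoolQuota(limit_bytes=500 * TB)
print(quota.try_reserve(400 * TB))  # fits within the 500 TB budget
print(quota.try_reserve(200 * TB))  # would exceed the budget, so it is refused
```

With a check like this in the volume and snapshot creation path, a runaway script would start failing at the 500 TB budget instead of exhausting the whole pool.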
Logs
The snapshot overrun looks like the listing below. Without a cleanup mechanism, snapshots like these can quickly consume all available capacity.
k8s-7aa62eb054 44671607000001b2 198.18.184.22 f10b6eb600000180 16777216 ThinProvisioned
sn-8b12755c-b3a8-4a03-87a5-686e 446719ba000001e6 None f10b6eb600000180 16777216 Snapshot
sn-115352d5-c642-4952-83d8-834d 44671e8300000218 None f10b6eb600000180 16777216 Snapshot
sn-e3fda0ea-38f0-453d-ba7e-84b1 4467213d00000211 None f10b6eb600000180 16777216 Snapshot
sn-1361c500-6264-4600-b851-6a17 446721e400000216 None f10b6eb600000180 16777216 Snapshot
sn-6938be33-1ef7-488a-89a7-3e20 4467245f000001cf None f10b6eb600000180 16777216 Snapshot
sn-e6cf7c1c-28ce-4160-96d3-e084 44672670000002c0 None f10b6eb600000180 16777216 Snapshot
sn-7539ff51-e4a8-4fde-8f4f-ad40 446727b200000308 None f10b6eb600000180 16777216 Snapshot
sn-08b93f64-c835-4f29-9353-506f 4467296b00000349 None f10b6eb600000180 16777216 Snapshot
sn-4ca01182-ddab-4ac7-a633-1d44 44672c5c0000035e None f10b6eb600000180 16777216 Snapshot
sn-94997ed0-4405-439d-b40b-a067 44672e8100000282 None f10b6eb600000180 16777216 Snapshot
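For anyone wanting to watch for this condition, the following is a rough sketch of how a listing like the one above could be tallied by type. It assumes the six whitespace-separated columns are name, volume ID, mapped host, storage pool ID, size, and type; the listing itself carries no header, so that mapping is a guess, and the size unit is unknown and left uninterpreted. The `summarize` helper is hypothetical.

```python
from collections import Counter

# A few rows copied from the listing above.
LISTING = """\
k8s-7aa62eb054 44671607000001b2 198.18.184.22 f10b6eb600000180 16777216 ThinProvisioned
sn-8b12755c-b3a8-4a03-87a5-686e 446719ba000001e6 None f10b6eb600000180 16777216 Snapshot
sn-115352d5-c642-4952-83d8-834d 44671e8300000218 None f10b6eb600000180 16777216 Snapshot
"""


def summarize(listing: str) -> dict:
    """Count rows and sum the size column per volume type (assumed column 6)."""
    counts = Counter()
    sizes = Counter()
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed rows
        size, vol_type = int(fields[4]), fields[5]
        counts[vol_type] += 1
        sizes[vol_type] += size  # raw units, as reported in the listing
    return {"counts": dict(counts), "sizes": dict(sizes)}


print(summarize(LISTING))
```

A steadily growing `Snapshot` count against a single source volume is the pattern that preceded the pool filling up in our case.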
Screenshots
No response
Additional Environment Information
No response
Steps to Reproduce
1. Configure the CSI driver with access to a shared storage pool.
2. Run a backup script or process that creates persistent volume snapshots without any limit or cleanup.
3. Observe the storage pool's behavior as it reaches capacity.
Expected Behavior
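Currently, the entire shared storage pool reaches capacity, taking all storage offline, not just the container environment. The driver should instead be confinable to a bounded part of the pool, so that a runaway workload affects only the container environment.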
CSM Driver(s)
v1.12.0
Installation Type
No response
Container Storage Modules Enabled
No response
Container Orchestrator
Kubernetes
Operating System
Ubuntu 22.04 LTS 5.15.0-101-generic