[BUG]: Isolation Mechanism for CSI Driver in Shared Storage Pool environments #1606

Open
Bvreela opened this issue Nov 22, 2024 · 4 comments
Labels: needs-triage, type/bug

Bvreela commented Nov 22, 2024

Bug Description

We encountered a critical issue with the CSI driver for PowerFlex storage. A faulty backup script caused the driver to fill a storage pool with persistent volume snapshots, taking the pool offline. This poses a significant risk in environments where a shared storage pool is used. For example, if this had occurred on a 10PB PowerMax system with a Kubernetes cluster intended to use only 500TB, it could have disrupted thousands of VMs relying on the remaining 9.5PB because of the lack of boundaries.

The CSI driver has direct control over the Storage Resource Pool (SRP), and currently, there appears to be no mechanism to isolate the driver to a specific part of the storage pool without creating a new SRP.

Logs

A snapshot overrun looks like this; without a cleanup mechanism, it can quickly consume all available capacity.
k8s-7aa62eb054 44671607000001b2 198.18.184.22 f10b6eb600000180 16777216 ThinProvisioned
sn-8b12755c-b3a8-4a03-87a5-686e 446719ba000001e6 None f10b6eb600000180 16777216 Snapshot
sn-115352d5-c642-4952-83d8-834d 44671e8300000218 None f10b6eb600000180 16777216 Snapshot
sn-e3fda0ea-38f0-453d-ba7e-84b1 4467213d00000211 None f10b6eb600000180 16777216 Snapshot
sn-1361c500-6264-4600-b851-6a17 446721e400000216 None f10b6eb600000180 16777216 Snapshot
sn-6938be33-1ef7-488a-89a7-3e20 4467245f000001cf None f10b6eb600000180 16777216 Snapshot
sn-e6cf7c1c-28ce-4160-96d3-e084 44672670000002c0 None f10b6eb600000180 16777216 Snapshot
sn-7539ff51-e4a8-4fde-8f4f-ad40 446727b200000308 None f10b6eb600000180 16777216 Snapshot
sn-08b93f64-c835-4f29-9353-506f 4467296b00000349 None f10b6eb600000180 16777216 Snapshot
sn-4ca01182-ddab-4ac7-a633-1d44 44672c5c0000035e None f10b6eb600000180 16777216 Snapshot
sn-94997ed0-4405-439d-b40b-a067 44672e8100000282 None f10b6eb600000180 16777216 Snapshot

Screenshots

No response

Additional Environment Information

No response

Steps to Reproduce

1. Configure the CSI driver with access to a shared storage pool.
2. Use a backup script or process that creates persistent volume snapshots in an unregulated manner.
3. Observe the storage pool's behavior as it reaches capacity.

Expected Behavior

The entire shared storage pool will reach capacity, taking all storage offline, not just the container environment.

CSM Driver(s)

v1.12.0

Installation Type

No response

Container Storage Modules Enabled

No response

Container Orchestrator

Kubernetes

Operating System

Ubuntu 22.04 LTS 5.15.0-101-generic

Bvreela added the needs-triage and type/bug labels Nov 22, 2024
Bvreela changed the title from "[BUG]: Isolation Mechanism for CSI Driver in Shared Storage Pool enviorments" to "[BUG]: Isolation Mechanism for CSI Driver in Shared Storage Pool environments" Nov 22, 2024
Collaborator

csmbot commented Nov 22, 2024

@Bvreela: Thank you for submitting this issue!

The issue is currently awaiting triage. Please make sure you have given us as much context as possible.

If the maintainers determine this is a relevant issue, they will remove the needs-triage label and respond appropriately.


We want your feedback! If you have any questions or suggestions regarding our contributing process/workflow, please reach out to us at container.storage.modules@dell.com.

atye self-assigned this Nov 22, 2024
Contributor

atye commented Nov 22, 2024

Hi @Bvreela. The driver executing the storage requests it receives, whether they were intended or came from a faulty script, is normal behavior. Your first sentence mentions PowerFlex. Are you using PowerFlex, PowerMax, or both?

If you would like to enforce quota limits, I suggest you have a look at CSM Authorization v2. It supports setting quota limits for a PowerFlex array at the storage pool level and for PowerMax at the SRP level.

https://dell.github.io/csm-docs/docs/authorization/v2.x/
https://dell.github.io/csm-docs/docs/deployment/csmoperator/modules/authorization-v2.0/
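
To make the idea of a quota limit on a shared pool concrete, here is a minimal Python sketch of the general technique. It is not CSM Authorization's implementation or API: the `PoolQuota` class, `QuotaExceeded`, and `run_faulty_backup` are hypothetical names invented for illustration. It also assumes, as a simplification, that every snapshot is charged at its full nominal size, and it reuses the 10 PB pool and 500 TB allowance from the bug description.

```python
from dataclasses import dataclass

TB = 10**12
PB = 10**15


class QuotaExceeded(Exception):
    """Raised when a request would push the tenant past its quota."""


@dataclass
class PoolQuota:
    """Hypothetical per-tenant quota on a shared pool (illustration only)."""
    pool_capacity: int    # total bytes in the shared pool
    tenant_quota: int     # bytes this tenant (e.g. one cluster) may consume
    tenant_used: int = 0  # bytes the tenant has consumed so far

    def reserve(self, size: int) -> None:
        """Admit a volume or snapshot request only if it fits in the quota."""
        if size <= 0:
            raise ValueError("size must be positive")
        if self.tenant_used + size > self.tenant_quota:
            raise QuotaExceeded(
                f"request for {size} bytes exceeds quota of {self.tenant_quota}"
            )
        self.tenant_used += size

    @property
    def protected(self) -> int:
        """Capacity this tenant can never consume, whatever it requests."""
        return self.pool_capacity - self.tenant_quota


def run_faulty_backup(quota: PoolQuota, snapshot_size: int, attempts: int) -> int:
    """Simulate a script that keeps snapshotting; return how many succeeded."""
    created = 0
    for _ in range(attempts):
        try:
            quota.reserve(snapshot_size)
        except QuotaExceeded:
            break  # the quota layer rejects the request; the pool stays online
        created += 1
    return created


if __name__ == "__main__":
    quota = PoolQuota(pool_capacity=10 * PB, tenant_quota=500 * TB)
    created = run_faulty_backup(quota, snapshot_size=16 * TB, attempts=1000)
    print(created, quota.tenant_used // TB, quota.protected // TB)
```

With a check like this in the request path, a runaway script is stopped at the tenant's allowance, so the rest of the pool is never at risk from that tenant.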

shanmydell added this to the v1.13.0 milestone Nov 22, 2024
Collaborator

shanmydell commented

@atye: It is for PowerMax.

Collaborator

shanmydell commented

@Bvreela: Does the above response help?
