[BUG]: CSI-PowerFlex entering boot loop when array has long response times #1639

Open
lukeatdell opened this issue Dec 12, 2024 · 0 comments
Assignees: lukeatdell
Labels: area/csi-powerflex (Issue pertains to the CSI Driver for Dell EMC PowerFlex), type/bug (Something isn't working. This is the default label associated with a bug issue.)

Bug Description

When installing the csi-powerflex driver (v2.12.0) with multiple PowerFlex arrays provided in the secret, if one of the arrays is unreachable and takes a long time to respond, the driver controller gets stuck in a boot loop while trying to authenticate with the unreachable array.

Specifically, the issue occurs when one of the arrays does not respond before the timeout set on the Kubernetes sidecar containers in the driver deployment workload (the --timeout=120s entry in .spec.template.spec.containers[].args). If the timeout is not specified, the default is 15s.
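
To make the timeout rule above concrete, here is a minimal sketch (not part of the driver or the sidecars) of how the effective timeout could be worked out from a container's args list. The helper names parse_duration and effective_timeout are hypothetical, the parser only handles simple Go-style durations, and the 15s fallback is the default stated in this issue.

```python
import re

# Default stated in this issue for a sidecar started without --timeout.
DEFAULT_TIMEOUT_SECONDS = 15.0

_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_duration(text):
    """Parse a simple Go-style duration ('120s', '2m', '1m30s') into seconds."""
    parts = re.findall(r"(\d+(?:\.\d+)?)(ms|s|m|h)", text)
    # Reject input that is not made up entirely of <number><unit> pairs.
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(float(num) * _UNIT_SECONDS[unit] for num, unit in parts)


def effective_timeout(args):
    """Return the timeout in seconds implied by a container's args list."""
    for arg in args:
        if arg.startswith("--timeout="):
            return parse_duration(arg.split("=", 1)[1])
    return DEFAULT_TIMEOUT_SECONDS
```

With the deployment described here, effective_timeout(["--timeout=120s"]) gives 120 seconds, so an array that needs longer than that to answer would trigger the behavior reported in this issue.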

Logs

The logs below show 27 Probe requests being issued before the first reply is returned.
vxflexos-controller - driver container logs:

time="2024-12-12T21:37:10Z" level=info msg="/csi.v1.Identity/Probe: REQ 0024: XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2024-12-12T21:37:10Z" level=debug msg="Probe called"
time="2024-12-12T21:37:10Z" level=debug msg=systemProbe
time="2024-12-12T21:37:10Z" level=info msg="Probing all arrays. Number of arrays: 3"
time="2024-12-12T21:37:10Z" level=info msg="vol id in UpdateVolumePrefixToSystemsMap is: c6d03fd700000407  from systemID: <array 2> \n"
time="2024-12-12T21:37:10Z" level=info msg="volumePrefixToSystems: systemID: <array 2>  already added for key c6d. Not adding for key again. \n"
time="2024-12-12T21:37:10Z" level=info msg="array <array 2> probed successfully"
time="2024-12-12T21:37:10Z" level=info msg="vol id in UpdateVolumePrefixToSystemsMap is: 70a19854000000d4  from systemID: <array 1> \n"
time="2024-12-12T21:37:10Z" level=info msg="volumePrefixToSystems: systemID: <array 1>  already added for key 70a. Not adding for key again. \n"
time="2024-12-12T21:37:10Z" level=info msg="array <array 1> probed successfully"
time="2024-12-12T21:37:11Z" level=info msg="/csi.v1.Identity/Probe: REQ 0025: XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2024-12-12T21:37:11Z" level=debug msg="Probe called"
time="2024-12-12T21:37:11Z" level=debug msg=systemProbe
time="2024-12-12T21:37:11Z" level=info msg="Probing all arrays. Number of arrays: 3"
time="2024-12-12T21:37:11Z" level=info msg="vol id in UpdateVolumePrefixToSystemsMap is: c6d03fd700000407  from systemID: <array 2> \n"
time="2024-12-12T21:37:11Z" level=info msg="volumePrefixToSystems: systemID: <array 2>  already added for key c6d. Not adding for key again. \n"
time="2024-12-12T21:37:11Z" level=info msg="array <array 2> probed successfully"
time="2024-12-12T21:37:11Z" level=info msg="vol id in UpdateVolumePrefixToSystemsMap is: 70a19854000000d4  from systemID: <array 1> \n"
time="2024-12-12T21:37:11Z" level=info msg="volumePrefixToSystems: systemID: <array 1>  already added for key 70a. Not adding for key again. \n"
time="2024-12-12T21:37:11Z" level=info msg="array <array 1> probed successfully"
time="2024-12-12T21:37:12Z" level=info msg="/csi.v1.Identity/Probe: REQ 0026: XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2024-12-12T21:37:12Z" level=debug msg="Probe called"
time="2024-12-12T21:37:12Z" level=debug msg=systemProbe
time="2024-12-12T21:37:12Z" level=info msg="Probing all arrays. Number of arrays: 3"
time="2024-12-12T21:37:12Z" level=info msg="/csi.v1.Identity/Probe: REQ 0027: XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2024-12-12T21:37:12Z" level=debug msg="Probe called"
time="2024-12-12T21:37:12Z" level=debug msg=systemProbe
time="2024-12-12T21:37:12Z" level=info msg="Probing all arrays. Number of arrays: 3"
time="2024-12-12T21:37:12Z" level=info msg="vol id in UpdateVolumePrefixToSystemsMap is: 70a19854000000d4  from systemID: <array 1> \n"
time="2024-12-12T21:37:12Z" level=info msg="volumePrefixToSystems: systemID: <array 1>  already added for key 70a. Not adding for key again. \n"
time="2024-12-12T21:37:12Z" level=info msg="array <array 1> probed successfully"
time="2024-12-12T21:37:22Z" level=error msg="array <bad array> probe failed: rpc error: code = FailedPrecondition desc = unable to login to VxFlexOS Gateway: Get \"https://<bad array ip>/api/login\": dial tcp <bad array ip>:443: connect: connection timed out"
time="2024-12-12T21:37:22Z" level=error msg="array <bad array> probe failed: rpc error: code = FailedPrecondition desc = unable to login to VxFlexOS Gateway: Get \"https://<bad array ip>/api/login\": dial tcp <bad array ip>:443: connect: connection timed out"
time="2024-12-12T21:37:22Z" level=error msg="array <bad array> probe failed: rpc error: code = FailedPrecondition desc = unable to login to VxFlexOS Gateway: Get \"https://<bad array ip>/api/login\": dial tcp <bad array ip>:443: connect: connection timed out"
time="2024-12-12T21:37:22Z" level=debug msg="Probe returning: true"
time="2024-12-12T21:37:22Z" level=error msg="array <bad array> probe failed: rpc error: code = FailedPrecondition desc = unable to login to VxFlexOS Gateway: Get \"https://<bad array ip>/api/login\": dial tcp <bad array ip>:443: connect: connection timed out"
time="2024-12-12T21:37:22Z" level=debug msg="Probe returning: true"
time="2024-12-12T21:37:22Z" level=info msg="/csi.v1.Identity/Probe: REP 0002: Ready=value:true, XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2024-12-12T21:37:22Z" level=info msg="/csi.v1.Identity/Probe: REP 0005: Ready=value:true, XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2024-12-12T21:37:22Z" level=info msg="vol id in UpdateVolumePrefixToSystemsMap is: c6d03fd700000407  from systemID: <array 2> \n"
time="2024-12-12T21:37:22Z" level=info msg="volumePrefixToSystems: systemID: <array 2>  already added for key c6d. Not adding for key again. \n"
time="2024-12-12T21:37:22Z" level=info msg="array <array 2> probed successfully"
time="2024-12-12T21:37:22Z" level=info msg="vol id in UpdateVolumePrefixToSystemsMap is: c6d03fd700000407  from systemID: <array 2> \n"
time="2024-12-12T21:37:22Z" level=info msg="volumePrefixToSystems: systemID: <array 2>  already added for key c6d. Not adding for key again. \n"
time="2024-12-12T21:37:22Z" level=info msg="array <array 2> probed successfully"
time="2024-12-12T21:37:22Z" level=info msg="vol id in UpdateVolumePrefixToSystemsMap is: 70a19854000000d4  from systemID: <array 1> \n"
time="2024-12-12T21:37:22Z" level=info msg="volumePrefixToSystems: systemID: <array 1>  already added for key 70a. Not adding for key again. \n"
time="2024-12-12T21:37:22Z" level=info msg="array <array 1> probed successfully"
time="2024-12-12T21:37:22Z" level=debug msg="Probe returning: true"
time="2024-12-12T21:37:22Z" level=info msg="/csi.v1.Identity/Probe: REP 0001: Ready=value:true, XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2024-12-12T21:37:22Z" level=info msg="vol id in UpdateVolumePrefixToSystemsMap is: 70a19854000000d4  from systemID: <array 1> \n"
time="2024-12-12T21:37:22Z" level=info msg="volumePrefixToSystems: systemID: <array 1>  already added for key 70a. Not adding for key again. \n"
time="2024-12-12T21:37:22Z" level=info msg="array <array 1> probed successfully"
time="2024-12-12T21:37:22Z" level=debug msg="Probe returning: true"
time="2024-12-12T21:37:22Z" level=info msg="/csi.v1.Identity/Probe: REP 0004: Ready=value:true, XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2024-12-12T21:37:22Z" level=info msg="/csi.v1.Identity/Probe: REQ 0028: XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
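
The request count quoted above can be checked mechanically. The following is a minimal sketch, assuming request IDs are sequential so that the ID of the last REQ logged before the first REP equals the number of requests issued so far. The function name is hypothetical, and the sample lines are shortened copies of the log entries above.

```python
import re

# Matches the request/reply markers that appear in the driver logs above.
PROBE_EVENT = re.compile(r"/csi\.v1\.Identity/Probe: (REQ|REP) (\d+)")


def last_request_before_first_reply(lines):
    """Return the ID of the last Probe REQ seen before the first Probe REP.

    Returns None if a REP appears before any REQ, and the last REQ ID seen
    if the log contains no REP at all.
    """
    last_req = None
    for line in lines:
        match = PROBE_EVENT.search(line)
        if match is None:
            continue
        kind, number = match.group(1), int(match.group(2))
        if kind == "REP":
            return last_req
        last_req = number
    return last_req


# Shortened copies of log lines from this issue (timestamps and fields trimmed).
SAMPLE_LINES = [
    'level=info msg="/csi.v1.Identity/Probe: REQ 0026: ..."',
    'level=debug msg="Probe called"',
    'level=info msg="/csi.v1.Identity/Probe: REQ 0027: ..."',
    'level=info msg="/csi.v1.Identity/Probe: REP 0002: Ready=value:true, ..."',
    'level=info msg="/csi.v1.Identity/Probe: REQ 0028: ..."',
]
```

Calling last_request_before_first_reply(SAMPLE_LINES) returns 27, which matches the count stated at the top of this section.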

Screenshots

No response

Additional Environment Information

No response

Steps to Reproduce

1. Install csi-powerflex with >=2 storage systems configured in the secret.
2. One of the storage systems should be unreachable and take longer than 2 minutes to respond.

Expected Behavior

Ideally, if at least one of the arrays responds within the given timeout, the driver should continue with initialization, eventually enter a running state, and allow the user to perform storage maintenance actions against all PowerFlex arrays that are online.
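
One way to picture the expected behavior is a probe that checks every array in parallel under a single deadline and reports ready as soon as any array has answered. This is only an illustrative sketch in Python, not the driver's implementation (the driver itself is written in Go). The names probe_all, driver_ready, and probe_one are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def probe_all(arrays, probe_one, timeout_seconds):
    """Probe every array in parallel and return (reachable, unreachable) IDs.

    probe_one(system_id) is a hypothetical callable that raises on failure.
    Arrays that fail, or that have not answered within timeout_seconds,
    are reported as unreachable.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(arrays)))
    futures = {pool.submit(probe_one, system_id): system_id for system_id in arrays}
    done, _pending = wait(futures, timeout=timeout_seconds)
    # Do not block on stragglers: slow probes finish in the background.
    pool.shutdown(wait=False)
    reachable = sorted(futures[f] for f in done if f.exception() is None)
    unreachable = sorted(sid for sid in futures.values() if sid not in reachable)
    return reachable, unreachable


def driver_ready(reachable):
    """Report ready when at least one array answered within the deadline."""
    return len(reachable) > 0
```

The design choice here is that a slow or unreachable array only shrinks the set of usable arrays; it never delays the readiness answer beyond the deadline passed to probe_all.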

CSM Driver(s)

csi-powerflex:v2.12.0

Installation Type

Operator:v1.7.0

Container Storage Modules Enabled

N/A

Container Orchestrator

OpenShift v4.17.7

Operating System

RHEL 9.4

@lukeatdell lukeatdell added type/bug Something isn't working. This is the default label associated with a bug issue. area/csi-powerflex Issue pertains to the CSI Driver for Dell EMC PowerFlex labels Dec 12, 2024
@lukeatdell lukeatdell self-assigned this Dec 12, 2024