[BUG]: CSI-PowerFlex entering boot loop when array has long response times #1639

lukeatdell · 2024-12-12T21:45:44Z

Bug Description

When installing the csi-powerflex driver (v2.12.0) with multiple PowerFlex arrays configured in the secret, if one of the arrays is unreachable and takes a long time to respond, the driver controller gets stuck in a boot loop trying to authenticate with the unreachable array.

Specifically, the issue occurs when one of the arrays does not respond before the timeout specified by the Kubernetes sidecar in the driver deployment workload (.spec.template.spec.containers[].args["--timeout=120s"]). If the timeout is not specified, the default is 15s.

Logs

The logs below show 27 Probe requests issued before the first reply.
vxflexos-controller - driver container logs:

time="2024-12-12T21:37:10Z" level=info msg="/csi.v1.Identity/Probe: REQ 0024: XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2024-12-12T21:37:10Z" level=debug msg="Probe called"
time="2024-12-12T21:37:10Z" level=debug msg=systemProbe
time="2024-12-12T21:37:10Z" level=info msg="Probing all arrays. Number of arrays: 3"
time="2024-12-12T21:37:10Z" level=info msg="vol id in UpdateVolumePrefixToSystemsMap is: c6d03fd700000407  from systemID: <array 2> \n"
time="2024-12-12T21:37:10Z" level=info msg="volumePrefixToSystems: systemID: <array 2>  already added for key c6d. Not adding for key again. \n"
time="2024-12-12T21:37:10Z" level=info msg="array <array 2> probed successfully"
time="2024-12-12T21:37:10Z" level=info msg="vol id in UpdateVolumePrefixToSystemsMap is: 70a19854000000d4  from systemID: <array 1> \n"
time="2024-12-12T21:37:10Z" level=info msg="volumePrefixToSystems: systemID: <array 1>  already added for key 70a. Not adding for key again. \n"
time="2024-12-12T21:37:10Z" level=info msg="array <array 1> probed successfully"
time="2024-12-12T21:37:11Z" level=info msg="/csi.v1.Identity/Probe: REQ 0025: XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2024-12-12T21:37:11Z" level=debug msg="Probe called"
time="2024-12-12T21:37:11Z" level=debug msg=systemProbe
time="2024-12-12T21:37:11Z" level=info msg="Probing all arrays. Number of arrays: 3"
time="2024-12-12T21:37:11Z" level=info msg="vol id in UpdateVolumePrefixToSystemsMap is: c6d03fd700000407  from systemID: <array 2> \n"
time="2024-12-12T21:37:11Z" level=info msg="volumePrefixToSystems: systemID: <array 2>  already added for key c6d. Not adding for key again. \n"
time="2024-12-12T21:37:11Z" level=info msg="array <array 2> probed successfully"
time="2024-12-12T21:37:11Z" level=info msg="vol id in UpdateVolumePrefixToSystemsMap is: 70a19854000000d4  from systemID: <array 1> \n"
time="2024-12-12T21:37:11Z" level=info msg="volumePrefixToSystems: systemID: <array 1>  already added for key 70a. Not adding for key again. \n"
time="2024-12-12T21:37:11Z" level=info msg="array <array 1> probed successfully"
time="2024-12-12T21:37:12Z" level=info msg="/csi.v1.Identity/Probe: REQ 0026: XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2024-12-12T21:37:12Z" level=debug msg="Probe called"
time="2024-12-12T21:37:12Z" level=debug msg=systemProbe
time="2024-12-12T21:37:12Z" level=info msg="Probing all arrays. Number of arrays: 3"
time="2024-12-12T21:37:12Z" level=info msg="/csi.v1.Identity/Probe: REQ 0027: XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2024-12-12T21:37:12Z" level=debug msg="Probe called"
time="2024-12-12T21:37:12Z" level=debug msg=systemProbe
time="2024-12-12T21:37:12Z" level=info msg="Probing all arrays. Number of arrays: 3"
time="2024-12-12T21:37:12Z" level=info msg="vol id in UpdateVolumePrefixToSystemsMap is: 70a19854000000d4  from systemID: <array 1> \n"
time="2024-12-12T21:37:12Z" level=info msg="volumePrefixToSystems: systemID: <array 1>  already added for key 70a. Not adding for key again. \n"
time="2024-12-12T21:37:12Z" level=info msg="array <array 1> probed successfully"
time="2024-12-12T21:37:22Z" level=error msg="array <bad array> probe failed: rpc error: code = FailedPrecondition desc = unable to login to VxFlexOS Gateway: Get \"https://<bad array ip>/api/login\": dial tcp <bad array ip>:443: connect: connection timed out"
time="2024-12-12T21:37:22Z" level=error msg="array <bad array> probe failed: rpc error: code = FailedPrecondition desc = unable to login to VxFlexOS Gateway: Get \"https://<bad array ip>/api/login\": dial tcp <bad array ip>:443: connect: connection timed out"
time="2024-12-12T21:37:22Z" level=error msg="array <bad array> probe failed: rpc error: code = FailedPrecondition desc = unable to login to VxFlexOS Gateway: Get \"https://<bad array ip>/api/login\": dial tcp <bad array ip>:443: connect: connection timed out"
time="2024-12-12T21:37:22Z" level=debug msg="Probe returning: true"
time="2024-12-12T21:37:22Z" level=error msg="array <bad array> probe failed: rpc error: code = FailedPrecondition desc = unable to login to VxFlexOS Gateway: Get \"https://<bad array ip>/api/login\": dial tcp <bad array ip>:443: connect: connection timed out"
time="2024-12-12T21:37:22Z" level=debug msg="Probe returning: true"
time="2024-12-12T21:37:22Z" level=info msg="/csi.v1.Identity/Probe: REP 0002: Ready=value:true, XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2024-12-12T21:37:22Z" level=info msg="/csi.v1.Identity/Probe: REP 0005: Ready=value:true, XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2024-12-12T21:37:22Z" level=info msg="vol id in UpdateVolumePrefixToSystemsMap is: c6d03fd700000407  from systemID: <array 2> \n"
time="2024-12-12T21:37:22Z" level=info msg="volumePrefixToSystems: systemID: <array 2>  already added for key c6d. Not adding for key again. \n"
time="2024-12-12T21:37:22Z" level=info msg="array <array 2> probed successfully"
time="2024-12-12T21:37:22Z" level=info msg="vol id in UpdateVolumePrefixToSystemsMap is: c6d03fd700000407  from systemID: <array 2> \n"
time="2024-12-12T21:37:22Z" level=info msg="volumePrefixToSystems: systemID: <array 2>  already added for key c6d. Not adding for key again. \n"
time="2024-12-12T21:37:22Z" level=info msg="array <array 2> probed successfully"
time="2024-12-12T21:37:22Z" level=info msg="vol id in UpdateVolumePrefixToSystemsMap is: 70a19854000000d4  from systemID: <array 1> \n"
time="2024-12-12T21:37:22Z" level=info msg="volumePrefixToSystems: systemID: <array 1>  already added for key 70a. Not adding for key again. \n"
time="2024-12-12T21:37:22Z" level=info msg="array <array 1> probed successfully"
time="2024-12-12T21:37:22Z" level=debug msg="Probe returning: true"
time="2024-12-12T21:37:22Z" level=info msg="/csi.v1.Identity/Probe: REP 0001: Ready=value:true, XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2024-12-12T21:37:22Z" level=info msg="vol id in UpdateVolumePrefixToSystemsMap is: 70a19854000000d4  from systemID: <array 1> \n"
time="2024-12-12T21:37:22Z" level=info msg="volumePrefixToSystems: systemID: <array 1>  already added for key 70a. Not adding for key again. \n"
time="2024-12-12T21:37:22Z" level=info msg="array <array 1> probed successfully"
time="2024-12-12T21:37:22Z" level=debug msg="Probe returning: true"
time="2024-12-12T21:37:22Z" level=info msg="/csi.v1.Identity/Probe: REP 0004: Ready=value:true, XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2024-12-12T21:37:22Z" level=info msg="/csi.v1.Identity/Probe: REQ 0028: XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"

Screenshots

No response

Additional Environment Information

No response

Steps to Reproduce

1. Install csi-powerflex with >=2 storage systems configured in the secret.
2. One of the storage systems should be unreachable and take longer than 2 minutes to respond.

Expected Behavior

Ideally, if at least one of the arrays responds within the given timeout, the driver should continue with initialization, eventually reach a running state, and allow the user to perform storage maintenance actions against all PowerFlex arrays that are online.

CSM Driver(s)

csi-powerflex:v2.12.0

Installation Type

Operator:v1.7.0

Container Storage Modules Enabled

N/A

Container Orchestrator

OpenShift v4.17.7

Operating System

RHEL 9.4


lukeatdell added the type/bug and area/csi-powerflex labels Dec 12, 2024

lukeatdell self-assigned this Dec 12, 2024
