[BUG]: CSM Operator 1.5.0 crashloopbackoff #1637

Open
OA72280 opened this issue Dec 11, 2024 · 9 comments
Labels
area/csm-operator, type/bug

Comments

@OA72280

OA72280 commented Dec 11, 2024

Bug Description

The customer has attempted to deploy CSM 1.5.0 on OpenShift 4.14.x using both the Red Hat OperatorHub (mirrored) and the offline installer process. In both cases the Operator loads momentarily and then goes into CrashLoopBackOff.

Note: the user's environment already had the PowerFlex CSI driver deployed, with the snapshotter sidecar. After they updated their CR file and ran oc replace -f cr-file.yaml, nothing happened; the snapshotter was not removed.

Validated that the CR file is not the cause. Took a look at the environment: the crashing operator is evidently why nothing happens when the updated CR is applied.

Cleared out the CSI driver and the operator, but the operator now fails to deploy.

Logs

2024-12-10T22:29:56.977Z DEBUG workspace/main.go:87 Operator Version {"TraceId": "main", "Version": "1.5.0", "Commit ID": "5a711be500ab85c70be0dea037f3c97ba77d20b2", "Commit SHA": "Tue, 19 Mar 2024 14:31:11 UTC"}
2024-12-10T22:29:56.978Z DEBUG workspace/main.go:88 Go Version: go1.22.1 {"TraceId": "main"}
2024-12-10T22:29:56.978Z DEBUG workspace/main.go:89 Go OS/Arch: linux/amd64 {"TraceId": "main"}
2024-12-10T22:29:57.180Z INFO workspace/main.go:101 Openshift environment {"TraceId": "main"}
2024-12-10T22:29:57.184Z INFO workspace/main.go:140 Current kubernetes version is 1.27 which is a supported version {"TraceId": "main"}
2024-12-10T22:29:57.184Z INFO workspace/main.go:151 Use ConfigDirectory /etc/config/dell-csm-operator {"TraceId": "main"}
2024-12-10T22:29:57Z INFO controller-runtime.metrics Metrics server is starting to listen {"addr": ":8082"}
2024-12-10T22:29:57Z INFO setup starting manager
2024-12-10T22:29:57Z INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}
2024-12-10T22:29:57Z INFO starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8082"}
I1210 22:29:57.276792 1 leaderelection.go:245] attempting to acquire leader lease -dell-csm-operator-168737/090cae6a.dell.com...
I1210 22:30:14.787155 1 leaderelection.go:255] successfully acquired lease -dell-csm-operator-168737/090cae6a.dell.com
2024-12-10T22:30:14Z DEBUG events dell-csm-operator-controller-manager-fc9c5f796-njg5x_0b2eccbf-e65c-4861-9257-9d55487737df became leader {"type": "Normal", "object": {"kind":"Lease","namespace":"-dell-csm-operator-168737","name":"090cae6a.dell.com","uid":"2e6e47e1-23bf-4021-86b1-7bd8e5453ee9","apiVersion":"coordination.k8s.io/v1","resourceVersion":"720572476"}, "reason": "LeaderElection"}
2024-12-10T22:30:14Z INFO Starting EventSource {"controller": "containerstoragemodule", "controllerGroup": "storage.dell.com", "controllerKind": "ContainerStorageModule", "source": "kind source: *v1.ContainerStorageModule"}
2024-12-10T22:30:14Z INFO Starting Controller {"controller": "containerstoragemodule", "controllerGroup": "storage.dell.com", "controllerKind": "ContainerStorageModule"}
2024-12-10T22:30:14Z INFO Starting EventSource {"controller": "apexconnectivityclient", "controllerGroup": "storage.dell.com", "controllerKind": "ApexConnectivityClient", "source": "kind source: *v1.ApexConnectivityClient"}
2024-12-10T22:30:14Z INFO Starting Controller {"controller": "apexconnectivityclient", "controllerGroup": "storage.dell.com", "controllerKind": "ApexConnectivityClient"}
2024-12-10T22:30:14Z INFO Starting workers {"controller": "containerstoragemodule", "controllerGroup": "storage.dell.com", "controllerKind": "ContainerStorageModule", "worker count": 1}
2024-12-10T22:30:14Z INFO Starting workers {"controller": "apexconnectivityclient", "controllerGroup": "storage.dell.com", "controllerKind": "ApexConnectivityClient", "worker count": 1}
2024-12-10T22:30:14.891Z INFO controllers/csm_controller.go:240 ################Starting Reconcile############## {"TraceId": "powerflex-1"}
2024-12-10T22:30:14.891Z INFO controllers/csm_controller.go:243 reconcile for {"TraceId": "powerflex-1", "Namespace": "-powerflex-168737", "Name": "powerflex", "Attempt": 1}

--Additional logs from previous pods--
I1210 21:40:32.681491 1 trace.go:236] Trace[1662988865]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.2/tools/cache/reflector.go:231 (10-Dec-2024 21:40:22.275) (total time: 10406ms):
Trace[1662988865]: ---"Objects listed" error: 10405ms (21:40:32.680)

2024-12-10T21:42:38.573Z INFO utils/status.go:304 error from getDeploymentStatus: Deployment.apps "powerflex-controller" not found {"TraceId": "powerflex-0"}

2024-12-10T21:46:29.776Z INFO utils/status.go:315 calculate Daemonseterror msg [DaemonSet.apps "powerflex-node" not found] {"TraceId": "powerflex-1"}

Screenshots

N/A

Additional Environment Information

N/A

Steps to Reproduce

N/A

Expected Behavior

N/A

CSM Driver(s)

CSM 1.5.0 with PowerFlex 2.10.0 CR

Installation Type

Offline Installer and OperatorHub

Container Storage Modules Enabled

Authorization

Enabled and Disabled, does not matter, either way results in the same outcome.

Container Orchestrator

OpenShift

Operating System

RHCOS

OA72280 added the needs-triage and type/bug labels Dec 11, 2024
@csmbot
Collaborator

csmbot commented Dec 11, 2024

@OA72280: Thank you for submitting this issue!

The issue is currently awaiting triage. Please make sure you have given us as much context as possible.

If the maintainers determine this is a relevant issue, they will remove the needs-triage label and respond appropriately.


We want your feedback! If you have any questions or suggestions regarding our contributing process/workflow, please reach out to us at container.storage.modules@dell.com.

@alikdell
Contributor

@OA72280 I am not able to understand the issue.

  1. Is the Operator working now, or is it still in the CrashLoopBackOff state? If it is still crashing, can you provide the pod describe output or anything else that shows us why the Operator is crashing?
  2. If the Operator is working now, it is highly unlikely that we will be able to find out what is going on from the limited logs given. We need more info/logs.
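
For anyone gathering the data requested above, the following is a minimal sketch of the kind of evidence that usually explains a CrashLoopBackOff. The function name `collect_operator_diagnostics` and the `OC` override variable are hypothetical helpers invented for this example; the namespace and pod name are arguments because the exact values for this environment are not known from the thread.

```shell
# Sketch only: print the evidence that usually explains a CrashLoopBackOff.
# OC can be overridden (for example OC=echo) to preview the commands without a cluster.
OC="${OC:-oc}"

# Usage: collect_operator_diagnostics <namespace> <pod-name>
collect_operator_diagnostics() {
  local ns="$1" pod="$2"

  # Pod events, restart count, and container state.
  "$OC" describe pod "$pod" -n "$ns"

  # Why the previous container instance exited (for example OOMKilled or Error).
  "$OC" get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'

  # Logs from the crashed instance rather than the freshly restarted one.
  "$OC" logs "$pod" -n "$ns" --previous --all-containers

  # Namespace events in chronological order.
  "$OC" get events -n "$ns" --sort-by=.lastTimestamp
}
```

The `--previous` flag matters here: without it, the logs come from the container that has just restarted and typically end before the failure, much like the excerpt in the original report.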

@OA72280
Author

OA72280 commented Dec 12, 2024

The customer reports that the operator came up long enough to roll the CSI driver without the snapshotter sidecar and is now crashing again.

alikdell removed the needs-triage label Dec 12, 2024
@alikdell
Contributor

/sync

@donatwork
Contributor

How was the PowerFlex driver installed: with the (CSI/CSM?) Operator or by other means? Was PowerFlex already installed prior to installing the Operator? Please get us an accurate timeline of what was done. We can try to reproduce, but the logs you have provided and the description of the problem are vague. Thanks.

@OA72280
Author

OA72280 commented Dec 12, 2024

The CSI driver was deployed and functional. The problem only surfaced when the customer went to remove the snapshotter: they edited the CR, setting disable: true for the snapshotter.

They tried to apply the CR and nothing happened. They then checked the status of the operator and learned it was failing. Not sure how long it had been failing; it could be hours, days, or weeks, but since the CSI driver was OK, no one knew.
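
One way to confirm whether a CR change like this was actually reconciled is to look at the containers in the controller Deployment. The sketch below assumes the Deployment is named powerflex-controller (the name that appears in the operator logs above) and that the sidecar container is named "snapshotter"; both are assumptions to verify against the actual cluster. `controller_has_snapshotter` and the `OC` variable are hypothetical names for this example.

```shell
# Sketch only: check whether the snapshotter sidecar is still part of the
# controller Deployment. Deployment and container names are assumptions.
OC="${OC:-oc}"

# Usage: controller_has_snapshotter <namespace> [deployment-name]
# Returns 0 if a container named "snapshotter" is in the pod template,
# 1 if it is absent, and 2 if the Deployment could not be read.
controller_has_snapshotter() {
  local ns="$1" deploy="${2:-powerflex-controller}"
  local containers
  containers="$("$OC" get deployment "$deploy" -n "$ns" \
    -o jsonpath='{.spec.template.spec.containers[*].name}')" || return 2
  # jsonpath prints the names separated by spaces; match a whole word.
  case " $containers " in
    *" snapshotter "*) return 0 ;;
    *) return 1 ;;
  esac
}
```

If this still returns 0 some time after the updated CR was applied, the change has not been rolled out, which points at the operator rather than the CR file.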

@hoppea2
Collaborator

hoppea2 commented Dec 13, 2024

/sync

@csmbot
Collaborator

csmbot commented Dec 13, 2024

link: 31238

@panigs7
Contributor

panigs7 commented Dec 16, 2024

@OA72280 Could you share the following data as well:

  1. The output of oc describe csm vxflexos -n
  2. The cr.yaml used to install the PowerFlex driver

Also, do you see the operator crashing if there is no driver installed?
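
A possible way to capture the requested data in one step is sketched below. The CSM name and namespace are passed as arguments because they differ between environments (the logs above show a resource named powerflex, while the request mentions vxflexos). `capture_csm_state`, the `OC` variable, and the output file names are hypothetical choices for this example.

```shell
# Sketch only: save the CSM custom resource state to files for attaching to the issue.
OC="${OC:-oc}"

# Usage: capture_csm_state <csm-name> <driver-namespace> <output-dir>
capture_csm_state() {
  local name="$1" ns="$2" out="$3"
  mkdir -p "$out"

  # Status, conditions, and events recorded on the custom resource.
  "$OC" describe csm "$name" -n "$ns" > "$out/csm-describe.txt"

  # The resource as the API server currently stores it, to compare with cr.yaml.
  "$OC" get csm "$name" -n "$ns" -o yaml > "$out/csm-live.yaml"
}
```

Comparing csm-live.yaml with the cr.yaml that was applied shows whether the edit reached the cluster at all, independent of whether the operator acted on it.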
