OperatorSDK issue after restarting neon-cluster-operator? #1852

jefflill · 2023-08-19T00:19:50Z

It looks like the OperatorSDK may have trouble re-establishing webhooks after the operator restarts.

While trying to debug the performance issue, I set LOG_LEVEL=trace and restarted neon-cluster-operator. The API server immediately showed fairly high CPU usage, and it appeared unable to send webhook requests to the new neon-cluster-operator pod (the neon-acme OpenAPI errors from #1847 are intermixed here as well):

{"ts":1692403829344.051,"caller":"openapi/controller.go:116","msg":"loading OpenAPI spec for \"v1alpha1.acme.neoncloud.io\" failed with: OpenAPI spec does not exist\n"}
{"ts":1692403829344.0842,"caller":"openapi/controller.go:129","msg":"OpenAPI AggregationController: action for item v1alpha1.acme.neoncloud.io: Rate Limited Requeue.\n","v":0}
{"ts":1692403836935.4392,"caller":"mutating/dispatcher.go:180","msg":"Failed calling webhook, failing open deployment-policy.neonkube.io: failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n","v":0}
{"ts":1692403836935.4954,"caller":"mutating/dispatcher.go:184","msg":"failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n"}
{"ts":1692403846959.5413,"caller":"mutating/dispatcher.go:180","msg":"Failed calling webhook, failing open deployment-policy.neonkube.io: failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n","v":0}
{"ts":1692403846959.5977,"caller":"mutating/dispatcher.go:184","msg":"failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n"}
{"ts":1692403856987.9785,"caller":"mutating/dispatcher.go:180","msg":"Failed calling webhook, failing open deployment-policy.neonkube.io: failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n","v":0}
{"ts":1692403856988.0146,"caller":"mutating/dispatcher.go:184","msg":"failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n"}
{"ts":1692403867015.4243,"caller":"mutating/dispatcher.go:180","msg":"Failed calling webhook, failing open deployment-policy.neonkube.io: failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n","v":0}
{"ts":1692403867015.4587,"caller":"mutating/dispatcher.go:184","msg":"failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n"}
{"ts":1692403877039.0762,"caller":"mutating/dispatcher.go:180","msg":"Failed calling webhook, failing open deployment-policy.neonkube.io: failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n","v":0}
{"ts":1692403877039.1125,"caller":"mutating/dispatcher.go:184","msg":"failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n"}
{"ts":1692403887063.5889,"caller":"mutating/dispatcher.go:180","msg":"Failed calling webhook, failing open deployment-policy.neonkube.io: failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n","v":0}
{"ts":1692403887063.643,"caller":"mutating/dispatcher.go:184","msg":"failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n"}
{"ts":1692403897103.4758,"caller":"mutating/dispatcher.go:180","msg":"Failed calling webhook, failing open deployment-policy.neonkube.io: failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n","v":0}
{"ts":1692403897103.534,"caller":"mutating/dispatcher.go:184","msg":"failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n"}


marcusbooyah · 2023-08-28T06:18:58Z

Was this a single-node cluster? Did it not go away once the operator started up?
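"Connection refused" (as opposed to a timeout) usually means the service address resolved but nothing was listening on the port yet, which fits a pod that is still starting. A quick way to check when the operator's webhook port begins accepting connections is to poll it. This is a minimal sketch, assuming it runs somewhere that can reach the service (host `neon-cluster-operator.neon-system.svc` and port 443 are taken from the log); the helper `wait_for_tcp` is hypothetical, and it only checks TCP reachability, not TLS or whether the webhook itself responds correctly.

```python
import socket
import time


def wait_for_tcp(host, port, timeout_s=60.0, interval_s=1.0):
    """Poll host:port until a TCP connection succeeds.

    Returns the seconds waited, or None if the deadline passes first.
    Only proves something is listening; it does not exercise TLS or the webhook.
    """
    start = time.monotonic()
    deadline = start + timeout_s
    while True:
        try:
            # Raises ConnectionRefusedError (an OSError) while nothing is listening,
            # matching the "connect: connection refused" errors in the issue's log.
            with socket.create_connection((host, port), timeout=interval_s):
                return time.monotonic() - start
        except OSError:
            if time.monotonic() >= deadline:
                return None
            time.sleep(interval_s)
```

For example, `wait_for_tcp("neon-cluster-operator.neon-system.svc", 443)` run from a pod in the cluster right after restarting the operator would report roughly how long the "connection refused" window lasted.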

marcusbooyah · 2023-08-30T00:05:06Z

I don't think this is an issue.

jefflill · 2023-08-30T00:39:00Z

Yeah, it was probably a single-node cluster. This is an example of the slightly odd log output I've been seeing, so I'm opening issues for these as I find them.

...not sure it's a problem either.

jefflill assigned marcusbooyah Aug 19, 2023

jefflill added the bug (Identifies a bug or other failure), neon-kube (Related to our Kubernetes distribution), and cluster-operators (Related to one of our cluster operators) labels Aug 19, 2023

jefflill changed the title ~~OperatorSDK issue after restarting neon-cluster-operator~~ OperatorSDK issue after restarting neon-cluster-operator? Aug 19, 2023

jefflill added the investigate (Needs further investigation) label Aug 19, 2023

jefflill mentioned this issue Aug 20, 2023 in PERF: neon-cluster-operator after a few hours or cluster restart #1851 (Closed)

marcusbooyah closed this as completed Aug 30, 2023