CertManager/ACME OpenAPI discovery? #1847
@marcusbooyah: I'm not entirely sure what's happening here. It appears that you've defined an API server aggregation endpoint called v1alpha1.acme.neoncloud.io that forwards traffic to the neon-acme deployment, which then forwards traffic to the neon-cloud headend. I've listed the API services and I do see v1alpha1.acme.neoncloud.io listed there, and it looks like it's referencing neon-ingress/neon-acme correctly. I set the neon-acme log level to debug and captured these logs:
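For reference, the APIService described above would look something like this manifest. The group/version and the service reference are taken from the names in this thread; the priorities, port, and TLS settings are assumptions, not the actual cluster configuration:

```yaml
# Hypothetical reconstruction of the v1alpha1.acme.neoncloud.io APIService.
# Only the name, group/version, and service reference come from this thread.
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1alpha1.acme.neoncloud.io
spec:
  group: acme.neoncloud.io
  version: v1alpha1
  groupPriorityMinimum: 1000   # assumed
  versionPriority: 15          # assumed
  service:
    name: neon-acme
    namespace: neon-ingress
    port: 443                  # assumed
  # Either a CA bundle for the serving cert, or skip verification:
  insecureSkipTLSVerify: true  # assumed
```

With an APIService like this in place, the API server proxies everything under /apis/acme.neoncloud.io/v1alpha1 to the neon-ingress/neon-acme service.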
...so it looks like the API calls are being forwarded by the API server to neon-acme. We don't see the response status code in the logs.
@marcusbooyah: I poked around the Kubernetes AggregationController source code and it looks like the API server might be trying to retrieve information about the local service (neon-acme in this case) and that's failing. The GOLANG code is hard to follow and it's not clear what the API server is looking for. The code seems to be related to monitoring the APIService for changes; this is failing often enough that the operation is being rate-limited, which explains the CPU load we're seeing. The APIService looks OK though:
This all seems a bit overcomplicated. Do we really need an APIService to proxy the headend service? Can't whoever is calling this (CertManager, LetsEncrypt, ZeroSSL, ...) just hit our neon-acme service directly? Is the problem that this service needs to be secured by TLS and it can't be because we need neon-acme to obtain the cert (chicken-and-egg)?
This looks like it's related: failed with: OpenAPI spec does not exist. So, I saw that the Kubernetes API Aggregation Layer documentation mentions at the bottom here that the local proxied service needs to respond to discovery requests within 5 seconds, without mentioning what a discovery request actually does. I'll bet that the API server periodically queries the API for its OpenAPI spec and uses this to validate requests that pass through the API server proxy. Maybe we just need to have the neon-acme service generate an OpenAPI spec. I see that neon-acme initializes Swagger but customizes the arguments. I'm going to try removing the customization and going with the defaults.
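To make the "OpenAPI spec does not exist" idea concrete: the aggregator fetches a Swagger v2 document from the backing service, so at minimum the service needs to return something shaped like the sketch below. This is a hypothetical illustration, not the actual neon-acme code; the group/version names come from this thread, and the path entry is an assumption:

```python
import json

def build_openapi_spec():
    # Minimal sketch of a Swagger v2 document an aggregated API server
    # could serve. Group/version taken from this thread; the path entry
    # and its operations are illustrative assumptions.
    return {
        "swagger": "2.0",
        "info": {"title": "neon-acme", "version": "v1alpha1"},
        "paths": {
            "/apis/acme.neoncloud.io/v1alpha1/": {
                "get": {
                    "description": "API discovery",
                    "produces": ["application/json"],
                    "responses": {"200": {"description": "OK"}},
                }
            }
        },
    }

# Serialize the spec as the service would return it.
spec_json = json.dumps(build_openapi_spec())
```

If neon-acme's Swagger setup produces a document roughly like this at the expected endpoint, the aggregator's spec fetch should stop failing.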
Using default Swagger parameters didn't help. I've been poking around the source code trying to figure out what's happening here and whether we can configure CertManager to hit the headend directly, or perhaps go through the neon-acme proxy but without the (second) APIService proxy/extension.
So, I'm wondering if we're actually using this API service at all. I'm going to comment out the cluster setup code that creates this extension and see what happens. I'm going to leave the neon-acme service alone though, because it's needed to present the cluster's JWT to the headend. JEFF UPDATE: removing the v1alpha1.acme.neoncloud.io APIService didn't work, so it must be referenced somewhere.
Looking at this issue, it appears that the API server is expecting the OpenAPI spec to be hosted by neon-kube here: /apis/acme.neoncloud.io/v1alpha1. Hmmm, I'm not sure that actually makes sense; that looks like the API server URI. I'm going to try enabling debug logging on the neon-acme service to see what the API server is hitting.
It looks like the Swagger spec needs to be served from: /apis/v1alpha1
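Besides the OpenAPI spec, the proxied service also has to answer the group and version discovery requests mentioned in the aggregation docs. The sketch below shows the shape of those discovery documents; the group and version come from this thread, the "challenges" resource is a hypothetical example, and which exact request path neon-acme sees is precisely what's being debated above:

```python
import json

# Discovery documents the aggregator expects the backing service to
# return (within 5 seconds). Resource "challenges" is hypothetical.
DISCOVERY_DOCS = {
    # Group discovery: GET /apis/acme.neoncloud.io
    "/apis/acme.neoncloud.io": {
        "kind": "APIGroup",
        "apiVersion": "v1",
        "name": "acme.neoncloud.io",
        "versions": [
            {"groupVersion": "acme.neoncloud.io/v1alpha1",
             "version": "v1alpha1"}
        ],
        "preferredVersion": {
            "groupVersion": "acme.neoncloud.io/v1alpha1",
            "version": "v1alpha1",
        },
    },
    # Version discovery: GET /apis/acme.neoncloud.io/v1alpha1
    "/apis/acme.neoncloud.io/v1alpha1": {
        "kind": "APIResourceList",
        "apiVersion": "v1",
        "groupVersion": "acme.neoncloud.io/v1alpha1",
        "resources": [
            {"name": "challenges",        # hypothetical resource
             "singularName": "challenge",
             "namespaced": True,
             "kind": "Challenge",
             "verbs": ["create"]}
        ],
    },
}

def discover(path: str) -> str:
    """Return the JSON discovery body for a request path (KeyError if unknown)."""
    return json.dumps(DISCOVERY_DOCS[path])
```

If neon-acme returns these documents at whichever paths the API server actually requests, the discovery timeouts and the "OpenAPI spec does not exist" errors should be easier to isolate from each other.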
I no longer believe this is causing the performance problem. After letting a cluster run overnight, I see that neon-cluster-operator is busy slamming the API server. We still have a problem with OpenAPI discovery. I've edited the issue title to reflect this. |
The performance issues ended up being due to Kubernetes watch and OperatorSDK related issues. |
It looks like the API server's performance problems are actually related to CertManager/ACME right now rather than neon-cluster-operator, which also had issues in the past, but those appear to have been fixed.
This problem doesn't surface right away. I was able to repro this by deploying a desktop cluster and letting it run overnight. It takes a while for the cluster CPU utilization to start spiking, with the API server consuming over 1 CPU.
I looked at the API server logs and see that the API server is being slammed by these: