CertManager/ACME OpenAPI discovery? #1847
@marcusbooyah: I'm not entirely sure what's happening here. It appears that you've defined an API server aggregation endpoint called v1alpha1.acme.neoncloud.io that forwards traffic to the neon-acme deployment, which then forwards traffic to the neon-cloud headend. I've listed the API services and I do see v1alpha1.acme.neoncloud.io listed there, and it looks like it's referencing neon-ingress/neon-acme correctly. I set the neon-acme log level to debug and captured these logs:
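For reference, the APIService described above would look something like this manifest. The group/version and the service reference are taken from the names in this thread; the priorities, port, and TLS settings are assumptions, not the actual cluster configuration:

```yaml
# Hypothetical reconstruction of the v1alpha1.acme.neoncloud.io APIService.
# Only the name, group/version, and service reference come from this thread.
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1alpha1.acme.neoncloud.io
spec:
  group: acme.neoncloud.io
  version: v1alpha1
  groupPriorityMinimum: 1000   # assumed
  versionPriority: 15          # assumed
  service:
    name: neon-acme
    namespace: neon-ingress
    port: 443                  # assumed
  # Either a CA bundle for the serving cert, or skip verification:
  insecureSkipTLSVerify: true  # assumed
```

With an APIService like this in place, the API server proxies everything under /apis/acme.neoncloud.io/v1alpha1 to the neon-ingress/neon-acme service.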
...so it looks like the API calls are being forwarded by the API server to neon-acme. We don't see the response status code in the logs.
@marcusbooyah: I poked around the Kubernetes AggregationController source code and it looks like the API server might be trying to retrieve information about the local service (neon-acme in this case) and that's failing. The GOLANG code is hard to follow and it's not clear what the API server is looking for. The code seems to be related to monitoring the APIService for changes; this is failing often enough that the operation is being rate-limited, which explains the CPU load we're seeing. The APIService looks OK though:
This all seems a bit overcomplicated. Do we really need an APIService to proxy the headend service? Can't whoever is calling this (CertManager, LetsEncrypt, ZeroSSL, ...) just hit our neon-acme service directly? Is the problem that this service needs to be secured by TLS and it can't be because we need neon-acme to obtain the cert (chicken-and-egg)?
This looks like it's related: failed with: OpenAPI spec does not exist. So, I saw that the Kubernetes API Aggregation Layer documentation mentions at the bottom here that the local proxied service needs to respond to discovery requests within 5 seconds, without mentioning what a discovery request actually does. I'll bet that the API server periodically queries the API for its OpenAPI spec and uses this to validate requests that pass through the API server proxy. Maybe we just need to have the neon-acme service generate an OpenAPI spec. I see that neon-acme initializes Swagger but customizes the arguments. I'm going to try removing the customization and going with the defaults.
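To make the "OpenAPI spec does not exist" idea concrete: the aggregator fetches a Swagger v2 document from the backing service, so at minimum the service needs to return something shaped like the sketch below. This is a hypothetical illustration, not the actual neon-acme code; the group/version names come from this thread, and the path entry is an assumption:

```python
import json

def build_openapi_spec():
    # Minimal sketch of a Swagger v2 document an aggregated API server
    # could serve. Group/version taken from this thread; the path entry
    # and its operations are illustrative assumptions.
    return {
        "swagger": "2.0",
        "info": {"title": "neon-acme", "version": "v1alpha1"},
        "paths": {
            "/apis/acme.neoncloud.io/v1alpha1/": {
                "get": {
                    "description": "API discovery",
                    "produces": ["application/json"],
                    "responses": {"200": {"description": "OK"}},
                }
            }
        },
    }

# Serialize the spec as the service would return it.
spec_json = json.dumps(build_openapi_spec())
```

If neon-acme's Swagger setup produces a document roughly like this at the expected endpoint, the aggregator's spec fetch should stop failing.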
Using default Swagger parameters didn't help. I've been poking around the source code trying to figure out what's happening here and whether we can configure CertManager to hit the headend directly, or perhaps go through the neon-acme proxy but without the (second) APIService proxy/extension.
So, I'm wondering if we're actually using this API service at all. I'm going to comment out the cluster setup code that creates this extension and see what happens. I'm going to leave the neon-acme service alone though, because it's needed to present the cluster's JWT to the headend. JEFF UPDATE: removing the v1alpha1.acme.neoncloud.io APIService didn't work, so it must be referenced somewhere.
Looking at this issue, it appears that the API server is expecting the OpenAPI spec to be hosted by neon-kube here: /apis/acme.neoncloud.io/v1alpha1. Hmmm, I'm not sure that actually makes sense; that looks like the API server URI. I'm going to try enabling debug logging on the neon-acme service to see what the API server is hitting.
It looks like the Swagger spec needs to be served from: /apis/v1alpha1
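Besides the OpenAPI spec, the proxied service also has to answer the group and version discovery requests mentioned in the aggregation docs. The sketch below shows the shape of those discovery documents; the group and version come from this thread, the "challenges" resource is a hypothetical example, and which exact request path neon-acme sees is precisely what's being debated above:

```python
import json

# Discovery documents the aggregator expects the backing service to
# return (within 5 seconds). Resource "challenges" is hypothetical.
DISCOVERY_DOCS = {
    # Group discovery: GET /apis/acme.neoncloud.io
    "/apis/acme.neoncloud.io": {
        "kind": "APIGroup",
        "apiVersion": "v1",
        "name": "acme.neoncloud.io",
        "versions": [
            {"groupVersion": "acme.neoncloud.io/v1alpha1",
             "version": "v1alpha1"}
        ],
        "preferredVersion": {
            "groupVersion": "acme.neoncloud.io/v1alpha1",
            "version": "v1alpha1",
        },
    },
    # Version discovery: GET /apis/acme.neoncloud.io/v1alpha1
    "/apis/acme.neoncloud.io/v1alpha1": {
        "kind": "APIResourceList",
        "apiVersion": "v1",
        "groupVersion": "acme.neoncloud.io/v1alpha1",
        "resources": [
            {"name": "challenges",        # hypothetical resource
             "singularName": "challenge",
             "namespaced": True,
             "kind": "Challenge",
             "verbs": ["create"]}
        ],
    },
}

def discover(path: str) -> str:
    """Return the JSON discovery body for a request path (KeyError if unknown)."""
    return json.dumps(DISCOVERY_DOCS[path])
```

If neon-acme returns these documents at whichever paths the API server actually requests, the discovery timeouts and the "OpenAPI spec does not exist" errors should be easier to isolate from each other.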
I no longer believe this is causing the performance problem. After letting a cluster run overnight, I see that neon-cluster-operator is busy slamming the API server. We still have a problem with OpenAPI discovery. I've edited the issue title to reflect this. |
The performance issues ended up being due to Kubernetes watch and OperatorSDK related issues. |
It looks like the API server's performance problems are actually related to CertManager/ACME right now rather than neon-cluster-operator, which also had issues in the past, but those appear to have been fixed.
This problem doesn't surface right away. I was able to repro this by deploying a desktop cluster and letting it run overnight. It takes a while for the cluster CPU utilization to start spiking, with the API server consuming over 1 CPU.
I looked at the API server logs and see that the API server is being slammed by these: