kube-apiserver persistently high memory usage with large number of CRDs #2920

Closed
1 of 6 tasks
matthchr opened this issue Apr 25, 2023 · 1 comment · Fixed by #3007
Labels: high-priority (Issues we intend to prioritize: security, outage, blocking bug)

matthchr commented Apr 25, 2023

kube-apiserver uses a surprisingly large amount of memory when CRDs are added.

ASO v2.0.0 installs ~125 CRDs. A quick test on a local kind cluster (Kubernetes v1.26.3) shows kube-apiserver resident memory growing from:
304 MB (empty cluster) -> 380 MB (cert-manager installed) -> 2.5 GB (ASO just installed) -> 2.0 GB (steady state).

This means that each of our 125 CRDs accounts for roughly 13 MB of kube-apiserver memory (a quick arithmetic check is below).
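
A rough sanity check on that per-CRD figure, using the steady-state and pre-ASO measurements above (shell arithmetic only; no new data):

```sh
# Steady-state growth attributable to ASO, divided by the number of CRDs it installs.
echo $(( (2000 - 380) / 125 ))   # 1620/125 ≈ 13 MB per CRD (prints 12 due to integer division)
```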

Customer impact

It's very easy to cross a managed cluster's kube-apiserver memory limits. If that happens, kube-apiserver may be OOMKilled by the managed cluster provider (such as AKS), with negative downstream effects on pretty much everything in the cluster: pod restarts due to expired watches, failed Kubernetes API requests, and so on.

Quick instructions for profiling the kind apiserver

  1. Start kind.
  2. Install ASO.
  3. Run kubectl proxy --port=8080 &
  4. Run go tool pprof -png "http://localhost:8080/debug/pprof/heap" > out.png. The trailing path can be any pprof endpoint, e.g. /debug/pprof/profile or /debug/pprof/heap.
  5. Alternatively, run curl localhost:8080/debug/pprof/heap > out.pprof and then analyze it with go tool pprof out.pprof. (The full sequence is consolidated in the snippet below.)
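
A minimal end-to-end sketch of the same workflow (assumes kind, kubectl, curl, and a Go toolchain are installed; the ASO install command itself is omitted):

```sh
kind create cluster                                    # step 1: start kind
# step 2: install ASO (Helm chart or release YAML; not shown here)

kubectl proxy --port=8080 &                            # step 3: proxy to kube-apiserver
curl -s localhost:8080/debug/pprof/heap > out.pprof    # steps 4/5: capture a heap profile
go tool pprof -png out.pprof > out.png                 # render a call graph of in-use heap
go tool pprof -top out.pprof                           # or print the top allocation sites
```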

Experiments

  • Is memory usage reduced if CRDs have no documentation included?
  • Is memory usage reduced if v1beta versions are removed?
  • Examine in detail with pprof where kube-apiserver memory usage is coming from.

Prior art

Some key snippets from Crossplane's summary:

Further investigation reveals that zapcore.newCounters has allocated 815 objects in the process heap (where there are 780 CRDs), and each allocation is _numLevels * _countersPerLevel * (Int64 + Uint64) = 7 * 4096 * (8+8) = 448 KB! Thus, for 780 CRDs (plus other zap loggers), we are using 815 * 448 KB = ~356 MB of heap space. zapcore.newCounters is responsible for ~31% of the heap space allocated!
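
A quick check of the arithmetic in that quote (the constants are as reported by Crossplane; they are not re-verified against zap's source here):

```sh
echo $(( 7 * 4096 * (8 + 8) ))                 # 458752 bytes = 448 KB per logger
echo $(( 815 * 7 * 4096 * 16 / 1024 / 1024 ))  # ≈ 356 MB across 815 allocations
```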

On large CRDs:

However, to exacerbate the situation, we have generated ~190 CRDs, each with 5000 OpenAPI v3 schema properties. Even with only 190 such CRDs, heap profiling data reveals that ~5 GB of heap space is allocated, and zapcore.newCounters is no longer the largest allocation site, as expected. This profiling data better emphasizes allocation sites which are inflated with large (deserialized) OpenAPI v3 CRD schemas. One can generate such CRDs using the following command:

Proposed fixes


matthchr commented Apr 25, 2023

Here's a heap profile (pprof) of kube-apiserver from a local kind cluster running ASO v2.0.0.

Interestingly, it shows only ~1.1 GB in use, whereas resident memory was 2.0+ GB. I'm not sure where that discrepancy comes from; it may need further digging:
[Image: aso-v2-0-0-apiserver heap profile graph]

matthchr added the high-priority label on May 4, 2023
matthchr self-assigned this on May 23, 2023
matthchr added a commit to matthchr/azure-service-operator that referenced this issue May 23, 2023
Fixes Azure#1433.
Fixes Azure#2920.

* By default, no CRDs are installed.
* If no CRDs are installed, the operator pod will exit with an error
  stating that there are no CRDs.
* The operator pod's --crd-pattern command-line argument now accepts more
  than '*' (a hypothetical example is sketched below).
* --crd-pattern specifies NEW CRDs to install. Existing CRDs in the cluster
  will always be upgraded. This means that upgrading an existing ASO
  installation without specifying any new CRDs will upgrade all of the
  existing CRDs and install no new CRDs.
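
The commit message does not spell out the new pattern grammar. As a purely hypothetical illustration of how the flag might be passed (the binary name, separator, and pattern values below are assumptions, not taken from this issue):

```sh
# Hypothetical: ask the operator to manage only CRDs matching these patterns.
# Group names and the ';' separator are illustrative; consult the ASO docs
# for the real grammar.
aso-controller --crd-pattern "resources.azure.com/*;network.azure.com/*"
```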
matthchr added a commit to matthchr/azure-service-operator that referenced this issue May 23, 2023
matthchr added a commit to matthchr/azure-service-operator that referenced this issue May 24, 2023
matthchr added a commit to matthchr/azure-service-operator that referenced this issue May 24, 2023
matthchr added a commit to matthchr/azure-service-operator that referenced this issue May 24, 2023
github-project-automation bot moved this from Backlog to Recently Completed in Azure Service Operator Roadmap on May 24, 2023
matthchr moved this from Recently Completed to Ready for Release in Azure Service Operator Roadmap on May 30, 2023