fix: Metrics Controllers Memory Leak #246

Merged: 1 commit merged into kubernetes-sigs:main from fix-memory-leak on Mar 20, 2023

Conversation

@engedaam (Contributor) commented on Mar 20, 2023

Fixes #3209
Description

The memory profile of a newly started Karpenter pod is shown below; it lists the top 10 memory allocations for the Karpenter controller.

      flat  flat%   sum%        cum   cum%
 5633.31kB 21.72% 21.72% 12318.57kB 47.49%  encoding/json.(*decodeState).objectInterface
 4631.90kB 17.86% 39.58%  5149.23kB 19.85%  encoding/json.unquote (inline)
 2644.87kB 10.20% 49.77%  2644.87kB 10.20%  encoding/json.(*Decoder).refill
 1536.51kB  5.92% 55.70%  1536.51kB  5.92%  go.uber.org/zap/zapcore.newCounters (inline)
 1536.23kB  5.92% 61.62%  1536.23kB  5.92%  github.com/aws/aws-sdk-go/aws/endpoints.init
 1536.02kB  5.92% 67.54%  5149.22kB 19.85%  encoding/json.(*decodeState).literalInterface
 1097.69kB  4.23% 71.77%  2133.79kB  8.23%  k8s.io/apimachinery/pkg/runtime.(*Scheme).AddKnownTypeWithName
  600.58kB  2.32% 74.09%   600.58kB  2.32%  github.com/go-playground/validator/v10.init
  532.26kB  2.05% 76.14%   532.26kB  2.05%  github.com/gogo/protobuf/proto.RegisterType
  524.09kB  2.02% 78.16%   524.09kB  2.02%  k8s.io/apimachinery/pkg/conversion.ConversionFuncs.AddUntyped
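
For reference, a heap summary like the one above can be collected from any Go binary that exposes the standard net/http/pprof handlers and then summarized with go tool pprof. The snippet below is a minimal, hypothetical sketch only (the port and wiring are illustrative, not Karpenter's actual setup):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // Serve the pprof handlers on a side port (the port is an assumption for this example).
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // ... start the controllers here ...
    select {}
}

// The in-use heap can then be summarized with:
//   go tool pprof -top http://localhost:6060/debug/pprof/heap
// which produces a flat/cum table like the ones in this description.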

After running Karpenter for 10 hours, we see a sharp increase in the memory used by the Karpenter controller. Below are the top 10 memory allocations for the Karpenter controller pod.

      flat  flat%   sum%        cum   cum%
   37.51MB 23.66% 23.66%    37.51MB 23.66%  github.com/aws/karpenter-core/pkg/controllers/metrics/provisioner.(*Controller).makeLabels (inline)
   22.51MB 14.20% 37.86%    25.51MB 16.09%  github.com/aws/karpenter-core/pkg/controllers/metrics/pod.(*Controller).makeLabels
   16.13MB 10.17% 48.03%    16.13MB 10.17%  strings.(*Builder).grow (inline)
   13.60MB  8.58% 56.61%    13.60MB  8.58%  fmt.Sprintf
   13.50MB  8.52% 65.13%    18.50MB 11.67%  github.com/aws/aws-sdk-go/private/protocol/xml/xmlutil.XMLToStruct
    4.50MB  2.84% 67.97%     4.50MB  2.84%  github.com/aws/aws-sdk-go/private/protocol/xml/xmlutil.(*XMLNode).findNamespaces (inline)
    4.50MB  2.84% 70.80%     5.50MB  3.47%  k8s.io/apimachinery/pkg/apis/meta/v1.(*ObjectMeta).Unmarshal
       4MB  2.52% 73.33%     5.50MB  3.47%  k8s.io/api/core/v1.(*PodSpec).Unmarshal
    3.57MB  2.25% 75.58%     3.57MB  2.25%  sync.(*Map).dirtyLocked
    3.50MB  2.21% 77.79%     3.50MB  2.21%  k8s.io/apimachinery/pkg/util/sets.String.Insert (inline)

We can see that a large amount of data is being retained by both the provisioner and pod metrics controllers; they should not be holding 37.51 MB and 25.51 MB of memory, respectively. This suggests a memory leak at the following locations (a hypothetical sketch of how such a leak can arise follows the list):
github.com/aws/karpenter-core/pkg/controllers/metrics/provisioner.(*Controller).makeLabels
github.com/aws/karpenter-core/pkg/controllers/metrics/pod.(*Controller).makeLabels
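
As a rough illustration only (hypothetical names, not the actual controller code or the actual fix), this kind of leak typically appears when label sets built per object are cached in a map that is only ever written to: as pods and provisioners churn, the map grows without bound unless entries are removed when the underlying object goes away.

package metrics

import "sync"

// labelCache is a hypothetical per-object cache of generated metric labels.
// If entries are never evicted, churned pods/provisioners accumulate forever.
type labelCache struct {
    mu     sync.Mutex
    labels map[string]map[string]string // object key -> label set
}

func newLabelCache() *labelCache {
    return &labelCache{labels: map[string]map[string]string{}}
}

// record stores the labels produced for an object on each reconcile.
func (c *labelCache) record(key string, l map[string]string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.labels[key] = l
}

// forget removes the entry when the object is deleted; without a call like
// this, the cache (and any metric series keyed by these labels) grows for
// the lifetime of the process.
func (c *labelCache) forget(key string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    delete(c.labels, key)
}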

How was this change tested?

Over a 10-hour period with node/pod churn, the memory usage of the current Karpenter controller grows steadily as a function of time:
[Graph: memory usage of the current Karpenter over a 10-hour period]
With the fix implemented, Karpenter's memory usage over the same 10-hour period looks as follows:
[Graph: memory usage of the fixed Karpenter over a 10-hour period]

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@coveralls commented on Mar 20, 2023

Pull Request Test Coverage Report for Build 4470915009

  • 8 of 8 (100.0%) changed or added relevant lines in 2 files are covered.
  • 9 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.1%) to 80.679%

Files with coverage reduction (new missed lines, resulting %):
  • pkg/controllers/machine/liveness.go: 2 missed lines, 85.0%
  • pkg/controllers/provisioning/scheduling/preferences.go: 7 missed lines, 86.67%
Totals:
  • Change from base Build 4438814623: -0.1%
  • Covered Lines: 6556
  • Relevant Lines: 8126

💛 - Coveralls

@engedaam engedaam marked this pull request as ready for review March 20, 2023 21:25
@engedaam engedaam requested a review from a team as a code owner March 20, 2023 21:25
@engedaam engedaam requested a review from njtran March 20, 2023 21:25
@njtran (Contributor) left a comment

Amazing analysis! Great work. Are you able to post a picture of the pod and provisioner dashboards from Grafana that you saw, to validate we're not breaking anything on that end?

@engedaam (Contributor, Author) commented:

The current provisioner/pod metrics dashboard:
[Screenshot: current Karpenter metrics dashboard]
The fixed provisioner/pod metrics dashboard:
[Screenshot: fixed Karpenter metrics dashboard]

@engedaam engedaam merged commit 0d99636 into kubernetes-sigs:main Mar 20, 2023
@engedaam engedaam deleted the fix-memory-leak branch March 20, 2023 22:06
Development
Successfully merging this pull request may close these issues:
  • Controller memory allocation steadily increasing over time
3 participants