Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not materialise labels when comparing certs #256

Merged
merged 1 commit into from
Mar 22, 2024
Merged

Conversation

kokes
Copy link
Contributor

@kokes kokes commented Mar 4, 2024

Hey,
similar to some other issues submitted recently, we faced OOM kills when running
the exporter with a high number of k8s secrets. This PR seems to resolve #255 (the issue contains a reproducer).

While I thought the issue was the extremely memory intensive method getLabels, it
was in a way just a part of the issue. While this method does allocate heavily,
the main problem is that the exporter itself is slow. And since it doesn't have a
context, it is not cancelled and many concurrect collections happen, all allocating
and the process slowly gets OOM killed as memory mounts.

The new codebase allows for much faster parsing of certs (or, more specifically, their
deduplication). This then makes collection pretty much instant and no memory gets accumulated.

Here's how this looked and how it looks now with the fix:

Screenshot 2024-03-04 at 11 07 54@2x

High level perf stats

Original memory profile (when just parsing certs):

(pprof) top
Showing nodes accounting for 3186.67MB, 98.59% of 3232.31MB total
Dropped 125 nodes (cum <= 16.16MB)
Showing top 10 nodes out of 17
      flat  flat%   sum%        cum   cum%
 1872.50MB 57.93% 57.93%  2059.50MB 63.72%  github.com/enix/x509-certificate-exporter/v3/internal.fillLabelsFromName
  448.05MB 13.86% 71.79%   662.07MB 20.48%  path.Join
  323.58MB 10.01% 81.80%  1126.67MB 34.86%  github.com/enix/x509-certificate-exporter/v3/internal.(*Exporter).getBaseLabels
  186.50MB  5.77% 87.57%      187MB  5.79%  fmt.Sprintf
  141.02MB  4.36% 91.94%   141.02MB  4.36%  strings.genSplit
  113.51MB  3.51% 95.45%   113.51MB  3.51%  path.(*lazybuf).string (inline)
  100.51MB  3.11% 98.56%   100.51MB  3.11%  path.(*lazybuf).append (inline)
       1MB 0.031% 98.59%       23MB  0.71%  crypto/x509.CreateCertificate
         0     0% 98.59%  3186.17MB 98.57%  github.com/enix/x509-certificate-exporter/v3/internal.(*Exporter).compareCertificates
         0     0% 98.59%  3186.17MB 98.57%  github.com/enix/x509-certificate-exporter/v3/internal.(*Exporter).getLabels

New memory profile:

(pprof) top
Showing nodes accounting for 187.62MB, 63.16% of 297.06MB total
Dropped 88 nodes (cum <= 1.49MB)
Showing top 10 nodes out of 133
      flat  flat%   sum%        cum   cum%
      40MB 13.47% 13.47%       40MB 13.47%  github.com/prometheus/client_golang/prometheus.MakeLabelPairs
   36.01MB 12.12% 25.59%    36.01MB 12.12%  github.com/prometheus/client_golang/prometheus.v2.NewDesc
   19.52MB  6.57% 32.16%    66.53MB 22.40%  crypto/x509.parseCertificate
   19.01MB  6.40% 38.56%    19.51MB  6.57%  github.com/enix/x509-certificate-exporter/v3/internal.fillLabelsFromName
      19MB  6.40% 44.96%    30.50MB 10.27%  encoding/asn1.makeField
   17.50MB  5.89% 50.85%    17.50MB  5.89%  crypto/x509/pkix.(*Name).FillFromRDNSequence
      12MB  4.04% 54.89%    17.50MB  5.89%  crypto/x509.parseName
       9MB  3.03% 57.92%        9MB  3.03%  vendor/golang.org/x/crypto/cryptobyte.(*String).ReadASN1ObjectIdentifier
    8.50MB  2.86% 60.78%     8.50MB  2.86%  encoding/pem.Decode
    7.05MB  2.37% 63.16%     9.32MB  3.14%  compress/flate.NewWriter

Benchmark comparison (before/after):

x509-certificate-exporter okokes$ benchstat before.txt after.txt
goos: darwin
goarch: arm64
pkg: github.com/enix/x509-certificate-exporter/v3/internal
                      │   before.txt   │               after.txt                │
                      │     sec/op     │    sec/op     vs base                  │
ParsingCertificates-8   2685.78m ± 14%   39.92m ±  7%  -98.51% (p=0.000 n=10+7)

                      │   before.txt    │               after.txt                │
                      │      B/op       │     B/op      vs base                  │
ParsingCertificates-8   3182.520Mi ± 0%   9.326Mi ± 0%  -99.71% (p=0.000 n=10+7)

                      │  before.txt   │               after.txt               │
                      │   allocs/op   │  allocs/op   vs base                  │
ParsingCertificates-8   36376.7k ± 0%   125.1k ± 0%  -99.66% (p=0.000 n=10+7)

Changes made

  • added a Go benchmark, so that I could collect data
  • added a trimComponents function that exits early in case we don't trim components (and thus don't allocate a slice of strings, do path.Join etc.)
  • removed fmt.Sprintf in the metric construction to save ourselves an allocation
  • removed getLabels calls in the deduplication path - and inlined all the comparisons without allocating a map - this is the meat of this PR

I have some other changes drafted, but they have relatively low impact. Here are some numbers on the trimComponents part, which was perhaps the biggest
offendor after the getLabels stuff was sorted.

x509-certificate-exporter okokes$ benchstat before.txt after.txt
goos: darwin
goarch: arm64
pkg: github.com/enix/x509-certificate-exporter/v3/internal
                      │ before.txt  │              after.txt              │
                      │   sec/op    │   sec/op     vs base                │
ParsingCertificates-8   2.686 ± 14%   2.031 ± 12%  -24.40% (p=0.000 n=10)

                      │  before.txt  │              after.txt               │
                      │     B/op     │     B/op      vs base                │
ParsingCertificates-8   3.108Gi ± 0%   2.348Gi ± 0%  -24.44% (p=0.000 n=10)

                      │ before.txt  │              after.txt              │
                      │  allocs/op  │  allocs/op   vs base                │
ParsingCertificates-8   36.38M ± 0%   29.38M ± 0%  -19.23% (p=0.000 n=10)

@kokes
Copy link
Contributor Author

kokes commented Mar 4, 2024

Note that I have not run the test suite, because it fails to run not just for this PR, but also for main (I guess my environment is somehow broken).

Can you please run it, @npdgm? Thank you

@kokes
Copy link
Contributor Author

kokes commented Mar 6, 2024

Oh never mind, I got the tests to run against a fresh kind cluster and they pass.

@plaffitt
Copy link
Contributor

Hi! Thank you very much for you work! Your explanations are well detailed and the optimization is clever. I think it may be a bit dangerous in a maintainability perspective, in a sense that if one day we modify the getLabels function, we will have to remember to update the compareCertificates as well. But I would say that the benefits are worth the risk. Anyway, I will try to find a way to tackle this issue later, but for now I will merge your work.

Thanks again! 🚀

@plaffitt plaffitt merged commit 03c48bc into enix:main Mar 22, 2024
3 checks passed
@monkeynator
Copy link
Member

🎉 This PR is included in version 3.13.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Concurrent collector runs lead to an OOM kill
3 participants