Do not materialise labels when comparing certs #256

kokes · 2024-03-04T11:11:09Z

Hey,
similar to some other issues submitted recently, we faced OOM kills when running
the exporter with a high number of k8s secrets. This PR seems to resolve #255 (the issue contains a reproducer).

While I thought the issue was the extremely memory intensive method getLabels, it
was in a way just a part of the issue. While this method does allocate heavily,
the main problem is that the exporter itself is slow. And since it doesn't have a
context, it is not cancelled and many concurrect collections happen, all allocating
and the process slowly gets OOM killed as memory mounts.

The new codebase allows for much faster parsing of certs (or, more specifically, their
deduplication). This then makes collection pretty much instant and no memory gets accumulated.

Here's how this looked and how it looks now with the fix:

High level perf stats

Original memory profile (when just parsing certs):

(pprof) top
Showing nodes accounting for 3186.67MB, 98.59% of 3232.31MB total
Dropped 125 nodes (cum <= 16.16MB)
Showing top 10 nodes out of 17
      flat  flat%   sum%        cum   cum%
 1872.50MB 57.93% 57.93%  2059.50MB 63.72%  github.com/enix/x509-certificate-exporter/v3/internal.fillLabelsFromName
  448.05MB 13.86% 71.79%   662.07MB 20.48%  path.Join
  323.58MB 10.01% 81.80%  1126.67MB 34.86%  github.com/enix/x509-certificate-exporter/v3/internal.(*Exporter).getBaseLabels
  186.50MB  5.77% 87.57%      187MB  5.79%  fmt.Sprintf
  141.02MB  4.36% 91.94%   141.02MB  4.36%  strings.genSplit
  113.51MB  3.51% 95.45%   113.51MB  3.51%  path.(*lazybuf).string (inline)
  100.51MB  3.11% 98.56%   100.51MB  3.11%  path.(*lazybuf).append (inline)
       1MB 0.031% 98.59%       23MB  0.71%  crypto/x509.CreateCertificate
         0     0% 98.59%  3186.17MB 98.57%  github.com/enix/x509-certificate-exporter/v3/internal.(*Exporter).compareCertificates
         0     0% 98.59%  3186.17MB 98.57%  github.com/enix/x509-certificate-exporter/v3/internal.(*Exporter).getLabels

New memory profile:

(pprof) top
Showing nodes accounting for 187.62MB, 63.16% of 297.06MB total
Dropped 88 nodes (cum <= 1.49MB)
Showing top 10 nodes out of 133
      flat  flat%   sum%        cum   cum%
      40MB 13.47% 13.47%       40MB 13.47%  github.com/prometheus/client_golang/prometheus.MakeLabelPairs
   36.01MB 12.12% 25.59%    36.01MB 12.12%  github.com/prometheus/client_golang/prometheus.v2.NewDesc
   19.52MB  6.57% 32.16%    66.53MB 22.40%  crypto/x509.parseCertificate
   19.01MB  6.40% 38.56%    19.51MB  6.57%  github.com/enix/x509-certificate-exporter/v3/internal.fillLabelsFromName
      19MB  6.40% 44.96%    30.50MB 10.27%  encoding/asn1.makeField
   17.50MB  5.89% 50.85%    17.50MB  5.89%  crypto/x509/pkix.(*Name).FillFromRDNSequence
      12MB  4.04% 54.89%    17.50MB  5.89%  crypto/x509.parseName
       9MB  3.03% 57.92%        9MB  3.03%  vendor/golang.org/x/crypto/cryptobyte.(*String).ReadASN1ObjectIdentifier
    8.50MB  2.86% 60.78%     8.50MB  2.86%  encoding/pem.Decode
    7.05MB  2.37% 63.16%     9.32MB  3.14%  compress/flate.NewWriter

Benchmark comparison (before/after):

x509-certificate-exporter okokes$ benchstat before.txt after.txt
goos: darwin
goarch: arm64
pkg: github.com/enix/x509-certificate-exporter/v3/internal
                      │   before.txt   │               after.txt                │
                      │     sec/op     │    sec/op     vs base                  │
ParsingCertificates-8   2685.78m ± 14%   39.92m ±  7%  -98.51% (p=0.000 n=10+7)

                      │   before.txt    │               after.txt                │
                      │      B/op       │     B/op      vs base                  │
ParsingCertificates-8   3182.520Mi ± 0%   9.326Mi ± 0%  -99.71% (p=0.000 n=10+7)

                      │  before.txt   │               after.txt               │
                      │   allocs/op   │  allocs/op   vs base                  │
ParsingCertificates-8   36376.7k ± 0%   125.1k ± 0%  -99.66% (p=0.000 n=10+7)

Changes made

added a Go benchmark, so that I could collect data
added a trimComponents function that exits early in case we don't trim components (and thus don't allocate a slice of strings, do path.Join etc.)
removed fmt.Sprintf in the metric construction to save ourselves an allocation
removed getLabels calls in the deduplication path - and inlined all the comparisons without allocating a map - this is the meat of this PR

I have some other changes drafted, but they have relatively low impact. Here are some numbers on the trimComponents part, which was perhaps the biggest
offendor after the getLabels stuff was sorted.

x509-certificate-exporter okokes$ benchstat before.txt after.txt
goos: darwin
goarch: arm64
pkg: github.com/enix/x509-certificate-exporter/v3/internal
                      │ before.txt  │              after.txt              │
                      │   sec/op    │   sec/op     vs base                │
ParsingCertificates-8   2.686 ± 14%   2.031 ± 12%  -24.40% (p=0.000 n=10)

                      │  before.txt  │              after.txt               │
                      │     B/op     │     B/op      vs base                │
ParsingCertificates-8   3.108Gi ± 0%   2.348Gi ± 0%  -24.44% (p=0.000 n=10)

                      │ before.txt  │              after.txt              │
                      │  allocs/op  │  allocs/op   vs base                │
ParsingCertificates-8   36.38M ± 0%   29.38M ± 0%  -19.23% (p=0.000 n=10)

kokes · 2024-03-04T11:35:50Z

Note that I have not run the test suite, because it fails to run not just for this PR, but also for main (I guess my environment is somehow broken).

Can you please run it, @npdgm? Thank you

kokes · 2024-03-06T07:43:57Z

Oh never mind, I got the tests to run against a fresh kind cluster and they pass.

plaffitt · 2024-03-22T13:43:56Z

Hi! Thank you very much for you work! Your explanations are well detailed and the optimization is clever. I think it may be a bit dangerous in a maintainability perspective, in a sense that if one day we modify the getLabels function, we will have to remember to update the compareCertificates as well. But I would say that the benefits are worth the risk. Anyway, I will try to find a way to tackle this issue later, but for now I will merge your work.

Thanks again! 🚀

monkeynator · 2024-04-05T10:49:06Z

🎉 This PR is included in version 3.13.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

Do not materialise labels when comparing certs

e874948

plaffitt merged commit 03c48bc into enix:main Mar 22, 2024
3 checks passed

monkeynator added the released label Apr 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do not materialise labels when comparing certs #256

Do not materialise labels when comparing certs #256

kokes commented Mar 4, 2024 •

edited

Loading

kokes commented Mar 4, 2024

kokes commented Mar 6, 2024

plaffitt commented Mar 22, 2024

monkeynator commented Apr 5, 2024

Do not materialise labels when comparing certs #256

Do not materialise labels when comparing certs #256

Conversation

kokes commented Mar 4, 2024 • edited Loading

High level perf stats

Changes made

kokes commented Mar 4, 2024

kokes commented Mar 6, 2024

plaffitt commented Mar 22, 2024

monkeynator commented Apr 5, 2024

kokes commented Mar 4, 2024 •

edited

Loading