fix: Index out of range in metrics_store.SanitizeHeaders #2166

mrueg · 2023-08-24T18:25:51Z

What this PR does / why we need it:
We observed a panic that did not crash ksm but could have been avoided.

2023-08-24T18:08:21.796778057Z 2023/08/24 18:08:21 http: panic serving ip:port: runtime error: index out of range [0] with length 0 
2023-08-24T18:08:21.796814387Z goroutine 589687 [running]: 
2023-08-24T18:08:21.796823334Z net/http.(*conn).serve.func1() 
2023-08-24T18:08:21.796830520Z  /usr/local/go/src/net/http/server.go:1854 +0xbf 
2023-08-24T18:08:21.796838117Z panic({0x19c2060, 0xc0043060a8}) 
2023-08-24T18:08:21.796844764Z  /usr/local/go/src/runtime/panic.go:890 +0x263 
2023-08-24T18:08:21.796852064Z k8s.io/kube-state-metrics/v2/pkg/metrics_store.SanitizeHeaders(...) 2023-08-24T18:08:21.796858974Z  /root/go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_writer.go:94 2023-08-24T18:08:21.796866007Z k8s.io/kube-state-metrics/v2/pkg/metricshandler.(*MetricsHandler).ServeHTTP(0xc000221490, {0x1d425a0, 0xc005872700}, 0xc003f9d300) 2023-08-24T18:08:21.796873054Z  /root/go/src/k8s.io/kube-state-metrics/pkg/metricshandler/metrics_handler.go:211 +0x8fe ...

I'm not sure if this is the correct fix, it should resolve the symptom but the question is why or if any writer should have an empty store in the first place. @rexagod @dgrisonnet if you have any ideas, please let me know.

How does this change affect the cardinality of KSM: (increases, decreases or does not change cardinality)
None
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

k8s-ci-robot · 2023-08-24T18:25:58Z

This issue is currently awaiting triage.

If kube-state-metrics contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

2023-08-24T18:08:21.796778057Z 2023/08/24 18:08:21 http: panic serving ip:port: runtime error: index out of range [0] with length 0 2023-08-24T18:08:21.796814387Z goroutine 589687 [running]: 2023-08-24T18:08:21.796823334Z net/http.(*conn).serve.func1() 2023-08-24T18:08:21.796830520Z /usr/local/go/src/net/http/server.go:1854 +0xbf 2023-08-24T18:08:21.796838117Z panic({0x19c2060, 0xc0043060a8}) 2023-08-24T18:08:21.796844764Z /usr/local/go/src/runtime/panic.go:890 +0x263 2023-08-24T18:08:21.796852064Z k8s.io/kube-state-metrics/v2/pkg/metrics_store.SanitizeHeaders(...) 2023-08-24T18:08:21.796858974Z /root/go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_writer.go:94 2023-08-24T18:08:21.796866007Z k8s.io/kube-state-metrics/v2/pkg/metricshandler.(*MetricsHandler).ServeHTTP(0xc000221490, {0x1d425a0, 0xc005872700}, 0xc003f9d300) 2023-08-24T18:08:21.796873054Z /root/go/src/k8s.io/kube-state-metrics/pkg/metricshandler/metrics_handler.go:211 +0x8fe ... Signed-off-by: Manuel Rüger <manuel@rueg.eu>

dgrisonnet · 2023-08-24T19:14:03Z

Oh wow that seems pretty bad.

@mrueg did it crashed out of nowhere or was it shortly after startup?

dgrisonnet · 2023-08-24T19:15:38Z

FWIW WriteAll() has the same check: https://github.com/kubernetes/kube-state-metrics/blob/main/pkg/metrics_store/metrics_writer.go#L50-L52, but I don't know yet how we could end up in that situation I will investigate.

mrueg · 2023-08-25T07:55:43Z

It crashed internally, but stayed alive (so /health stayed healthy and k8s didn't kill the pod)
I assume constructor(b) might be empty for some reason.
https://github.com/kubernetes/kube-state-metrics/blob/main/internal/store/builder.go#L257

rexagod · 2023-08-25T12:39:59Z

I wasn't able to reproduce this with: (a) an empty KSM CRS config, (b) an empty --resources "" or a --resources flag with a native object that's excluded from the native set of default enabled resources (eg., clusterroles) (stores are built only for resources which are enabled) and (c) an empty KSM CRS config with --custom-resource-state-only set.

To sum it up, in a non-CRS mode, writer stores will, at the very least, have a length of O(DefaultResources) or O(len(--resources)). In a CRS mode, things like an empty config., or explicitly not providing the header data (#TYPE/#HELP) will result in an empty writer object, or a store with a metric UID key with no value, respectively. Neither of these cases crash KSM, which I've also verified a couple minutes ago, just to be sure.

But, of course, there's clearly something I'm missing, which I cannot diagnose further without more information about the environment KSM was running under, and how, because the writer stores (not the writer itself) have indeed achieved a zero length somehow, so the cautious fix at hand seems the best way forward in hindsight.

rexagod · 2023-08-25T15:23:02Z

/lgtm

k8s-ci-robot · 2023-08-25T15:23:09Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrueg, rexagod

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [mrueg,rexagod]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mrueg requested a review from rexagod August 24, 2023 18:25

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 24, 2023

k8s-ci-robot requested review from dgrisonnet and logicalhan August 24, 2023 18:25

k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Aug 24, 2023

k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Aug 24, 2023

mrueg changed the title ~~fix: Index out of range in metrics_writer~~ fix: Index out of range in metrics_store.SanitizeHeaders Aug 24, 2023

mrueg force-pushed the fix-panic branch from 4d7302f to 1536c72 Compare August 24, 2023 18:28

k8s-ci-robot assigned rexagod Aug 25, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 25, 2023

k8s-ci-robot merged commit 6b1daa7 into kubernetes:main Aug 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Index out of range in metrics_store.SanitizeHeaders #2166

fix: Index out of range in metrics_store.SanitizeHeaders #2166

mrueg commented Aug 24, 2023

k8s-ci-robot commented Aug 24, 2023

dgrisonnet commented Aug 24, 2023

dgrisonnet commented Aug 24, 2023 •

edited

Loading

mrueg commented Aug 25, 2023

rexagod commented Aug 25, 2023

rexagod commented Aug 25, 2023

k8s-ci-robot commented Aug 25, 2023

fix: Index out of range in metrics_store.SanitizeHeaders #2166

fix: Index out of range in metrics_store.SanitizeHeaders #2166

Conversation

mrueg commented Aug 24, 2023

k8s-ci-robot commented Aug 24, 2023

dgrisonnet commented Aug 24, 2023

dgrisonnet commented Aug 24, 2023 • edited Loading

mrueg commented Aug 25, 2023

rexagod commented Aug 25, 2023

rexagod commented Aug 25, 2023

k8s-ci-robot commented Aug 25, 2023

dgrisonnet commented Aug 24, 2023 •

edited

Loading