Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Index out of range in metrics_store.SanitizeHeaders #2166

Merged
merged 1 commit into from
Aug 25, 2023

Conversation

mrueg
Copy link
Member

@mrueg mrueg commented Aug 24, 2023

What this PR does / why we need it:
We observed a panic that did not crash ksm but could have been avoided.

2023-08-24T18:08:21.796778057Z 2023/08/24 18:08:21 http: panic serving ip:port: runtime error: index out of range [0] with length 0 
2023-08-24T18:08:21.796814387Z goroutine 589687 [running]: 
2023-08-24T18:08:21.796823334Z net/http.(*conn).serve.func1() 
2023-08-24T18:08:21.796830520Z  /usr/local/go/src/net/http/server.go:1854 +0xbf 
2023-08-24T18:08:21.796838117Z panic({0x19c2060, 0xc0043060a8}) 
2023-08-24T18:08:21.796844764Z  /usr/local/go/src/runtime/panic.go:890 +0x263 
2023-08-24T18:08:21.796852064Z k8s.io/kube-state-metrics/v2/pkg/metrics_store.SanitizeHeaders(...) 2023-08-24T18:08:21.796858974Z  /root/go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_writer.go:94 2023-08-24T18:08:21.796866007Z k8s.io/kube-state-metrics/v2/pkg/metricshandler.(*MetricsHandler).ServeHTTP(0xc000221490, {0x1d425a0, 0xc005872700}, 0xc003f9d300) 2023-08-24T18:08:21.796873054Z  /root/go/src/k8s.io/kube-state-metrics/pkg/metricshandler/metrics_handler.go:211 +0x8fe ...

I'm not sure if this is the correct fix, it should resolve the symptom but the question is why or if any writer should have an empty store in the first place. @rexagod @dgrisonnet if you have any ideas, please let me know.

How does this change affect the cardinality of KSM: (increases, decreases or does not change cardinality)
None
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

@mrueg mrueg requested a review from rexagod August 24, 2023 18:25
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 24, 2023
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Aug 24, 2023
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If kube-state-metrics contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Aug 24, 2023
@mrueg mrueg changed the title fix: Index out of range in metrics_writer fix: Index out of range in metrics_store.SanitizeHeaders Aug 24, 2023
2023-08-24T18:08:21.796778057Z 2023/08/24 18:08:21 http: panic serving ip:port: runtime error: index out of range [0] with length 0
2023-08-24T18:08:21.796814387Z goroutine 589687 [running]:
2023-08-24T18:08:21.796823334Z net/http.(*conn).serve.func1()
2023-08-24T18:08:21.796830520Z  /usr/local/go/src/net/http/server.go:1854 +0xbf
2023-08-24T18:08:21.796838117Z panic({0x19c2060, 0xc0043060a8})
2023-08-24T18:08:21.796844764Z  /usr/local/go/src/runtime/panic.go:890 +0x263
2023-08-24T18:08:21.796852064Z k8s.io/kube-state-metrics/v2/pkg/metrics_store.SanitizeHeaders(...)
2023-08-24T18:08:21.796858974Z  /root/go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_writer.go:94
2023-08-24T18:08:21.796866007Z k8s.io/kube-state-metrics/v2/pkg/metricshandler.(*MetricsHandler).ServeHTTP(0xc000221490, {0x1d425a0, 0xc005872700}, 0xc003f9d300)
2023-08-24T18:08:21.796873054Z  /root/go/src/k8s.io/kube-state-metrics/pkg/metricshandler/metrics_handler.go:211 +0x8fe
...

Signed-off-by: Manuel Rüger <manuel@rueg.eu>
@dgrisonnet
Copy link
Member

Oh wow that seems pretty bad.

@mrueg did it crashed out of nowhere or was it shortly after startup?

@dgrisonnet
Copy link
Member

dgrisonnet commented Aug 24, 2023

FWIW WriteAll() has the same check: https://github.com/kubernetes/kube-state-metrics/blob/main/pkg/metrics_store/metrics_writer.go#L50-L52, but I don't know yet how we could end up in that situation I will investigate.

@mrueg
Copy link
Member Author

mrueg commented Aug 25, 2023

It crashed internally, but stayed alive (so /health stayed healthy and k8s didn't kill the pod)
I assume constructor(b) might be empty for some reason.
https://github.com/kubernetes/kube-state-metrics/blob/main/internal/store/builder.go#L257

@rexagod
Copy link
Member

rexagod commented Aug 25, 2023

I wasn't able to reproduce this with: (a) an empty KSM CRS config, (b) an empty --resources "" or a --resources flag with a native object that's excluded from the native set of default enabled resources (eg., clusterroles) (stores are built only for resources which are enabled) and (c) an empty KSM CRS config with --custom-resource-state-only set.

To sum it up, in a non-CRS mode, writer stores will, at the very least, have a length of O(DefaultResources) or O(len(--resources)). In a CRS mode, things like an empty config., or explicitly not providing the header data (#TYPE/#HELP) will result in an empty writer object, or a store with a metric UID key with no value, respectively. Neither of these cases crash KSM, which I've also verified a couple minutes ago, just to be sure.

But, of course, there's clearly something I'm missing, which I cannot diagnose further without more information about the environment KSM was running under, and how, because the writer stores (not the writer itself) have indeed achieved a zero length somehow, so the cautious fix at hand seems the best way forward in hindsight.

@rexagod
Copy link
Member

rexagod commented Aug 25, 2023

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 25, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrueg, rexagod

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 6b1daa7 into kubernetes:main Aug 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants