kube-apiserver SLO possible calculation issue #498

Closed
dghubble opened this issue Sep 13, 2020 · 4 comments
dghubble (Contributor) commented Sep 13, 2020

I'm looking at the kube-apiserver SLO rules, alerts, and dashboard, which seem to have been added since I last looked at this repo. However, the rules produce wildly unexpected values (negative availability, a -6000% error budget, 40% availability at best, etc.) on clusters with healthy apiservers.

Let's consider apiserver_request:availability for writes. At a high level, this tries to measure 1 - (slow requests + error requests) / total requests. For me, it evaluates to ~0.40, with slow requests being the supposed contributor (the error part yields 0).
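For context, the write availability calculation boils down to roughly the following (a simplified paraphrase of the recording rules, not the exact expression; the error term comes from apiserver_request_total):

# Sketch: write availability ≈ 1 - (slow writes + 5xx writes) / total writes
1 -
(
  # slow writes (latency > 1s)
  (
    sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d]))
    -
    sum(increase(apiserver_request_duration_seconds_bucket{le="1",verb=~"POST|PUT|PATCH|DELETE"}[30d]))
  )
  +
  # error writes (5xx)
  sum(increase(apiserver_request_total{verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[30d]))
)
/
sum(increase(apiserver_request_total{verb=~"POST|PUT|PATCH|DELETE"}[30d]))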

Looking closer reveals an issue. The query derives the "slow requests" (those taking longer than 1 second) by subtracting the histogram's le="1" bucket from the total count of request events.

 # too slow
sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d]))
-
sum(increase(apiserver_request_duration_seconds_bucket{le="1",verb=~"POST|PUT|PATCH|DELETE"}[30d]))

But these two metrics aren't actually measuring the same classes of requests. The assumption in the query above is that they would be equal in the ideal case (all requests with latency less than infinity == all requests):

sum(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"})  6388
sum(apiserver_request_duration_seconds_bucket{le="+Inf",verb=~"POST|PUT|PATCH|DELETE"})  15287

seconds_bucket records requests for core API objects (nodes, secrets, configmaps) only, while seconds_count also records requests for additional API objects (certificatesigningrequests, tokenreviews, customresourcedefinitions, endpointslices, networkpolicies). As a result, the query mostly just measures the difference in usage between the core API group and the other groups.

sum(apiserver_request_duration_seconds_bucket{le="+Inf",verb=~"POST|PUT|PATCH|DELETE"}) by (resource)
sum(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}) by (resource)
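Grouping by the group label (assuming your scrape config keeps that label) makes the mismatch even more obvious, since only the core API group has group="":

sum(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}) by (group)
sum(apiserver_request_duration_seconds_bucket{le="+Inf",verb=~"POST|PUT|PATCH|DELETE"}) by (group)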

Workaround

A fix would be to filter apiserver_request_duration_seconds_count to group="". With that change, availability is 100% (or very close) as expected.

sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE", group=""}[30d]))
-
sum(increase(apiserver_request_duration_seconds_bucket{le="1",verb=~"POST|PUT|PATCH|DELETE"}[30d]))
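As a quick sanity check (my own suggestion, not part of the mixin), on a setup like the one above where seconds_bucket only has core API series, the filtered count and the +Inf bucket should then land on (nearly) the same value:

sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE", group=""}[30d]))
sum(increase(apiserver_request_duration_seconds_bucket{le="+Inf",verb=~"POST|PUT|PATCH|DELETE"}[30d]))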

Before submitting that, I'd like to understand how this was added and why others don't see the issue.

  • Are folks dropping apiserver_request_duration_seconds_count non-core time series?
  • Do the above examples work for others? Maybe this is Kubernetes-version specific.

Versions

Kubernetes: v1.19.1
mixin: release-0.5

cc @metalmatze

@dghubble
Contributor Author

This problem is my own. It's caused by dropping high-cardinality metrics from apiserver_request_duration_seconds_bucket, but not from apiserver_request_duration_seconds_count. Either both need to drop high-cardinality series or neither should; any difference between them will break the SLO calculations as described above.
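For anyone in the same boat, a minimal sketch of a symmetric drop in plain Prometheus scrape config syntax (a hypothetical relabeling, not something the mixin ships or requires):

# Hypothetical: drop non-core API group series from _bucket, _count, and _sum
# together, so the histogram stays internally consistent.
metric_relabel_configs:
  - source_labels: [__name__, group]
    regex: apiserver_request_duration_seconds_(bucket|count|sum);.+
    action: drop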

@dghubble
Contributor Author

Prometheus Operator (which I'd guess maintainers test against) drops some series from apiserver_request_duration_seconds_bucket and not from seconds_count, but not in a way that affects the calculation. So I think that's why others haven't observed this.

https://github.com/prometheus-operator/kube-prometheus/blob/master/manifests/prometheus-serviceMonitorApiserver.yaml#L58

@dghubble
Contributor Author

#498 (comment) was the solution, hopefully helpful to someone.

dghubble added a commit to poseidon/typhoon that referenced this issue Sep 13, 2020
* Reduce `apiserver_request_duration_seconds_count` cardinality
by dropping series for non-core Kubernetes APIs. This is done
to match `apiserver_request_duration_seconds_bucket` relabeling
* These two relabels must be performed the same way to avoid
affecting new SLO calculations (upcoming)
* See kubernetes-monitoring/kubernetes-mixin#498

Related: #596
@metalmatze
Member

Yeah, you're totally right. The cardinality of these metrics is quite insane. Recently in SIG-instrumentation, there has even been a discussion around overhauling these metrics for exactly that reason, if I recall correctly.
Glad you figured it out yourself 👍
