
Mimir mixin Disk space utilization panels broken for mimir helm chart #7515

Closed
jmichalek132 opened this issue Mar 1, 2024 · 14 comments · Fixed by #8212

Comments

@jmichalek132
Contributor

Describe the bug

The disk space utilization panels in the mimir mixin don't work with mimir deployed using the mimir-distributed helm chart.

Query example from one of the panels:

max by(persistentvolumeclaim) (
  kubelet_volume_stats_used_bytes{cluster=~"$cluster", namespace=~"$namespace"} /
  kubelet_volume_stats_capacity_bytes{cluster=~"$cluster", namespace=~"$namespace"}
)
and
count by(persistentvolumeclaim) (
  kube_persistentvolumeclaim_labels{
    cluster=~"$cluster", namespace=~"$namespace",
    label_name=~"(ingester).*"
  }
)

The problematic part is the label_name=~"(ingester).*" selector on the kube_persistentvolumeclaim_labels metric coming from kube-state-metrics.

There are two issues with it:

  • If you use the kube-state-metrics service monitor from metamonitoring, it will drop the kube_persistentvolumeclaim_labels metric due to this metric relabeling rule
  • The PVCs produced by the helm chart don't have a name label; they only carry the labels below (see the selector sketch after them):
  labels:
    app.kubernetes.io/component: ingester
    app.kubernetes.io/instance: metrics
    app.kubernetes.io/name: mimir
    rollout-group: ingester
    zone: zone-a
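
For illustration: kube-state-metrics exports Kubernetes labels on kube_persistentvolumeclaim_labels as label_<name>, with non-alphanumeric characters replaced by underscores, so these PVCs carry label_rollout_group="ingester" and label_app_kubernetes_io_component="ingester" rather than label_name. Assuming kube-state-metrics is configured to expose those labels at all (see the discussion about defaults further down), a selector that would actually match them looks something like:

# illustrative sketch, not the current mixin query
count by(persistentvolumeclaim) (
  kube_persistentvolumeclaim_labels{
    cluster=~"$cluster", namespace=~"$namespace",
    label_rollout_group=~"(ingester).*"
  }
)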

To Reproduce

Steps to reproduce the behavior:

  1. Deploy Mimir using the mimir-distributed helm chart
  2. Deploy the mimir mixin

Expected behavior

For the Disk space utilization panels to work.

Environment

  • Infrastructure: Kubernetes, AKS
  • Deployment tool: Helm

Additional Context

Not sure what would be the best way to fix this.
Things that come to mind:

  • Update the service monitor so it doesn't drop metrics that the mixin uses
  • Allow configuring which label is used instead of label_name in those panels

Willing to submit a PR to address this after feedback.

@dimitarvdimitrov
Contributor

dimitarvdimitrov commented Mar 1, 2024

Keeping the kube_persistentvolumeclaim_labels metric just for the pods from mimir looks tricky. Can we use the 'app.kubernetes.io/name: mimir' label inside the relabelling config to pick this up? To avoid scraping other mimir clusters in the same k8s cluster we can add another 'keep' relabelling which checks the namespace label (and we need to make sure it's present on all currently collected metrics).

Regarding the second problem of using different labels - is it possible to use label_replace in the panel query so we take either 'label_rollout_group' or 'label_name', whichever is present? I.e. replace label_name with the other only if the other is non-empty; maybe there's something with regex we can do?
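
A minimal sketch of that label_replace idea (illustrative only, not a final fix) could be:

# sketch: copy label_rollout_group into label_name when it is non-empty
count by(persistentvolumeclaim) (
  label_replace(
    kube_persistentvolumeclaim_labels{cluster=~"$cluster", namespace=~"$namespace"},
    "label_name", "$1", "label_rollout_group", "(.+)"
  )
)

The open question with this is how to then filter on the rewritten label_name, which comes up again further down in the thread.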

@QuentinBisson
Contributor

QuentinBisson commented Apr 17, 2024

@dimitarvdimitrov I'm just facing this issue - would using the app.kubernetes.io/component: ingester label not be a better choice than rollout-group for the query? I can work on that, as I really need this to be fixed.

@dimitarvdimitrov
Contributor

using app.kubernetes.io/name: mimir makes this slightly more reusable - we don't have to update the scraping rules every time a new component gets a disk (or we rename a component or add a disk to an existing component, etc). If that's not possible, then app.kubernetes.io/component should also suffice.

@QuentinBisson
Contributor

I think the issue with using app.kubernetes.io/name is that it would not work on specific dashboards like the write dashboard, because it would show the data of all Mimir components instead of just the ingester.

@QuentinBisson
Contributor

See PR here #7968

@QuentinBisson
Contributor

Let me know how I can speed things up :)

@dimitarvdimitrov
Contributor

Using a common label on the kube_persistentvolumeclaim_labels selector

My suggestion in the issue was to add support for both labels in the panel via the label_replace or label_join PromQL functions. But I realized this won't work, because we cannot filter on the series labels outside of the vector selector (kube_persistentvolumeclaim_labels{...}).

Another option for solving the selector problem is to add the standard kubernetes labels (app.kubernetes.io/*) to the jsonnet mixin and use them in the dashboards, like the ones here:

app.kubernetes.io/name: {{ include "mimir.name" .ctx }}
app.kubernetes.io/instance: {{ .ctx.Release.Name }}
{{- if .component }}
app.kubernetes.io/component: {{ .component }}
{{- end }}
{{- if .memberlist }}
app.kubernetes.io/part-of: memberlist
{{- end }}
{{- if .ctx.Chart.AppVersion }}
app.kubernetes.io/version: {{ .ctx.Chart.AppVersion | quote }}
{{- end }}
app.kubernetes.io/managed-by: {{ .ctx.Release.Service }}
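
If both the helm chart and the jsonnet library carried these labels, the panels could rely on a single common selector, e.g. (a sketch, assuming kube-state-metrics is set up to expose the label):

# sketch based on the standard app.kubernetes.io/component label
count by(persistentvolumeclaim) (
  kube_persistentvolumeclaim_labels{
    cluster=~"$cluster", namespace=~"$namespace",
    label_app_kubernetes_io_component=~"(ingester).*"
  }
)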

@QuentinBisson
Contributor

@dimitarvdimitrov I'm not sure what you mean by adding the label to the jsonnet mixin, but I'm fine implementing a working solution :D

The main issue with using kube_persistentvolumeclaim_labels is that newer versions of kube-state-metrics do not expose any labels by default (since 2.11 I think) and they need to be explicitly asked for.

@dimitarvdimitrov
Contributor

adding the label to the jsonnet mixin

I meant adding the label to the resources created by the jsonnet library to deploy Mimir. This will help with having a single label selector in the PromQL query, because it will match both jsonnet and helm deployments - they will have some label in common.

The main issue with using kube_persistentvolumeclaim_labels is that newer versions of kube-state-metrics do not expose any labels by default (since 2.11 I think) and they need to be explicitly asked for

That doesn't sound great. So the metric is just empty by default? I couldn't find this change in the changelog. Do you have a link to the PR or changelog entry?

@QuentinBisson
Contributor

I was wrong, this happened in kube-state-metrics 2.10 https://github.com/kubernetes/kube-state-metrics/releases/tag/v2.10.0 (cf. the top message)

@QuentinBisson
Contributor

Regarding the fix, I would assume this would be needed under operations/mimir? In that case I'm not sure I would be the best person to fix it, because I'm definitely lost in that folder.

@QuentinBisson
Contributor

QuentinBisson commented May 14, 2024

@dimitarvdimitrov what do you think about using something like

kubelet_volume_stats_used_bytes{cluster_id=~"$cluster", namespace=~"$namespace", persistentvolumeclaim=~".*(ingester)-.*"}
/
kubelet_volume_stats_capacity_bytes{cluster_id=~"$cluster", namespace=~"$namespace", persistentvolumeclaim=~".*(ingester)-.*"}

instead of relying on the kube-state-metrics labels metric?

@dimitarvdimitrov
Contributor

dimitarvdimitrov commented May 24, 2024

That's a neat idea and is much simpler too. I don't see a reason why it wouldn't work. Happy to review a PR to carry out that change.
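
For reference, a sketch of how the full panel query could end up looking with that approach, keeping the existing max by(persistentvolumeclaim) aggregation (the exact cluster label - cluster vs cluster_id - depends on the setup; this is illustrative, not the contents of the final PR):

# illustrative sketch of the PVC-name pattern-matching approach
max by(persistentvolumeclaim) (
  kubelet_volume_stats_used_bytes{cluster=~"$cluster", namespace=~"$namespace", persistentvolumeclaim=~".*(ingester).*"}
  /
  kubelet_volume_stats_capacity_bytes{cluster=~"$cluster", namespace=~"$namespace", persistentvolumeclaim=~".*(ingester).*"}
)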

@QuentinBisson
Contributor

PR to fix it #8212
