
[Feature] Improve metrics #5689

Closed
2 tasks done
eloo opened this issue Dec 14, 2022 · 35 comments
Assignees
Labels
enhancement New feature or request observability

Comments

@eloo

eloo commented Dec 14, 2022

Problem Statement

Hi,
I'm currently integrating our Kyverno metrics into our Grafana instance, and I want to create alerts for violated policies.

But it looks like the current metrics are not really useful for such a case.

E.g. the metric kyverno_policy_results_total is a counter, which means it increases the whole time.
Counters are mostly used for metrics you want to compute a rate over, for example requests per second.
But in my opinion this doesn't make much sense for Kyverno policies.
At least I'm not really interested in how many policies fail per second during a background scan.
I'm more interested in the current number of failed policies from the last run, so I get the current state of my policies.

With the current state of failed or passed policies I would be able to create proper alerts for monitoring.
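To illustrate the difference: the existing counter only supports rate-style questions, while a 0/1 gauge would answer state questions directly. Both queries below are sketches; the gauge metric is the hypothetical one proposed in this issue:

```promql
# What the existing counter supports: failures per second over 5 minutes
rate(kyverno_policy_results_total{rule_result="fail"}[5m])

# What a state gauge would support: which policies are failing right now
kyverno_policy_result{rule_result="fail"} == 1
```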

Solution Description

Refactor the policy result metric (or create a new one) so that it represents the current state.

This could be done, for example, by setting the value to 0/1 for the current state of a policy.
This would also match the result we see in the policy-reporter-ui.

A good example of how such metrics could look is Flux:

# HELP gotk_reconcile_condition The current condition status of a GitOps Toolkit resource reconciliation.
# TYPE gotk_reconcile_condition gauge
gotk_reconcile_condition{kind="Kustomization",name="podinfo",namespace="gotk-system",status="Deleted",type="Ready"} 0
gotk_reconcile_condition{kind="Kustomization",name="podinfo",namespace="gotk-system",status="False",type="Ready"} 0
gotk_reconcile_condition{kind="Kustomization",name="podinfo",namespace="gotk-system",status="True",type="Ready"} 1
gotk_reconcile_condition{kind="Kustomization",name="podinfo",namespace="gotk-system",status="Unknown",type="Ready"} 0

fluxcd/flux2#329

Here we see the current state of the gotk reconcilers.
In this example the podinfo reconciler is currently in the following state:

ready: true

For Kyverno policies, a metric could look like this:

kyverno_policy_result{policy_background_mode="true", policy_name="temporary-role-binding", policy_namespace="-", policy_type="cluster", policy_validation_mode="audit", resource_kind="RBACDefinition", resource_request_operation="create", rule_execution_cause="background_scan", rule_name="temporary-role-binding", rule_type="validate", rule_result="fail"} 1
kyverno_policy_result{policy_background_mode="true", policy_name="temporary-role-binding", policy_namespace="-", policy_type="cluster", policy_validation_mode="audit", resource_kind="RBACDefinition", resource_request_operation="create", rule_execution_cause="background_scan", rule_name="temporary-role-binding", rule_type="validate", rule_result="pass"} 0
kyverno_policy_result{policy_background_mode="true", policy_name="temporary-role-binding", policy_namespace="-", policy_type="cluster", policy_validation_mode="audit", resource_kind="RBACDefinition", resource_request_operation="create", rule_execution_cause="background_scan", rule_name="temporary-role-binding", rule_type="validate", rule_result="skip"} 0

I'm not sure if rule_result="skip" makes sense; maybe not.

Alternatives

No response

Additional Context

No response

Slack discussion

No response

Research

  • I have read and followed the documentation AND the troubleshooting guide.
  • I have searched other issues in this repository and mine is not recorded.
@eloo eloo added enhancement New feature or request triage Default label assigned to all new issues indicating label curation is needed to fully organize. labels Dec 14, 2022
@chipzoller chipzoller added prometheus observability and removed triage Default label assigned to all new issues indicating label curation is needed to fully organize. prometheus labels Dec 14, 2022
@eddycharly
Member

Let me check!

@eddycharly eddycharly self-assigned this Dec 14, 2022
@eddycharly
Member

The main problem here is that a policy itself has no state; we need a resource to produce a result.

And the resource is not part of the labels because the cardinality could be really high.

@eloo
Author

eloo commented Dec 14, 2022

But this resource should already be available... isn't it the reports?

The reporter-ui has this information, so the information is already present somewhere.

But currently the metric is more or less useless.

@eddycharly
Member

Let's imagine we make this metric a gauge: what should the value represent?

@eloo
Author

eloo commented Dec 14, 2022

The value should represent the last state.

But I have just found that the information is already available in policy-reporter, so maybe that is the way to go here?

E.g. in policy-reporter we have

policy_report_result{category="Data loss",kind="RdsCluster",name="trial-service-dev-rds-cluster",namespace="dev",policy="rds-enforce-deletion-protection",report="cpol-rds-enforce-deletion-protection",rule="rds-enforce-deletion-protection-cluster-claim",severity="high",source="kyverno",status="pass"} 1
policy_report_result{category="Data loss",kind="RdsCluster",name="trial-service-dev-rds-cluster",namespace="dev",policy="rds-enforce-final-snapshot",report="cpol-rds-enforce-final-snapshot",rule="rds-enforce-final-snapshot-cluster-claim",severity="high",source="kyverno",status="fail"} 1

So it looks like this is the more interesting endpoint for monitoring.

But then I'm curious: what is the benefit of the Kyverno metrics?

@eddycharly
Member

This is going to produce very high cardinality metrics; imagine a large cluster with lots of pods.
Prometheus is not going to like it.

@eloo
Author

eloo commented Dec 14, 2022

Right now the cardinality of policy-reporter is around 2000 in my case, with around 300 pods and around 60 rules in 14 cluster policies.

For Kyverno the cardinality is around 750, but the Kyverno metrics don't provide a real benefit here.

Another approach to reduce the cardinality of policy-reporter could be to only scrape the summaries of the policies,

but sadly the summary is missing important labels that are available in the report result:

policy_report_summary{name="cpol-deployment-has-multiple-replicas",namespace="pre",status="Error"}
policy_report_result{category="High Availability",kind="Pod",name="trial-service-primary-6bc85bbf65-4c9dx",namespace="pre",policy="require-pod-probes",report="cpol-require-pod-probes",rule="validate-livenessProbe-readinessProbe",severity="high",source="kyverno",status="pass"} 1

So category, policy (in the summary the policy appears slightly mutated as report), and severity are missing in policy_report_summary.

So the question for Kyverno (only the Kyverno component) is whether its metrics make sense, or whether it's better to use the metrics from policy-reporter.
In that case, maybe we should move this issue to policy-reporter and improve the metrics there.

@eddycharly
Member

Of course it would make sense for Kyverno to export relevant metrics, as long as it is doable. That doesn't seem to be the case here.

Imagine you have 20000 pods with 10 policies that have 2 rules each, and combine this with the 6 possible result values:

20000 * 10 * 2 * 6 = 2400000

That gives a cardinality of 2,400,000.
I wonder how policy-reporter manages this.

@eddycharly
Member

cc @fjogeleit

@eloo
Author

eloo commented Dec 14, 2022

Okay, yes, this could be a problem :D

Maybe policy-reporter is not managing this right now.

So it may make sense to only include the policy_report_summary with a couple of well-chosen labels.

In my current use case I would like to know how many violations I have for each policy,
maybe grouped by namespace, severity, category, and policy name, so I can decide whether it needs urgent investigation or can be postponed a bit.
For the investigation I could then use the policy-reporter-ui to dig into the details of the violations.

@fjogeleit
Member

fjogeleit commented Dec 14, 2022

Hey @eloo, for the cardinality issue it is possible to customize the policy-reporter metrics:

https://kyverno.github.io/policy-reporter/guide/helm-chart-core#metric-customization

So it's possible to define only the labels you are interested in and reduce the cardinality to the needed minimum. This way you configure the policy_report_result metric, and the summary metric is dropped I think, because it basically reproduces the summary of a (Cluster)PolicyReport without useful labels, as you said.
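Based on the linked docs and the settings mentioned later in this thread, a Helm values sketch for this customization might look like the following (key names are assumptions; verify them against the docs for your chart version):

```yaml
# policy-reporter Helm values (sketch)
metrics:
  enabled: true
  mode: simple        # fewer labels than the default "detailed" mode
  filter:
    status:
      # drop results you don't alert on to cut cardinality even further
      exclude: ["pass", "skip"]
```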

@eloo
Author

eloo commented Dec 14, 2022

@fjogeleit
Ah nice, that really looks promising!
I will give it a try and report back.

@eddycharly
Member

@fjogeleit then policy-reporter accumulates the metrics based on the chosen labels?

@fjogeleit
Member

@eddycharly yes.

@eddycharly
Member

@eloo please keep us updated on your findings; this can be useful for improving the current metrics.

@fjogeleit
Member

fjogeleit commented Dec 14, 2022

I would also be thankful for feedback on whether this customization helps.

In your case you might also check out the metric filters, to exclude for example pass and skip results and reduce the cardinality even further.

@eloo
Author

eloo commented Dec 14, 2022

Okay, I have changed the metrics setting in the reporter to "simple" in our sandbox cluster, and it looks pretty good right now:

policy_report_result{category="High Availability",namespace="dev",policy="require-pdb",severity="high",source="kyverno",status="pass"} 1
policy_report_result{category="High Availability",namespace="dev",policy="require-pdb",severity="high",source="kyverno",status="skip"} 1

I have also compared these metrics to the UI, and they seem to match pretty well, so this metric can be used to create alerts on a per-policy/namespace basis, which covers my use case so far.

For this "small" cluster the cardinality of policy-reporter is now around 130;
the Kyverno cardinality is around 370.

So the policy-reporter metrics now scale pretty well (keep in mind we have fewer policies and resources in the sandbox cluster).

But I have seen that the kyverno-plugin for policy-reporter now also reports metrics;
so far only a single interesting one:

# HELP kyverno_policy List of all Policies
# TYPE kyverno_policy gauge
kyverno_policy{background="true",category="",kind="ClusterPolicy",namespace="",policy="rds-enforce-backups",rule="rds-enforce-backups-cluster",severity="",type="validation",validationFailureAction="audit"} 1
...

As the metrics from the kyverno-plugin are not really useful so far, I will disable the metrics of the plugin.

So with the metrics from the reporter plugin in the 'simple' setting it looks pretty good.

@eddycharly
Member

@eloo how are you going to leverage the metric for alerting?

@eloo
Author

eloo commented Dec 14, 2022

@eddycharly not sure right now,
but I guess I'll simply check specific severities or policies for the status "fail", and if the value is above 0 I will fire an alert:

policy_report_result{category="High Availability",namespace="dev",policy="require-pdb",severity="high",source="kyverno",status="fail"} 1

e.g.

policy_report_result{namespace="live",severity="high",status="fail"} > 0
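With the Prometheus Operator, such a check can be expressed as a PrometheusRule. A sketch (alert name, threshold, and `for` duration are illustrative; the expression aggregates over the simple-mode labels):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kyverno-policy-violations
spec:
  groups:
    - name: kyverno-policies
      rules:
        - alert: HighSeverityPolicyFailing
          # fire when any high-severity policy reports at least one failure
          expr: sum by (namespace, policy) (policy_report_result{severity="high", status="fail"}) > 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: 'Policy {{ $labels.policy }} is failing in namespace {{ $labels.namespace }}'
```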

@eloo
Author

eloo commented Dec 16, 2022

@fjogeleit
hi, we have just realized that even with the kyverno-plugin metrics disabled in the Helm chart, a ServiceMonitor is still deployed for the kyverno-plugin:
https://github.com/kyverno/policy-reporter/blob/main/charts/policy-reporter/charts/monitoring/templates/kyverno-servicemonitor.yaml

I guess a check for metrics: true should be added here, so the ServiceMonitor is only created if it is true.

@fjogeleit
Member

Hm, that's unfortunately not possible this way, because I cannot access a value of the kyvernoPlugin within the monitoring chart. You should be able to disable it by setting monitoring.plugin.kyverno to false.

@eloo
Author

eloo commented Dec 16, 2022

@fjogeleit
monitoring.plugin.kyverno is false by default as far as I can see.

I guess the problem is the or here:

{{- if or .Values.plugins.kyverno .Values.global.plugins.kyverno -}}

because global.plugins.kyverno is enabled, and then it cannot be disabled in monitoring.

@fjogeleit
Member

fjogeleit commented Dec 16, 2022

Hm, the global.plugins.kyverno config is only a shortcut to enable it everywhere (UI + monitoring). You can set it to false as well and enable it only for the UI via ui.plugins.kyverno.

I will try to improve this in the future.

@eloo
Author

eloo commented Dec 16, 2022

@fjogeleit but I want to enable the plugin itself;
I just don't want the ServiceMonitor, because I have disabled the metrics anyway.

So I guess just removing the global setting should do the trick.

@fjogeleit
Member

You don't need the global config to enable the plugin.

You can enable the plugin with kyvernoPlugin.enabled and in the UI with ui.plugins.kyverno, and everything should work as expected with no ServiceMonitor created, assuming that global.plugins.kyverno is false (which is also the default).
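Putting the keys from this discussion together, the resulting values file would look roughly like this (a sketch assembled from the comments above; verify against your chart version):

```yaml
# policy-reporter Helm values (sketch)
global:
  plugins:
    kyverno: false   # default; leaving this false avoids the extra ServiceMonitor
kyvernoPlugin:
  enabled: true      # run the plugin itself
ui:
  plugins:
    kyverno: true    # show plugin data in the UI
monitoring:
  plugin:
    kyverno: false   # no ServiceMonitor for the kyverno-plugin
```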

@eloo
Author

eloo commented Dec 16, 2022

@fjogeleit will check it soon.

But so far I have an update on the metrics.

@eddycharly @fjogeleit
In the test cluster, policy-reporter now has a cardinality of around 230 (before it was 2000 with "detailed"),
and Kyverno itself around 1500 (it looks like the 700 before was from the wrong instance, because we have 3 of them).

The cardinality of policy-reporter is way lower than Kyverno's, and the metrics are more useful.

The highest-cardinality metrics are:
kyverno_admission_review_duration_seconds_bucket: 588
kyverno_client_queries_total: 319
kyverno_policy_execution_duration_seconds_bucket: 288

Maybe it would be good if the metrics could also be configured, as is the case in policy-reporter.

@eddycharly
Member

The highest-cardinality metrics are:
kyverno_admission_review_duration_seconds_bucket: 588
kyverno_client_queries_total: 319
kyverno_policy_execution_duration_seconds_bucket: 288

Those numbers sound acceptable to me.
The labels used for those metrics are controlled and should not generate high cardinality; what would be the point of configuring them further?

@fjogeleit
Member

I think the Kyverno metrics have a different intention than the metrics created by Policy Reporter: they focus more on the performance and stability of Kyverno and its controllers.

If I remember correctly, Kyverno metrics have no persisted state, so if Kyverno restarts, the previous metrics regarding policy results are no longer available.

And if Kyverno produced the same metrics as Policy Reporter, Policy Reporter would become useless ^^".

@eddycharly
Member

I think the Kyverno metrics have a different intention than the metrics created by Policy Reporter: they focus more on the performance and stability of Kyverno and its controllers.

True

If I remember correctly, Kyverno metrics have no persisted state, so if Kyverno restarts, the previous metrics regarding policy results are no longer available.

Most metrics are counters and histograms; they only apply to the lifetime of the pod.
We have a few gauges too. Their implementation will change in 1.9: in previous versions it was not possible to unregister a metric, so metrics kept being exposed even when a policy was deleted from the cluster (IIRC the trick was to set the value to 0).

This will no longer be the case in 1.9.

And if Kyverno produced the same metrics as Policy Reporter, Policy Reporter would become useless ^^".

They are definitely complementary.
Policy Reporter transforms policy reports into metrics with a configurable system.
Kyverno could do the same thing, but I don't see a point in doing it, as Policy Reporter does it pretty well.

@eddycharly
Member

Is there anything left to discuss here?

@eddycharly eddycharly added the stale Stale issue, may be closed in the near future if nothing happens label Feb 10, 2023
@eloo
Author

eloo commented Feb 10, 2023

@eddycharly from my side it's now clear.
Maybe a hint in the docs would be useful, stating that metrics about the policies should be scraped from the reporter, and that Kyverno itself "only" reports runtime metrics about itself.

thanks

@james-callahan
Contributor

james-callahan commented Jul 6, 2023

@eddycharly we're seeing kyverno_policy_execution_duration_seconds_bucket take up a significant amount of cardinality in our monitoring stack.

@nabadger

nabadger commented Aug 2, 2023

We see the same issue regarding kyverno_policy_execution_duration_seconds_bucket (the sum and count series are fine). We will probably have to drop this metric in our stack for now (it's not actually used in the Kyverno dashboards and we have no alerts based on it).

@vat-dsipe

vat-dsipe commented Oct 11, 2023

We see the same issue regarding kyverno_policy_execution_duration_seconds_bucket (the sum and count series are fine). We will probably have to drop this metric in our stack for now (it's not actually used in the Kyverno dashboards and we have no alerts based on it).

Hi, could you advise how to drop a metric, in particular this one, please?
Thanks in advance for your help.

We are running Kyverno 1.9.4, installed with Helm.

@fjogeleit
Member

You can configure the metricRelabelings of the ServiceMonitors to drop this metric:

metricRelabelings:
- sourceLabels:
  - __name__
  regex: kyverno_policy_execution_duration_seconds_bucket
  action: drop
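For reference, in a raw ServiceMonitor manifest (a Prometheus Operator CRD) the same relabeling lives under `spec.endpoints[]`. A sketch; the selector labels and port name depend on your install and are assumptions here:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kyverno-metrics
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kyverno   # adjust to your release's service labels
  endpoints:
    - port: metrics-port                # adjust to the metrics port name of the service
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: kyverno_policy_execution_duration_seconds_bucket
          action: drop
```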

@dosubot dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 6, 2024
@dosubot dosubot bot removed the stale Stale issue, may be closed in the near future if nothing happens label Sep 6, 2024
7 participants