
[Feature] Improve metrics #5689

Closed
2 tasks done
eloo opened this issue Dec 14, 2022 · 35 comments
Assignees
Labels
enhancement New feature or request observability

Comments

@eloo

eloo commented Dec 14, 2022

Problem Statement

Hi,
I'm currently integrating our Kyverno metrics into our Grafana instance, and I want to create alerts for violated policies.

But it looks like the current metrics are not really useful for such a case.

E.g. the metric kyverno_policy_results_total is a counter, which means it increases the whole time.
Counters are mostly used for metrics you want to compute a rate over, for example requests per second.
But in my opinion this doesn't make much sense for Kyverno policies.
At least I'm not really interested in how many policies fail per second during a background scan.
I'm more interested in the current number of failed policies from the last run, so I get the current state of my policies.

With the current state of failed or passed policies I would be able to create proper alerts for monitoring.
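To illustrate the difference: the existing counter only supports rate-style questions, while a 0/1 gauge would answer state questions directly. Both queries below are sketches; the gauge metric is the hypothetical one proposed in this issue:

```promql
# What the existing counter supports: failures per second over 5 minutes
rate(kyverno_policy_results_total{rule_result="fail"}[5m])

# What a state gauge would support: which policies are failing right now
kyverno_policy_result{rule_result="fail"} == 1
```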

Solution Description

Refactor the policy result metric (or create a new one) so that it represents the current state.

This could be done, for example, by setting the value to 0/1 for the current state of a policy.
This would also match the result we see in the policy-reporter-ui.

A good example of how such metrics could look is Flux:

# HELP gotk_reconcile_condition The current condition status of a GitOps Toolkit resource reconciliation.
# TYPE gotk_reconcile_condition gauge
gotk_reconcile_condition{kind="Kustomization",name="podinfo",namespace="gotk-system",status="Deleted",type="Ready"} 0
gotk_reconcile_condition{kind="Kustomization",name="podinfo",namespace="gotk-system",status="False",type="Ready"} 0
gotk_reconcile_condition{kind="Kustomization",name="podinfo",namespace="gotk-system",status="True",type="Ready"} 1
gotk_reconcile_condition{kind="Kustomization",name="podinfo",namespace="gotk-system",status="Unknown",type="Ready"} 0

fluxcd/flux2#329

Here we see the current state of the gotk reconcilers.
In this example the podinfo reconciler is currently in the following state:

ready: true

For Kyverno policies, a metric could look like this:

kyverno_policy_result{policy_background_mode="true", policy_name="temporary-role-binding", policy_namespace="-", policy_type="cluster", policy_validation_mode="audit", resource_kind="RBACDefinition", resource_request_operation="create", rule_execution_cause="background_scan", rule_name="temporary-role-binding", rule_type="validate", rule_result="fail"} 1
kyverno_policy_result{policy_background_mode="true", policy_name="temporary-role-binding", policy_namespace="-", policy_type="cluster", policy_validation_mode="audit", resource_kind="RBACDefinition", resource_request_operation="create", rule_execution_cause="background_scan", rule_name="temporary-role-binding", rule_type="validate", rule_result="pass"} 0
kyverno_policy_result{policy_background_mode="true", policy_name="temporary-role-binding", policy_namespace="-", policy_type="cluster", policy_validation_mode="audit", resource_kind="RBACDefinition", resource_request_operation="create", rule_execution_cause="background_scan", rule_name="temporary-role-binding", rule_type="validate", rule_result="skip"} 0

I'm not sure if rule_result="skip" makes sense; maybe not.

Alternatives

No response

Additional Context

No response

Slack discussion

No response

Research

  • I have read and followed the documentation AND the troubleshooting guide.
  • I have searched other issues in this repository and mine is not recorded.
@eloo eloo added enhancement New feature or request triage Default label assigned to all new issues indicating label curation is needed to fully organize. labels Dec 14, 2022
@chipzoller chipzoller added prometheus observability and removed triage Default label assigned to all new issues indicating label curation is needed to fully organize. prometheus labels Dec 14, 2022
@eddycharly
Member

Let me check!

@eddycharly eddycharly self-assigned this Dec 14, 2022
@eddycharly
Member

The main problem here is that a policy itself has no state; we need a resource to produce a result.

And the resource is not part of the labels because the cardinality could be really high.

@eloo
Author

eloo commented Dec 14, 2022

But this resource should already be available... isn't it the reports?

The reporter-ui has this information, so the information is already present somewhere.

But currently the metric is more or less useless.

@eddycharly
Member

Let's imagine we make this metric a gauge: what should the value represent?

@eloo
Author

eloo commented Dec 14, 2022

The value should represent the last state.

But I have just found that the information is already available in policy-reporter, so maybe that is the way to go here?

E.g. in policy-reporter we have

policy_report_result{category="Data loss",kind="RdsCluster",name="trial-service-dev-rds-cluster",namespace="dev",policy="rds-enforce-deletion-protection",report="cpol-rds-enforce-deletion-protection",rule="rds-enforce-deletion-protection-cluster-claim",severity="high",source="kyverno",status="pass"} 1
policy_report_result{category="Data loss",kind="RdsCluster",name="trial-service-dev-rds-cluster",namespace="dev",policy="rds-enforce-final-snapshot",report="cpol-rds-enforce-final-snapshot",rule="rds-enforce-final-snapshot-cluster-claim",severity="high",source="kyverno",status="fail"} 1

So it looks like this is the more interesting endpoint for monitoring.

But then I'm curious: what is the benefit of the Kyverno metrics?

@eddycharly
Member

This is going to produce very high cardinality metrics; imagine a large cluster with lots of pods.
Prometheus is not going to like it.

@eloo
Author

eloo commented Dec 14, 2022

Right now the cardinality of policy-reporter is around 2000 in my case, with around 300 pods and around 60 rules in 14 cluster policies.

For Kyverno the cardinality is around 750, but the Kyverno metrics don't provide a real benefit here.

Another approach to reduce the cardinality of policy-reporter could be to only scrape the summaries of the policies,

but sadly the summary is missing important labels that are available in the report result:

policy_report_summary{name="cpol-deployment-has-multiple-replicas",namespace="pre",status="Error"}
policy_report_result{category="High Availability",kind="Pod",name="trial-service-primary-6bc85bbf65-4c9dx",namespace="pre",policy="require-pod-probes",report="cpol-require-pod-probes",rule="validate-livenessProbe-readinessProbe",severity="high",source="kyverno",status="pass"} 1

So category, policy (in the summary the policy appears slightly mutated as report), and severity are missing in policy_report_summary.

So the question for Kyverno (only the Kyverno component) is whether its metrics make sense, or whether it's better to use the metrics from policy-reporter.
In that case, maybe we should move this issue to policy-reporter and improve the metrics there.

@eddycharly
Member

Of course it would make sense for Kyverno to export relevant metrics, as long as it is doable. That doesn't seem to be the case here.

Imagine you have 20000 pods with 10 policies that have 2 rules each, and combine this with the 6 possible result values:

20000 * 10 * 2 * 6 = 2400000

That gives a cardinality of 2,400,000.
I wonder how policy-reporter manages this.

@eddycharly
Member

cc @fjogeleit

@eloo
Author

eloo commented Dec 14, 2022

Okay, yes, this could be a problem :D

Maybe policy-reporter is not managing this right now.

So it may make sense to only include the policy_report_summary with a couple of well-chosen labels.

In my current use case I would like to know how many violations I have for each policy,
maybe grouped by namespace, severity, category, and policy name, so I can decide whether it needs urgent investigation or can be postponed a bit.
For the investigation I could then use the policy-reporter-ui to dig into the details of the violations.

@fjogeleit
Member

fjogeleit commented Dec 14, 2022

Hey @eloo, for the cardinality issue it is possible to customize the policy-reporter metrics:

https://kyverno.github.io/policy-reporter/guide/helm-chart-core#metric-customization

So it's possible to define only the labels you are interested in and reduce the cardinality to the needed minimum. This way you configure the policy_report_result metric, and the summary metric is dropped I think, because it basically reproduces the summary of a (Cluster)PolicyReport without useful labels, as you said.
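Based on the linked docs and the settings mentioned later in this thread, a Helm values sketch for this customization might look like the following (key names are assumptions; verify them against the docs for your chart version):

```yaml
# policy-reporter Helm values (sketch)
metrics:
  enabled: true
  mode: simple        # fewer labels than the default "detailed" mode
  filter:
    status:
      # drop results you don't alert on to cut cardinality even further
      exclude: ["pass", "skip"]
```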

@eloo
Author

eloo commented Dec 14, 2022

@fjogeleit
Ah nice, that really looks promising!
I will give it a try and report back.

@eddycharly
Member

@fjogeleit then policy-reporter accumulates the metrics based on the chosen labels?

@fjogeleit
Member

@eddycharly yes.

@eddycharly
Member

@eloo please keep us updated on your findings; this can be useful for improving the current metrics.

@fjogeleit
Member

fjogeleit commented Dec 14, 2022

I would also be thankful for feedback on whether this customization helps.

In your case you might also check out the metric filters, to exclude for example pass and skip results and reduce the cardinality even further.

@eloo
Author

eloo commented Dec 14, 2022

Okay, I have changed the metrics setting in the reporter to "simple" in our sandbox cluster, and it looks pretty good right now:

policy_report_result{category="High Availability",namespace="dev",policy="require-pdb",severity="high",source="kyverno",status="pass"} 1
policy_report_result{category="High Availability",namespace="dev",policy="require-pdb",severity="high",source="kyverno",status="skip"} 1

I have also compared these metrics to the UI, and they seem to match pretty well, so this metric can be used to create alerts on a per-policy/namespace basis, which covers my use case so far.

For this "small" cluster the cardinality of policy-reporter is now around 130;
the Kyverno cardinality is around 370.

So the policy-reporter metrics now scale pretty well (keep in mind we have fewer policies and resources in the sandbox cluster).

But I have seen that the kyverno-plugin for policy-reporter now also reports metrics;
so far only a single interesting one:

# HELP kyverno_policy List of all Policies
# TYPE kyverno_policy gauge
kyverno_policy{background="true",category="",kind="ClusterPolicy",namespace="",policy="rds-enforce-backups",rule="rds-enforce-backups-cluster",severity="",type="validation",validationFailureAction="audit"} 1
...

As the metrics from the kyverno-plugin are not really useful so far, I will disable the metrics of the plugin.

So with the metrics from the reporter plugin in the 'simple' setting it looks pretty good.

@eddycharly
Member

@eloo how are you going to leverage the metric for alerting?

@eloo
Author

eloo commented Dec 14, 2022

@eddycharly not sure right now,
but I guess I'll simply check specific severities or policies for the status "fail", and if the value is above 0 I will fire an alert:

policy_report_result{category="High Availability",namespace="dev",policy="require-pdb",severity="high",source="kyverno",status="fail"} 1

e.g.

policy_report_result{namespace="live",severity="high",status="fail"} > 0
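With the Prometheus Operator, such a check can be expressed as a PrometheusRule. A sketch (alert name, threshold, and `for` duration are illustrative; the expression aggregates over the simple-mode labels):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kyverno-policy-violations
spec:
  groups:
    - name: kyverno-policies
      rules:
        - alert: HighSeverityPolicyFailing
          # fire when any high-severity policy reports at least one failure
          expr: sum by (namespace, policy) (policy_report_result{severity="high", status="fail"}) > 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: 'Policy {{ $labels.policy }} is failing in namespace {{ $labels.namespace }}'
```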

@eloo
Author

eloo commented Dec 16, 2022

@fjogeleit
hi, we have just realized that even with the kyverno-plugin metrics disabled in the Helm chart, a ServiceMonitor is still deployed for the kyverno-plugin:
https://github.com/kyverno/policy-reporter/blob/main/charts/policy-reporter/charts/monitoring/templates/kyverno-servicemonitor.yaml

I guess a check for metrics: true should be added here, so the ServiceMonitor is only created if it is true.

@fjogeleit
Member

Hm, that's unfortunately not possible this way, because I cannot access a value of the kyvernoPlugin within the monitoring chart. You should be able to disable it by setting monitoring.plugin.kyverno to false.

@eloo
Author

eloo commented Dec 16, 2022

@fjogeleit
monitoring.plugin.kyverno is false by default as far as I can see.

I guess the problem is the or here:

{{- if or .Values.plugins.kyverno .Values.global.plugins.kyverno -}}

because global.plugins.kyverno is enabled, and then it cannot be disabled in monitoring.

@fjogeleit
Member

fjogeleit commented Dec 16, 2022

Hm, the global.plugins.kyverno config is only a shortcut to enable it everywhere (UI + monitoring). You can set it to false as well and enable it only for the UI via ui.plugins.kyverno.

I will try to improve this in the future.

@eloo
Author

eloo commented Dec 16, 2022

@fjogeleit but I want to enable the plugin itself;
I just don't want the ServiceMonitor, because I have disabled the metrics anyway.

So I guess just removing the global setting should do the trick.

@fjogeleit
Member

You don't need the global config to enable the plugin.

You can enable the plugin with kyvernoPlugin.enabled and in the UI with ui.plugins.kyverno, and everything should work as expected with no ServiceMonitor created, assuming that global.plugins.kyverno is false (which is also the default).
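Putting the keys from this discussion together, the resulting values file would look roughly like this (a sketch assembled from the comments above; verify against your chart version):

```yaml
# policy-reporter Helm values (sketch)
global:
  plugins:
    kyverno: false   # default; leaving this false avoids the extra ServiceMonitor
kyvernoPlugin:
  enabled: true      # run the plugin itself
ui:
  plugins:
    kyverno: true    # show plugin data in the UI
monitoring:
  plugin:
    kyverno: false   # no ServiceMonitor for the kyverno-plugin
```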

@eloo
Author

eloo commented Dec 16, 2022

@fjogeleit will check it soon.

But so far I have an update on the metrics.

@eddycharly @fjogeleit
In the test cluster, policy-reporter now has a cardinality of around 230 (before it was 2000 with "detailed"),
and Kyverno itself around 1500 (it looks like the 700 before was from the wrong instance, because we have 3 of them).

The cardinality of policy-reporter is way lower than Kyverno's, and the metrics are more useful.

The highest-cardinality metrics are:
kyverno_admission_review_duration_seconds_bucket: 588
kyverno_client_queries_total: 319
kyverno_policy_execution_duration_seconds_bucket: 288

Maybe it would be good if the metrics could also be configured, as is the case in policy-reporter.

@eddycharly
Member

The highest-cardinality metrics are:
kyverno_admission_review_duration_seconds_bucket: 588
kyverno_client_queries_total: 319
kyverno_policy_execution_duration_seconds_bucket: 288

Those numbers sound acceptable to me.
The labels used for those metrics are controlled and should not generate high cardinality; what would be the point of configuring them further?

@fjogeleit
Member

I think the Kyverno metrics have a different intention than the metrics created by Policy Reporter: they focus more on the performance and stability of Kyverno and its controllers.

If I remember correctly, Kyverno metrics have no persisted state, so if Kyverno restarts, the previous metrics regarding policy results are no longer available.

And if Kyverno produced the same metrics as Policy Reporter, Policy Reporter would become useless ^^".

@eddycharly
Member

I think the Kyverno metrics have a different intention than the metrics created by Policy Reporter: they focus more on the performance and stability of Kyverno and its controllers.

True

If I remember correctly, Kyverno metrics have no persisted state, so if Kyverno restarts, the previous metrics regarding policy results are no longer available.

Most metrics are counters and histograms; they only apply to the lifetime of the pod.
We have a few gauges too. Their implementation will change in 1.9: in previous versions it was not possible to unregister a metric, so metrics kept being exposed even when a policy was deleted from the cluster (IIRC the trick was to set the value to 0).

This will no longer be the case in 1.9.

And if Kyverno produced the same metrics as Policy Reporter, Policy Reporter would become useless ^^".

They are definitely complementary.
Policy Reporter transforms policy reports into metrics with a configurable system.
Kyverno could do the same thing, but I don't see a point in doing it, as Policy Reporter does it pretty well.

@eddycharly
Member

Is there anything left to discuss here?

@eddycharly eddycharly added the stale Stale issue, may be closed in the near future if nothing happens label Feb 10, 2023
@eloo
Author

eloo commented Feb 10, 2023

@eddycharly from my side it's now clear.
Maybe a hint in the docs would be useful, stating that metrics about the policies should be scraped from the reporter, and that Kyverno itself "only" reports runtime metrics about itself.

thanks

@james-callahan
Contributor

james-callahan commented Jul 6, 2023

@eddycharly we're seeing kyverno_policy_execution_duration_seconds_bucket take up a significant amount of cardinality in our monitoring stack.

@nabadger

nabadger commented Aug 2, 2023

We see the same issue regarding kyverno_policy_execution_duration_seconds_bucket (the sum and count series are fine). We will probably have to drop this metric in our stack for now (it's not actually used in the Kyverno dashboards and we have no alerts based on it).

@vat-dsipe

vat-dsipe commented Oct 11, 2023

We see the same issue regarding kyverno_policy_execution_duration_seconds_bucket (the sum and count series are fine). We will probably have to drop this metric in our stack for now (it's not actually used in the Kyverno dashboards and we have no alerts based on it).

Hi, could you advise how to drop a metric, in particular this one, please?
Thanks in advance for your help.

We are running Kyverno 1.9.4, installed with Helm.

@fjogeleit
Member

You can configure the metricRelabelings of the ServiceMonitors to drop this metric:

metricRelabelings:
- sourceLabels:
  - __name__
  regex: kyverno_policy_execution_duration_seconds_bucket
  action: drop
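For reference, in a raw ServiceMonitor manifest (a Prometheus Operator CRD) the same relabeling lives under `spec.endpoints[]`. A sketch; the selector labels and port name depend on your install and are assumptions here:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kyverno-metrics
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kyverno   # adjust to your release's service labels
  endpoints:
    - port: metrics-port                # adjust to the metrics port name of the service
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: kyverno_policy_execution_duration_seconds_bucket
          action: drop
```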

@dosubot dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 6, 2024
@dosubot dosubot bot removed the stale Stale issue, may be closed in the near future if nothing happens label Sep 6, 2024
7 participants