[Feature] Improve metrics #5689
Comments
Let me check!
The main problem here is that a policy itself has no state; we need a resource to produce a result. And the resource is not part of the labels because the cardinality could be really high.
But this resource should already be available. Isn't it in the reports? The policy-reporter UI has this information, so it is already present somewhere. As it stands, the metric is more or less useless.
Let's imagine we make this metric a gauge: what should the value represent?
The value should represent the last state. But I have just found that the information is already available in policy-reporter, so maybe this is the way to go here? E.g. in policy-reporter we have
So it looks like this is the more interesting endpoint for monitoring. But then I'm curious: what is the benefit of the Kyverno metrics?
This is going to produce very high cardinality metrics; imagine a large cluster with lots of pods...
Right now the cardinality of policy-reporter is around 2000 in my case, with around 300 pods and around 60 rules in 14 cluster policies. For Kyverno the cardinality is around 750, but the Kyverno metrics do not provide a real benefit here. Another approach to reduce the cardinality of policy-reporter could be to only get the summaries of the policies, but sadly the summary is missing important labels that are available in the report result
So category, policy (in the summary, policy appears slightly altered as report), and severity are missing in the report_summary. The question for Kyverno (only the Kyverno component) is therefore whether its metrics make sense, or whether it is better to use the metrics from policy-reporter.
Of course it would make sense for Kyverno to export relevant metrics as long as it is doable. It doesn't seem to be the case here. Imagine you have 20000 pods, with 10 policies that have 2 rules each, and combine this with the possible results:
It gives a cardinality of 2400000.
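The arithmetic behind that estimate can be sketched as follows. The pod, policy, and rule counts and the total come from the comment above; the list of possible results did not survive in this copy of the thread, so the last factor is only derived from the quoted total, not taken from any Kyverno documentation.

```python
# Worked arithmetic for the cardinality estimate quoted in the comment above.
pods = 20_000
policies = 10
rules_per_policy = 2

# Series count before the result-related labels are taken into account.
base_series = pods * policies * rules_per_policy

# Total quoted in the comment.
quoted_total = 2_400_000

# Factor contributed by the remaining label values. The values themselves are
# not listed in the thread as captured, so this is inferred from the total.
remaining_factor = quoted_total // base_series

print(base_series)       # 400000
print(remaining_factor)  # 6
```

In other words, each pod/policy/rule combination would have to fan out into six series to reach the quoted figure.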
cc @fjogeleit
Okay, yes, this could be a problem :D Maybe policy-reporter is not handling this right now, so it may make sense to include only the policy_report_summary with a couple of well-chosen labels. In my current use case I would like to know how many violations I have for each policy.
Hey @eloo, regarding the cardinality issue: it is possible to customize the policy-reporter metrics. https://kyverno.github.io/policy-reporter/guide/helm-chart-core#metric-customization So it's possible to define the labels you are interested in and reduce the cardinality to the needed minimum. This way you configure the
@fjogeleit
@fjogeleit then policy-reporter accumulates the metrics based on the chosen labels?
@eddycharly yes.
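To illustrate what accumulating by chosen labels means for cardinality, here is a minimal sketch in plain Python. It is not policy-reporter code: the label names, policy names, and the `aggregate` helper are all hypothetical and only show the general idea of counting results per unique label combination.

```python
from collections import Counter

# Hypothetical per-resource results; label names and values are illustrative only.
results = [
    {"namespace": "default", "policy": "require-labels", "resource": "pod-a", "result": "fail"},
    {"namespace": "default", "policy": "require-labels", "resource": "pod-b", "result": "fail"},
    {"namespace": "default", "policy": "require-labels", "resource": "pod-c", "result": "pass"},
    {"namespace": "prod", "policy": "disallow-latest", "resource": "pod-d", "result": "fail"},
]


def aggregate(results, labels):
    """Count results per unique combination of the chosen label values."""
    return Counter(tuple(r[label] for label in labels) for r in results)


# Keeping the resource label yields one series per resource.
full = aggregate(results, ["namespace", "policy", "resource", "result"])
# Dropping it merges all resources that share namespace, policy, and result.
reduced = aggregate(results, ["namespace", "policy", "result"])

print(len(full))     # 4
print(len(reduced))  # 3
print(reduced[("default", "require-labels", "fail")])  # 2
```

The fewer labels are kept, the more results collapse into the same series, and the value of each series becomes a count instead of a single result.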
@eloo please keep us updated on your findings; this can be useful to improve the current metrics.
I would also be thankful for feedback on whether this customization helps. In your case you may also check out the metric filters to exclude, for example, pass and skip results to reduce the cardinality even further.
Okay, so I have changed the metrics setting in the reporter to "simple" in our sandbox cluster, and this looks pretty good right now
I have also compared these metrics to the UI, and they seem to match pretty well, so this metric can be used to create alerts on a per-policy/namespace basis, which covers my use case so far. For this "small" cluster the cardinality of policy-reporter is now around 130, so the policy-reporter metrics are scaling pretty well (keep in mind we have fewer policies and resources in the sandbox cluster). But I have seen that the kyverno-plugin for policy-reporter is now also reporting metrics..
As the metrics from kyverno_plugin are not really useful so far, I will disable the plugin's metrics. So with the metrics from reporter-plugin in the 'simple' setting, it looks pretty good.
@eloo how are you going to leverage the metric for alerting?
@eddycharly not sure right now..
e.g.
@fjogeleit I guess here it should be added if
Hm, that's unfortunately not possible this way because I cannot access a value of the kyvernoPlugin within the monitoring plugin. You should be able to disable it by setting
@fjogeleit I guess the problem is the
because global.plugins.kyverno is enabled and then cannot be disabled in monitoring
Hm, the I will try to improve this in the future.
@fjogeleit but I want to enable the plugin itself, so I guess just removing the global... should do the trick
You don't need the global config to enable the plugin. You can enable the plugin with
@fjogeleit will check it soon, but so far I have an update on the metrics. @eddycharly @fjogeleit the cardinality of policy-reporter is way lower than Kyverno's, and its metrics are more useful. The metrics with the highest cardinality are here: Maybe it would be good if the metrics could also be configured, as is the case in policy-reporter.
Those numbers sound acceptable to me.
I think the Kyverno metrics had a different intention than the metrics created by Policy Reporter. They focus more on the performance and stability of Kyverno and its controllers. If I remember correctly, Kyverno metrics also have no persisted state, so if Kyverno restarts, the previous metrics regarding policy results are no longer available. And if Kyverno produced the same metrics as Policy Reporter, Policy Reporter would become useless ^^".
True
Most metrics are counters and histograms, so they only apply to the lifetime of the pod. This will no longer be the case in 1.9.
They are definitely complementary.
Is there anything left to discuss here?
@eddycharly from my side it's now clear. Thanks!
@eddycharly we're seeing
We see the same issue regarding
Hi, could you advise how I would be able to drop a metric, in particular this one, please?
You can configure the metricRelabelings of the service monitors to drop this metric:
Problem Statement
Hi,
I'm just now integrating our Kyverno metrics into our Grafana instance, and I want to create some alerts for violated policies.
But it looks like the current metrics are not really useful for such a case.
E.g.
The metric kyverno_policy_results_total is a counter, which means it only ever increases. Counters are mostly used for metrics that I want to compute a rate over, for example requests per second.
But in my opinion this does not really make sense for Kyverno policies.
At least, I'm not really interested in how many policies are failing per second during a background scan.
I would be more interested in the current number of failed policies from the last run, so that I get the current state of my policies.
With the current state of failing or passing policies, I would be able to create proper alerts for monitoring.
Solution Description
Refactor the policy result metric (or create a new one) so that it represents the current state.
This can be done, for example, by setting the value to 0/1 for the current state of a policy.
This would also match the result we see in the policy-reporter-ui
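The difference between the two semantics can be sketched in plain Python. This is only an illustration of the proposal, not Kyverno code: the policy names and scan results are made up, and the dictionaries stand in for a counter and a 0/1 gauge.

```python
# Hypothetical results of two consecutive background scans: policy name -> passed?
scan_1 = {"require-labels": False, "disallow-latest": True}
scan_2 = {"require-labels": True, "disallow-latest": True}

fail_counter = {}  # counter semantics: failures accumulate across scans
fail_gauge = {}    # gauge semantics: 1 if the policy failed in the latest scan, else 0

for scan in (scan_1, scan_2):
    for policy, passed in scan.items():
        failed = 0 if passed else 1
        fail_counter[policy] = fail_counter.get(policy, 0) + failed
        fail_gauge[policy] = failed

print(fail_counter)  # {'require-labels': 1, 'disallow-latest': 0}
print(fail_gauge)    # {'require-labels': 0, 'disallow-latest': 0}
```

After the second scan the counter still reports the earlier failure, while the gauge shows that nothing is failing now, which is the value an alert on the current state would need.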
A good example of how such metrics could look is Flux:
fluxcd/flux2#329
There we see the current state of the gotk reconcilers.
So in this example the podinfo reconciler is currently in the following state
For example, for Kyverno policies a metric could look like this:
Not sure if rule_result="skip" makes sense, maybe not
Alternatives
No response
Additional Context
No response
Slack discussion
No response
Research