Considerably high metric cardinality #1479

Closed
paulfantom opened this issue Jun 4, 2021 · 7 comments

Comments

@paulfantom
Contributor

Describe the bug

A basic installation of Flux v2 produces ~6000 metric series. The majority (~5000) of those come from the rest_client_request_latency_seconds_.* buckets. As far as I can see, only a small subset of the data from those metrics is actually used (I found them used in only one panel of the "Flux Control Plane" dashboard).

Are those used for anything else? If so, maybe there would be a way to reduce their cardinality?

To Reproduce

Steps to reproduce the behaviour:

  1. Applied https://github.com/fluxcd/flux2/releases/download/v0.14.2/install.yaml to a cluster
  2. All manifests used for installation are available at https://github.com/thaum-xyz/ankhmorpork/tree/d24dd02a77479c7884965a23b394a9ee86f279a4

Expected behavior

Fewer metrics, but of high quality.

Additional context

  • Kubernetes version: k3s 1.19.7
  • Git provider: ---
  • Container registry provider: ---

Below please provide the output of the following commands:

flux --version
flux check
kubectl -n <namespace> get all
kubectl -n <namespace> logs deploy/source-controller
kubectl -n <namespace> logs deploy/kustomize-controller
@paulfantom
Contributor Author

Data from the Prometheus query count({job="flux-system/flux-system"}):

Just after installation: 6522
After discarding rest_client_request_latency_seconds_.*: 950
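
A rough way to see which metric families dominate such a count, without querying Prometheus, is to tally the sample lines in a controller's /metrics output and then apply the same name filter. The sketch below is only an illustration: the sample payload, label values, and helper names (`series_per_metric`, `drop_matching`) are made up, and the totals will not match the Prometheus count() result exactly, since that query covers every scraped target in the job.

```python
import re
from collections import Counter


def series_per_metric(exposition_text: str) -> Counter:
    """Count series (sample lines) per metric name in Prometheus text-format output."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first '{' (labels) or whitespace (value).
        name = re.split(r"[{\s]", line, maxsplit=1)[0]
        counts[name] += 1
    return counts


def drop_matching(counts: Counter, pattern: str) -> Counter:
    """Drop metric names that fully match `pattern` (relabel regexes are anchored)."""
    regex = re.compile(pattern)
    return Counter({n: c for n, c in counts.items() if not regex.fullmatch(n)})


# Hypothetical sample payload; real controller output is far larger.
SAMPLE = """\
# HELP rest_client_request_latency_seconds Request latency in seconds.
# TYPE rest_client_request_latency_seconds histogram
rest_client_request_latency_seconds_bucket{url="https://10.0.0.1/api",verb="GET",le="0.001"} 0
rest_client_request_latency_seconds_bucket{url="https://10.0.0.1/api",verb="GET",le="+Inf"} 3
rest_client_request_latency_seconds_sum{url="https://10.0.0.1/api",verb="GET"} 0.12
rest_client_request_latency_seconds_count{url="https://10.0.0.1/api",verb="GET"} 3
controller_runtime_reconcile_total{controller="kustomization",result="success"} 7
process_cpu_seconds_total 1.5
"""

counts = series_per_metric(SAMPLE)
kept = drop_matching(counts, r"rest_client_request_latency_seconds_.*")
total_before = sum(counts.values())
total_after = sum(kept.values())

print("series before:", total_before)
print("series after dropping latency histogram:", total_after)
print("largest families:", counts.most_common(3))
```

Running the same tally on real output from each controller should show how much of the total is taken up by the latency buckets, since every distinct URL, verb and bucket boundary combination is a separate series.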

@stefanprodan
Member

stefanprodan commented Jun 4, 2021

These metrics come from controller-runtime, the Kubernetes SDK that we use to develop the GitOps toolkit controllers. Feel free to create Prometheus rules and drop things that you don't need, or open an issue on controller-runtime.

@paulfantom
Contributor Author

These metrics come from controller-runtime, the Kubernetes SDK that we use to develop the GitOps toolkit controllers.

Sorry, but that is only an excuse and not really a fix :) The issue is still present in Flux, even if it is caused by misbehavior in an upstream library.

For anyone who finds this issue in the future, here is a relabelling that removes all rest_client_request_latency_seconds_.* metrics (including the ones used in one, relatively meaningless, panel of the Flux dashboard): https://github.com/thaum-xyz/ankhmorpork/blob/d24dd02a77479c7884965a23b394a9ee86f279a4/base/flux-system/podmonitor.yaml#L26-L30

@selaux

selaux commented Oct 25, 2021

We also encountered this issue and had to disable Prometheus scraping for Flux, as the costs were not justifiable. It has been fixed in controller-runtime version 0.10.0 and later. Any chance we will get an update?

@stefanprodan
Member

stefanprodan commented Oct 25, 2021

We are rolling out the update to all Flux controllers; in the latest release some of them are already on controller-runtime v0.10.2. Once all of them are updated, I will close this issue.

@stefanprodan
Member

As of Flux 0.24.0, all controllers have been updated to controller-runtime v0.10, so this issue is finally fixed.

Now we need to remove the graph using rest_client_request_latency_seconds from our Grafana dashboard.
