Unbundle Prometheus and Grafana #3406

Closed
andrew-waters opened this issue Sep 9, 2019 · 8 comments

Comments

@andrew-waters

andrew-waters commented Sep 9, 2019

Feature Request

What problem are you trying to solve?

linkerd2 deployments come with Grafana and Prometheus bundled in. This is great for single-cluster setups and PoCs, but it costs resources when an existing cluster already runs Prometheus. The documentation on exporting metrics is useful, and for the specific use case of running Thanos, the service monitor is the preferable route (the purpose of Thanos is to avoid federation, which has legitimate issues).

Below is an approximate topology of a multi cluster setup running thanos (as close as possible to the linkerd2 supplied one).

[Diagram: linkerd (1)]

As you can see it should in theory be possible to remove prometheus and grafana from the linkerd deployment and maintain metric collection.

However, removing them causes errors with linkerd check and linkerd dashboard (both appear to check that Prometheus and Grafana exist).

How should the problem be solved?

I'm not entirely sure of the scope of changes that are required, but adding a flag to the commands that would otherwise fail may be a useful starting point for discussion:

linkerd check --exclude=prometheus,grafana
linkerd install --exclude=prometheus,grafana
linkerd dashboard

Note that linkerd dashboard fails when Prometheus isn't available, which suggests that some validation logic is shared between the CLI commands.
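
To make the proposal concrete, here is a minimal Python sketch of how a comma-separated `--exclude` flag could skip named checks. Everything in it is hypothetical: the check names, the `run_checks` and `parse_exclude` helpers, and the flag itself are illustrations only and do not reflect how `linkerd check` is actually implemented.

```python
import argparse

# Hypothetical check names and results; the real `linkerd check` categories differ.
CHECKS = {
    "kubernetes-api": lambda: True,
    "control-plane": lambda: True,
    "prometheus": lambda: False,  # pretend Prometheus is not installed
    "grafana": lambda: False,     # pretend Grafana is not installed
}


def parse_exclude(value):
    """Split a comma-separated list such as 'prometheus,grafana' into a set."""
    return {name.strip() for name in value.split(",") if name.strip()}


def run_checks(exclude):
    """Run every check whose name is not in `exclude`; return {name: passed}."""
    unknown = set(exclude) - set(CHECKS)
    if unknown:
        raise ValueError(f"unknown checks: {sorted(unknown)}")
    return {name: fn() for name, fn in CHECKS.items() if name not in exclude}


def main(argv=None):
    """Parse --exclude, run the remaining checks, return a process exit code."""
    parser = argparse.ArgumentParser(prog="check-sketch")
    parser.add_argument("--exclude", type=parse_exclude, default=set())
    args = parser.parse_args(argv)
    results = run_checks(args.exclude)
    for name, ok in results.items():
        print(f"{name}: {'ok' if ok else 'FAILED'}")
    return 0 if all(results.values()) else 1
```

Calling `main(["--exclude=prometheus,grafana"])` skips the two simulated failures, while `main([])` runs all four checks. Rejecting unknown names is one possible design choice; it keeps a typo from silently disabling nothing.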

Any alternatives you've considered?

The only alternatives are to:

  • ignore this issue and force users to install prometheus and grafana
  • avoid using the CLI

How would users interact with this feature?

Referenced above in bash scripts.


Edit: updated diagram to have a white background for legibility

@grampelberg
Contributor

The bundled prometheus is pretty tuned just for Linkerd. We're doing enough things there and with the CLI/dashboard that a central system would likely crash, especially if retention is over 6 hours. Assuming that cost can be kept in check, it feels like an important piece of the puzzle.

Grafana, on the other hand, is 100% optional. There's some ongoing work around configuration that'll make it optional and not break when it isn't installed (dashboard links for example).

@andrew-waters
Author

That's interesting. Is linkerd using prometheus as its own storage? The most common (recommended) tsdb lifecycle in prometheus is 2 hours, so that shouldn't cause issues.

Being able to point dashboard links to a URL outside of the cluster would be helpful.

I still maintain that unbundling (optional) is valid if it's explicit what the ramifications may be.

@grampelberg
Contributor

> Is linkerd using prometheus as its own storage?

Linkerd doesn't do any storage itself. So, it is either prometheus for metrics or k8s for cluster state and configuration.

> is valid if it's explicit what the ramifications may be.

You're totally right, at least some documentation on what happens would be helpful.

@masterkain

Can someone kindly explain the relation between linkerd2-prometheus and this: https://github.com/weaveworks/flagger/blob/master/docs/gitbook/how-it-works.md#http-metrics ?

I'm asking because request-success-rate should be a prometheus metric, but it's not being picked up. I'm unsure whether I have to install prometheus or can reuse the linkerd one.

@grampelberg
Contributor

Flagger uses the linkerd prometheus.
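
As a rough illustration of what querying the Linkerd Prometheus for a success rate could look like, the Python sketch below builds a PromQL expression and an instant-query URL for the Prometheus HTTP API (`/api/v1/query`). The service address, the `response_total` metric, and the label names are assumptions about a typical Linkerd install; the exact query Flagger issues may differ, so treat this as a sketch rather than Flagger's implementation.

```python
from urllib.parse import urlencode

# Assumed in-cluster address of the bundled Prometheus; adjust for your install.
PROMETHEUS_URL = "http://linkerd-prometheus.linkerd:9090"


def success_rate_query(namespace, deployment, window="1m"):
    """Build a PromQL expression for the percentage of non-failed inbound requests.

    Metric and label names are assumptions, not taken from the thread.
    """
    selector = (
        f'namespace="{namespace}",deployment="{deployment}",direction="inbound"'
    )
    ok = f'sum(rate(response_total{{{selector},classification!="failure"}}[{window}]))'
    total = f'sum(rate(response_total{{{selector}}}[{window}]))'
    return f"{ok} / {total} * 100"


def instant_query_url(expr, base_url=PROMETHEUS_URL):
    """Return the URL for a Prometheus instant query evaluating `expr`."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"
```

Fetching the URL returned by `instant_query_url(success_rate_query("my-namespace", "my-deployment"))` from inside the cluster (the namespace and deployment names here are placeholders) is one way to see whether the metric is being scraped at all before debugging the canary analysis itself.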

@masterkain

masterkain commented Nov 1, 2019

> Flagger uses the linkerd prometheus.

Thanks. Does the nginx ingress need to be meshed too? When I do that, it does not work anymore (502).

I'm trying to make canary releases work, but I have no idea how flagger queries linkerd-prometheus for these metrics during a deployment. I know this may not be the right place, but I'm stuck; any advice would be appreciated: https://gist.github.com/masterkain/75c26bf239ad08400ac40c0a45714b28 I tested from inside the load balancer pod with hey -z 2m -q 10 -c 2 http://bstore-stag-puma:3000/elb-status and packets are being sent OK.

@grampelberg
Contributor

Why don't you jump into slack. It will be easier to help you there. Also:

@Pothulapati
Contributor

This is fixed now. Both Grafana and Prometheus are enabled by default but are now optional. Check the configuration fields to see how to disable them during install.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 16, 2021

4 participants