
Bring your own prometheus #16

Merged · 11 commits · merged May 18, 2020
Conversation

@Pothulapati (Contributor) commented Apr 15, 2020

Rendered: https://github.com/Pothulapati/rfc/blob/prom/design/0003-byop.md

Bring your own prometheus

This RFC aims to move prometheus to an add-on, i.e. not mandated but advised.
This includes moving the linkerd-prometheus manifests into a sub-chart and
allowing more configuration (more on this later), along with documentation
and tooling for using Linkerd with an existing prometheus installation.

Problem Statement

Lots of users already have an existing prometheus installation in their
cluster and don't like managing another prometheus instance, especially
because prometheus can be resource-intensive and painful to operate. They usually
pair their prometheus installation with a long-term monitoring
solution like Cortex or InfluxDB, or a third-party solution like Grafana
Cloud. In those cases prometheus is just a temporary component that
keeps pushing metrics into their long-term solution, making users want to
re-use their existing prometheus.

The following is the list of problems that we plan to solve:

  • Hard requirement of linkerd-prometheus.

  • Configuring the current linkerd-prometheus for remote-writes, etc. is
    hard, because the prometheus configmap is overridden after upgrades.

  • Not enough documentation and tooling for bringing your own prometheus
    use-case.

Once the RFC is implemented, users should not only be able to use their
existing prometheus, but configuring the linkerd-prometheus for
remote-writes, etc. should also be easy.

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

Once the RFC is implemented, apart from the default prometheus-enabled installation, users should be
able to use their existing prometheus instance to scrape metrics from the proxy

Contributor:

Would you mind outlining the UX side of this a little? We want to make sure that none of the existing linkerd tooling breaks (instead using their prometheus for it all).

Contributor Author:

I added

### User Experience

Currently, `public-api`, `grafana` and `heartbeat` are the only three
components which are configured with a `prometheus-url` parameter, and
all the dashboards and CLI are powered through the `public-api`.
This allows us to swap prometheus without breaking any existing linkerd
tooling, though it is expected that the `scrape-config` is configured
correctly when this path is chosen.
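
(For concreteness, a minimal sketch of what the swap amounts to on the `public-api` side, assuming an external prometheus at an illustrative in-cluster address; `-prometheus-url` is the existing parameter mentioned above:)

```yaml
# sketch: public-api container pointed at an external prometheus instead of
# linkerd-prometheus; the service address here is illustrative
containers:
  - name: public-api
    args:
      - public-api
      - -prometheus-url=http://prometheus.monitoring.svc.cluster.local:9090
```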

Are you looking for anything more?

Member:

During implementation, I think we should audit all Prometheus queries from Linkerd, to ensure they are scoped to job="linkerd-proxy", or similar. We don't want Linkerd accidentally querying for non-Linkerd metrics.
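
(To make the scoping concrete, a hedged sketch of such a scoped query, written here as a recording rule; the rule name and label grouping are illustrative, and `response_total` is one of the proxy's metrics:)

```yaml
# sketch: every expression carries the job selector so a shared prometheus
# never feeds non-Linkerd series into Linkerd's tooling
groups:
  - name: linkerd-example
    rules:
      - record: deployment:response_total:rate1m
        expr: sum(rate(response_total{job="linkerd-proxy"}[1m])) by (deployment)
```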

Contributor Author:

Hmm, that makes sense. I will definitely check whether there are any places where we use metrics scraped outside the scrape-config provided in the documentation, and raise issues (if there are any).

By moving Prometheus to an add-on, we can make it an optional component that can be used both through Helm and the Linkerd CLI.
The add-on model would work well with upgrades, etc. as per [#3955](https://github.com/linkerd/linkerd2/pull/3955). This is done by
moving the prometheus components into a separate sub-chart in the linkerd repo
and updating the code. The same health checks are also expected, but as a separate
Contributor:

I think we're gonna need a different set of health checks for BYO prometheus.

Contributor Author:

Yep, there's a section later in the doc about Linkerd check tooling. This was about having the current linkerd checks that we do for control-plane components apply to the prometheus add-on too.

name: linkerd-prometheus
url:
image:
global:
Contributor:

Isn't it possible to load all these from files? Then folks could just manage configmaps and not actually change the chart in any way.

Member:

I think the first step is to extract our current Prometheus configmap into a separate yaml template (currently it lives together with the prom service and deployment). That configmap is currently being reworked in linkerd/linkerd2#4220, which adds the ability to plug in recording and alerting rules through separate configmaps (no other prom config items are pluggable like that, from what I can tell).
Then this new Prometheus add-on values.yaml could just have a few minimal entries like enabled, image, etc., and the new entries from that PR to configure rules. We could also have an entry to entirely override the provided Prometheus configmap with an external one, in which case the user would need to be advised to follow the scrape configs from the default configmap.
Makes sense?
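
(A minimal sketch of what that add-on values.yaml could look like; the entry names are illustrative rather than a final schema:)

```yaml
# sketch of a minimal prometheus add-on values.yaml; names are illustrative
prometheus:
  enabled: true
  image: prom/prometheus:v2.15.2
  # external configmaps carrying recording/alerting rules, per linkerd/linkerd2#4220
  ruleConfigMapMounts: []
  # if set, replaces the chart-provided configmap entirely; the user is then
  # expected to carry the default scrape configs themselves
  configMapOverride: ""
```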

Contributor Author:

As @alpeb mentioned, I integrated stuff from linkerd/linkerd2#4220 into this prometheus field. It doesn't help a lot, as only rules, etc. can be added as files; other configuration like remote_write, alerting, etc. feels better taken directly.

Then this new Prometheus addon values.yaml could just have a few minimal entries like enabled, image, etc, and the new entries from that PR to configure rules.

Users can always choose to ignore fields they don't want to fill!

We could also have an entry to entirely override the provided Prometheus configmap with an external one, in which case the user would need to be advised to follow the scrape configs from the default configmap.

This is interesting, but it essentially feels like the BYOP case, and it adds complication in maintaining the linkerd-addons configmap.


## Using Existing Prometheus

First, documentation will be provided at `linkerd.io` to guide users on adding
Contributor:

I'm not sure it is in-scope for this, but it would be awesome if we could make the scrape config update automatically, like we've done for the CLI args and help.

Contributor Author:

From what I've discussed with @alpeb, this was definitely considered in-scope for the problem. We could definitely write some CLI tooling for supporting, or even automatically updating, the prom config with the current scrape-config, but having documentation on what the current scrape-config is, how to add it, etc. should be good, right?

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
| `prometheus.enabled` | `prometheus.url` | behavior |
|---------|------------------|----------|
| true | "" | Install linkerd-prometheus, and configure control-plane components to use it. |
| false | "prometheus.xyz" | No linkerd-prometheus; control-plane components will be configured to use the url. |
| true | "prometheus.xyz" | Install linkerd-prometheus, but still configure the control plane to use a different url. Useful when linkerd-prometheus is configured to remote-write into a long-term storage solution at that url, thus powering the dashboard and CLI through long-term storage. |
Member:

This is really interesting. Would it make sense to have some validation in this case ensuring that remote-write is setup?

Contributor Author:

Yep, that makes sense. Should we consider it in the hot path, i.e. make the public-api perform the check when initializing, or have Linkerd check tooling that checks this?

Member:

That can be checked at install/upgrade time by adding a Helm validation like what we do in identity.yaml for ensuring a cert is provided. I can't recall if that also gets applied during CLI install/upgrade; if not, then that'd be something to consider in the CLI scripts.
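
(A minimal sketch of such a validation, assuming illustrative `prometheus.enabled` and `global.prometheusUrl` value names; Helm's `fail` aborts rendering, much like the identity.yaml cert check:)

```yaml
{{/* sketch: refuse to render when neither linkerd-prometheus nor an
     external prometheus url is configured; value names are illustrative */}}
{{- if and (not .Values.prometheus.enabled) (not .Values.global.prometheusUrl) }}
{{- fail "either enable the prometheus add-on or set global.prometheusUrl" }}
{{- end }}
```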

Comment on lines 121 to 126
- Check if the required scrape-configs are present in the prometheus cm.
  (It should be noted that the prometheus could be present in a different
  namespace, etc.)

- Check if the common metrics are queryable in prometheus, by using the
  end-point, i.e. `prometheus.url` from the `linkerd-values` cm.
Member:

A downside of trying to check the configmap is that we might not know which one it is, nor have access to it.

A downside of querying prometheus is that it might not give us enough info to verify the latest scrape config has been adopted (some subtle scrape config changes might not be reflected easily as a query result). I wonder if we could add some sort of version number in the scrape config that could be reflected as a query result...

Contributor Author:

A downside of trying to check the configmap is that we might not know which one it is, nor have access to it.

Hmmm, we can mitigate this by making the user pass the configmap name to the CLI. There will still be the access problem, but the upside is that we could also have a command which tries to update the scrape config to the current version's scrape config.

I wonder if we could add some sort of version number in the scrape config that could be reflected as a query result...

This is definitely interesting, but from what I searched, I couldn't find a way to add a param into the scrape-config. Could we instead have a metric for the scrape-config version? If yes, which component would be the right candidate to check and add that metric? Also, I'm wondering if there could be inconsistencies between the actual config and this metric.
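
(One mechanism that might work, sketched here: a `relabel_configs` entry with only a `target_label` and a `replacement` stamps a constant label onto every series the job scrapes, so the version surfaces through any query, e.g. `count by (scrape_config_version) (up{job="linkerd-proxy"})`; label name and value are illustrative:)

```yaml
# sketch: stamp the scrape-config version onto every scraped series
scrape_configs:
  - job_name: linkerd-proxy
    relabel_configs:
      - target_label: scrape_config_version   # illustrative label name
        replacement: "v1"                     # bump on every scrape-config change
```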

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
@Pothulapati (Contributor Author):

A POC of this RFC is available here, just to make sure there are no blockers.

@siggy (Member) left a comment:

Overall looks great!


Comment on lines 93 to 97
First, documentation will be provided at `linkerd.io` to guide users on
adding the `scrape-configs` required to get the metrics from linkerd
proxies, control plane components, etc. This can be found
[here](https://github.com/linkerd/linkerd2/blob/master/charts/linkerd2/templates/prometheus.yaml#L26).
This can also be automated, like we already do with Linkerd CLI help and args.
Member:

I agree some automation via CLI is useful. Some kind of CLI command that outputs the Prometheus ConfigMap necessary to scrape Linkerd proxies. We don't change that scrape config often, but we'll want to ensure that the ConfigMap provided to users matches their Linkerd installation.

Additionally, we should have some documentation on the storage requirements for Linkerd metrics in Prometheus. We work around this today by having a 6h retention time, but this could become a significant issue for existing Prometheus installations with longer retention times. This could motivate customizations to the scrape config (like scrape interval changes) to help decrease storage requirements.
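
(For illustration, a trimmed sketch of the proxy scrape job with a relaxed interval; the keep rule follows the shape of the chart's default scrape config, and the 30s figure is just an example:)

```yaml
# sketch: scrape linkerd proxies less often to slow storage growth
scrape_configs:
  - job_name: linkerd-proxy
    scrape_interval: 30s   # illustrative; trades resolution for storage
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # keep only pods that actually run the linkerd-proxy container
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: keep
        regex: ^linkerd-proxy$
```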

Contributor Author:

Added this. On the additional documentation part, I definitely think we have some stuff in mind around retention times, etc., and more can be gathered as we understand the problems users hit when they use their existing prometheus.

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
@alpeb (Member) left a comment:

Thanks @Pothulapati , I think this covers all we have in mind, and should be enough to start implementation 👍

@admc admc requested review from adleong, grampelberg and olix0r May 6, 2020 16:07
@grampelberg (Contributor) left a comment:

Ship it!

have a prometheus installation paired up with a long-term monitoring
solution like Cortex or InfluxDB, or a third-party solution like Grafana
Cloud. In those cases prometheus is just a temporary component that
keeps pushing metrics into their long-term solution, making users want to
Member:

I'm not sure I really understand what users don't like about the existence of more than one Prometheus in their cluster. What is the actual downside to having linkerd-prometheus as a temporary store for metrics before they are pulled into long-term storage via federation or remote-writes? Does the linkerd-prometheus have a high resource cost? Can we quantify this? Is there some greater operational burden to having the linkerd-prometheus exist? What is that burden?

@Pothulapati (Contributor Author) commented May 7, 2020:

Yep, linkerd-prometheus isn't any different from a normal installation, and doesn't get any specific tweaks for Linkerd either. I definitely think users should be using the linkerd-prometheus, as it doesn't do anything different, and the operational burden shouldn't matter that much if there is already a Prometheus and a long-term solution. We will be suggesting the same in the documentation.

On the resource cost, we have the following in HA mode:

prometheusResources:
  cpu:
    limit: "4"
    request: 300m
  memory:
    limit: 8192Mi
    request: 300Mi

This isn't a lot, but we have definitely seen cases where prometheus was being OOM-killed, etc. I don't think this will be any better in the BYOP case, and the problem could be further amplified.

I think it's just one of those obvious questions that users stumble upon initially ("Can I use my existing prometheus?"), but as they start using it, having a separate one for Linkerd would later make sense to them.

IMO the main value of this RFC is separating prometheus out from the control-plane components in charts and configuration, giving it a life-cycle of its own for updating configuration, etc. BYOP is just another benefit that we get out of this. WDYT?


The following is the list of problems that we plan to solve:

- Hard requirement of linkerd-prometheus.
Member:

What's the actual problem that this solves, though? Why is a hard requirement of linkerd-prometheus bad?

Contributor Author:

Technically this is a very good decision, as there is a clear separation of concerns here, but a lot of people consider it a shortcoming and have been asking for the bring-your-own-prometheus use-case.


- Hard requirement of linkerd-prometheus.

- Configuring the current linkerd-prometheus for remote-writes, etc is
Member:

👍

- Configuring the current linkerd-prometheus for remote-writes, etc. is
hard, because the prometheus configmap is overridden after upgrades.

- Not enough documentation and tooling for bringing your own prometheus
Member:

I think it's very important that we give a clear rubric in the documentation that allows users to figure out if BYO prometheus is something that they should do. I think that for the (vast?) majority of users, even users that are already running a prometheus, using linkerd-prometheus in conjunction with their own prometheus is the right thing to do.

Contributor Author:

Yep, totally, and we will definitely have to highlight this in the documentation of the BYOP case.

have the relevant scrape-configs and default configuration.


### Prior art
Member:

It would be helpful to describe alternative solutions that we considered and explored and why they were not suitable.

For example:

  • aggressive tuning of linkerd-prometheus
  • exposing more tuning knobs of linkerd-prometheus
  • dropping labels

Contributor Author:

exposing more tuning knobs of linkerd-prometheus

This RFC addresses that by exposing the global field at `prometheus.global`, where most tuning around retention period, etc. can be configured.
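
(A sketch of how that could surface in values, assuming `prometheus.global` maps onto Prometheus's own `global:` config block and retention is an add-on-level knob; all names are illustrative:)

```yaml
# sketch: tuning knobs exposed through the add-on values; names illustrative
prometheus:
  enabled: true
  retention: 6h            # illustrative knob, passed to prometheus as a flag
  global:                  # mapped onto prometheus's own `global:` config block
    scrape_interval: 10s
    evaluation_interval: 10s
```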

All the tuning stuff, while very helpful for making the linkerd-prometheus more stable and manageable and for avoiding situations like OOM kills when there are sudden spikes in queries, still won't address the user's question of bringing their own prometheus, which we definitely heard a lot! (definitely not what we'd suggest, though)

I added this suggestion. Feel free to comment if you would like me to add anything!

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
@adleong (Member) left a comment:

I understand the motivation and goals for this project and I think everyone involved does as well, but I want to call out that I don't think the problem statement in this RFC accurately reflects that. I don't know if that should be a blocker to this moving forward or not.

The design plan looks great to me.


## Problem Statement (Step 1)

[problem-statement]: #problem-statement
Member:

I think this problem statement misses the point, at least as far as I understand it.

Having an existing prometheus installation is not, on its own, a good reason to omit linkerd-prometheus. Furthermore, in the vast majority of cases, there is no significant resource or operational cost to running linkerd-prometheus. This is true when linkerd-prometheus is used on its own for short-term retention and when linkerd-prometheus is paired with another solution for long-term storage.

As far as I'm aware the only good reason for bringing your own prometheus is if you operate at a large scale or have very specific prometheus configuration requirements and need to do extensive tuning or configuration of prometheus.
