# Bring your own Prometheus

- @pothulapati
- Start Date: 2020-04-15
- Target Date:
- RFC PR: [linkerd/rfc#16](https://github.com/linkerd/rfc/pull/16)
- Linkerd Issue: [linkerd/linkerd2#3590](https://github.com/linkerd/linkerd2/issues/3590)
- Reviewers: @grampelberg @alpeb @siggy

## Summary

[summary]: #summary

This RFC aims to move Prometheus to an add-on, i.e. not mandated but
recommended. This includes moving the linkerd-prometheus manifests into a
sub-chart and allowing more configuration (more on this later), along with
documentation and tooling for using Linkerd with an existing Prometheus
installation.
## Problem Statement (Step 1)

[problem-statement]: #problem-statement

Many users already have an existing Prometheus installation in their
cluster and don't want to manage another Prometheus instance, especially
because additional instances can be resource intensive and painful to
operate. They usually pair their Prometheus installation with a long-term
monitoring solution like Cortex or InfluxDB, or a third-party solution like
Grafana Cloud. In those cases Prometheus is just a temporary component that
keeps pushing metrics into their long-term solution, making users want to
re-use their existing Prometheus.

The following is the list of problems that we plan to solve:

- The hard requirement on linkerd-prometheus.

- Configuring the current linkerd-prometheus for remote-writes, etc. is
  hard, because the Prometheus ConfigMap is overridden after upgrades.

- Not enough documentation and tooling for the bring-your-own-Prometheus
  use-case.

Once the RFC is implemented, users should not only be able to use their
existing Prometheus, but configuring linkerd-prometheus for remote-writes,
etc. should also be easy.

## Design proposal (Step 2)

[design-proposal]: #design-proposal

### Prometheus as an Add-On

By moving Prometheus to an add-on, we can make it an optional component
that can be used both through Helm and the Linkerd CLI. The add-on model
would work well with upgrades, etc., as per
[#3955](https://github.com/linkerd/linkerd2/pull/3955). This is done by
moving the Prometheus components into a separate sub-chart in the linkerd
repo and updating the code. The same health checks are also expected, but
as a separate check category, which would be transient, i.e. it runs only
when it recognises that the add-on is enabled.

By default, the prometheus add-on will be **enabled**.

This also provides a single place of configuration for all things related
to linkerd-prometheus. Though we are making linkerd-prometheus optional,
having it is great for Linkerd and external users, for reasons like
separation of concerns. By allowing `remote-write`, `alerts`, etc.
configuration in linkerd-prometheus, we would decrease the need for
external users to tie their existing Prometheus to Linkerd, as it would be
even easier for them to carry over `remote-write`, etc. configuration from
their existing Prometheus to the linkerd-prometheus right from the start.

The following default prometheus configuration is planned:
```yaml
prometheus:
  enabled: true
  name: linkerd-prometheus
  image:
  args:
  # - log.format: json
  global:
    <global-prom-config> https://prometheus.io/docs/prometheus/latest/configuration/configuration/#configuration-file
  remote-write:
    <remote-write-config> https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write
  alertManagers:
    <alert-manager-config> https://prometheus.io/docs/prometheus/latest/configuration/configuration/#alertmanager_config
  configMapMounts:
  # - name: alerting-rules
  #   subPath: alerting_rules.yml
  #   configMap: linkerd-prometheus-rules
```

All the current default configuration will be moved to `values.yaml` in the
above format, allowing users to override it through Helm or through the CLI
with `--addon-config`.
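
As a rough illustration (not a definitive schema), an override file passed
via Helm values or `--addon-config` might look like the sketch below. The
field names follow the proposed structure above, and the remote-write
endpoint is a made-up placeholder:

```yaml
# addon-config.yaml — hypothetical override, following the proposed schema
# above. The remote-write URL is a placeholder, not a real default.
prometheus:
  enabled: true
  args:
  - log.format: json
  remote-write:
  - url: http://cortex.example.com:9009/api/prom/push
```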

### Using Existing Prometheus

First, documentation will be provided at `linkerd.io` to guide users on
adding the `scrape-configs` required to get the metrics from the Linkerd
proxies, control-plane components, etc. The current configuration can be
found
[here](https://github.com/linkerd/linkerd2/blob/master/charts/linkerd2/templates/prometheus.yaml#L26).
Additional documentation can also be provided on the storage requirements
for Linkerd metrics in existing installations with longer retention times,
scrape-interval changes, etc. This can also be automated, as we already do
with Linkerd CLI help and args.
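
For orientation, the sketch below shows roughly what the proxy scrape job
in that template looks like; it is abridged, and the chart template linked
above remains the source of truth:

```yaml
scrape_configs:
- job_name: 'linkerd-proxy'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # Keep only pods that run an injected linkerd-proxy container and expose
  # the proxy's admin port, i.e. pods that belong to the mesh.
  - source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_container_port_name
    - __meta_kubernetes_pod_label_linkerd_io_control_plane_ns
    action: keep
    regex: ^linkerd-proxy;linkerd-admin;linkerd$
  # Map pod metadata onto the namespace/pod labels used by Linkerd queries.
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: pod
```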

Once this is done, users should be able to see the golden metrics, etc. in
their existing Prometheus instance. The `.Values.global.prometheusUrl`
field lets users point the Linkerd components that need Prometheus access,
i.e. `public-api`, etc., at the existing Prometheus installation.

`prometheusUrl` lives in `.Values.global` instead of `.Values.prometheus`
so that the field is accessible to other add-on components like `grafana`.
The `linkerd-config-addons` ConfigMap will also be updated to store the
same.

Flag interactions are defined below:

| `prometheus.enabled` | `global.prometheusUrl` | Result |
|----------------------|------------------------|--------|
| true  | ""               | Install linkerd-prometheus, and configure control-plane components to use it. |
| false | "prometheus.xyz" | No linkerd-prometheus; control-plane components will be configured to use the given URL. |
| true  | "prometheus.xyz" | Install linkerd-prometheus, but still configure the control plane to use the given URL. This is useful when linkerd-prometheus is configured to remote-write into a long-term storage solution at that URL, thus powering the dashboard and CLI through long-term storage. |
| false | ""               | No linkerd-prometheus, and no URL to power dashboards. (Should we even allow this?) |
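
For example, the second row of the table would correspond to a values
override along these lines; the in-cluster service address is a made-up
placeholder for an existing, user-managed Prometheus:

```yaml
# Hypothetical values for the bring-your-own-Prometheus case.
prometheus:
  enabled: false          # do not install linkerd-prometheus
global:
  # Point control-plane components at the existing Prometheus instance.
  prometheusUrl: http://prometheus.monitoring.svc.cluster.local:9090
```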
|
||
#### User Expereince | ||
|
||
Currently, `public-api`, `grafana` and `heartbeat` are the only three | ||
components which are configured with a `prometheus-url` parameter, | ||
All the Dashboards and CLI are powered through the `public-api`. | ||
This allows us to swap prometheus without breaking any exisiting linkerd | ||
tooling. Though it is expected that the `scrape-config` is configured | ||
correctly, when this path is choosen. | ||
|
||
If `prometheus-url` is not accessible, components like `public-api` should not fail | ||
but just log some warning. So, that users can still access things like TAP, Checks, etc through the dashboard and CLI. | ||

#### Linkerd CLI Tooling

When users bring their own Prometheus, it would be really useful to have
tooling that lets them check and troubleshoot any problems.

*linkerd check*

Checks can be added to report whether Prometheus is configured correctly
and is replying to queries:

- Check if the configured Prometheus instance is healthy.
- Check if the relevant metrics are present in Prometheus by querying it,
  and warn the user otherwise. Users don't have to supply the
  `prometheusUrl`, as it can be read from the `linkerd-config-addons`
  ConfigMap.
- More checks can be added here if there are known configurations that
  don't work well with Linkerd.

*linkerd prom-config*

An additional command can be provided in the CLI that outputs the
Prometheus config corresponding to that version of Linkerd, containing the
relevant scrape-configs and default configuration.

### Prior art

[prior-art]: #prior-art

- [Prometheus Operator](https://github.com/coreos/prometheus-operator) is
  an interesting tool that could remove the problem of maintaining
  configuration across upgrades from Linkerd, but it comes with its own
  controller and CRDs.
- Specific to resource costs, we can consider things like tuning
  Prometheus, dropping labels, etc. to make it even lighter, but there will
  still be some users who want to use their existing Prometheus.
  Irrespective of this solution, we should try to keep linkerd-prometheus
  light and provide enough configuration knobs to make it easy to manage.
  This could be addressed separately, and
  [rfc/21](https://github.com/linkerd/rfc/pull/21) has some similarities.
- Istio's [Bring your own Prometheus example](https://istiobyexample.dev/prometheus/).
  A similar Helm-based sub-chart model is used, but this was with Mixer v1.
- We already have docs for
  [exporting metrics](https://linkerd.io/2/tasks/exporting-metrics/).
  We could update those docs with remote-write, etc.

### Unresolved questions

[unresolved-questions]: #unresolved-questions

- Should we support all Prometheus configs in `linkerd-prometheus`?

- How do we keep track of the metrics that the proxies expose, to have
  consistent check tooling?

- Most third-party storage solutions have some auth mechanism; should we
  support that? IMO, we shouldn't focus on this in v1, but as folks start
  to ask for it we can consider handling that use case.

- Should we separate out each individual field of `prometheus`, i.e.
  `remoteWrite`, etc., into separate ConfigMaps?
### Future possibilities | ||
|
||
[future-possibilities]: #future-possibilities | ||
|
||
In this proposal, we separate the query path with the linkerd-promethues, | ||
Thus allowing users to use any third party long term storage solution that | ||
follow the prometheus Query API to power Dashboards and CLI. | ||
|
||
Also, The same `bring your own prometheus` document can be re-used for | ||
solutions like [grafana cloud agent](https://grafana.com/blog/2020/03/18/introducing-grafana-cloud-agent-a-remote_write-focused-prometheus-agent-that-can-save-40-on-memory-usage/) | ||
that follow the same prometheus scrape-config. |

> **Reviewer comment (on the problem statement):** I think this problem
> statement misses the point, at least as far as I understand it. Having an
> existing Prometheus installation is not, on its own, a good reason to omit
> linkerd-prometheus. Furthermore, in the vast majority of cases, there is
> no significant resource or operational cost to running linkerd-prometheus.
> This is true when linkerd-prometheus is used on its own for short-term
> retention and when linkerd-prometheus is paired with another solution for
> long-term storage. As far as I'm aware, the only good reason for bringing
> your own Prometheus is if you operate at a large scale or have very
> specific Prometheus configuration requirements and need to do extensive
> tuning or configuration of Prometheus.