Bring your own prometheus #16
I think this problem statement misses the point, at least as far as I understand it.
Having an existing prometheus installation is not, on its own, a good reason to omit linkerd-prometheus. Furthermore, in the vast majority of cases, there is no significant resource or operational cost to running linkerd-prometheus. This is true when linkerd-prometheus is used on its own for short-term retention and when linkerd-prometheus is paired with another solution for long-term storage.
As far as I'm aware, the only good reason for bringing your own Prometheus is operating at a very large scale, or having configuration requirements specific enough to need extensive tuning of Prometheus itself.
I'm not sure I really understand what users don't like about having more than one Prometheus in their cluster. What is the actual downside to having linkerd-prometheus as temporary storage for metrics before they are pulled into long-term storage via federation or remote writes? Does linkerd-prometheus have a high resource cost? Can we quantify this? Is there some greater operational burden to having linkerd-prometheus exist? What is that burden?
Yep, linkerd-prometheus isn't any different from a normal installation, and doesn't get any Linkerd-specific tweaks either. I definitely think users should be using linkerd-prometheus, as it doesn't do anything different, and the operational burden shouldn't matter much if there is already a Prometheus and a long-term storage solution. We will be suggesting the same in the documentation.

On the resource cost, we set specific requests and limits in `ha` mode. That isn't a lot, but we have definitely seen cases where Prometheus was getting OOM-killed, etc. I don't think this would be any better in the BYOP case, and the problem could be further amplified.

I think it's just one of those obvious questions that users stumble on initially ("Can I use my existing Prometheus?"), but once they start using Linkerd it makes sense to have a separate one for it.

IMO the main value of this RFC is separating Prometheus out from the control-plane components in the charts and configuration, and giving it a life-cycle of its own for updating configuration, etc. BYOP is just another benefit we get out of this. WDYT?
What's the actual problem that this solves, though? Why is a hard requirement on linkerd-prometheus bad?
Technically this is a very good decision, as there is a clear separation of concerns here, but a lot of people consider it a shortcoming and have been asking for the bring-your-own-Prometheus use case.
👍
I think it's very important that we give a clear rubric in the documentation that allows users to figure out if BYO prometheus is something that they should do. I think that for the (vast?) majority of users, even users that are already running a prometheus, using linkerd-prometheus in conjunction with their own prometheus is the right thing to do.
Yep, totally. We will definitely have to highlight this in the documentation for the BYOP case.
Would you mind outlining the UX side of this a little? We want to make sure that none of the existing Linkerd tooling breaks (it should simply use their Prometheus for everything instead).
I added that. Are you looking for anything more?
During implementation, I think we should audit all Prometheus queries from Linkerd to ensure they are scoped to `job="linkerd-proxy"` or similar. We don't want Linkerd accidentally querying for non-Linkerd metrics.
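For illustration, a rough sketch of what a properly scoped expression could look like, written here as a recording rule (the metric, label, and rule names are hypothetical, not Linkerd's actual queries; the point is the `job="linkerd-proxy"` selector):

```yaml
# Hypothetical rule -- names are illustrative only; any ad-hoc query Linkerd
# issues should carry the same job selector so it only touches proxy series.
groups:
  - name: linkerd-proxy-scoped-example
    rules:
      - record: namespace:request_total:rate1m
        expr: |
          sum by (namespace) (
            rate(request_total{job="linkerd-proxy", direction="inbound"}[1m])
          )
```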
Hmm, that makes sense. I will definitely check whether there are any places where we use metrics scraped from things outside the scrape config provided in the documentation, and raise issues if there are any.
Isn't it possible to load all these from files? Then folks could just manage configmaps and not actually change the chart in any way.
I think the first step is to extract our current Prometheus ConfigMap into a separate yaml template (currently it lives together with the prom service and deployment). That ConfigMap is currently being reworked in linkerd/linkerd2#4220, which adds the ability to plug in recording and alerting rules through separate ConfigMaps (no other prom config items are pluggable like that, from what I can tell).

Then this new Prometheus add-on `values.yaml` could just have a few minimal entries like `enabled`, `image`, etc., plus the new entries from that PR to configure rules. We could also have an entry to entirely override the provided Prometheus ConfigMap with an external one, in which case the user would need to be advised to follow the scrape configs from the default ConfigMap.

Makes sense?
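Something along these lines, purely as a sketch (field names are illustrative, not a final schema):

```yaml
# Hypothetical add-on values -- names are illustrative only.
prometheus:
  enabled: true
  image: prom/prometheus:v2.15.2          # illustrative tag
  # Extra recording/alerting rules mounted from user-managed ConfigMaps,
  # as proposed in linkerd/linkerd2#4220.
  ruleConfigMapMounts: []
  # Optional: replace the generated Prometheus ConfigMap entirely; if set,
  # the user must keep the documented linkerd-proxy scrape configs.
  configMapOverride: ""
```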
As @alpeb mentioned, I integrated stuff from linkerd/linkerd2#4220 into this `prometheus` field. It doesn't help a lot, as only rules, etc. can be added as files; other configuration like `remote_write`, `alerting`, etc. feels better taken directly. Users can always choose to ignore fields they don't want to fill!

This is interesting, but it essentially feels like a BYOP case? And it complicates maintaining the linkerd-addons ConfigMap.
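For example, a rough sketch of taking those items directly in the `prometheus` field (keys are illustrative and deliberately mirror Prometheus's own configuration):

```yaml
# Hypothetical shape -- keys mirror Prometheus's own config blocks.
prometheus:
  enabled: true
  remoteWrite:
    - url: http://cortex.example.com/api/prom/push      # hypothetical endpoint
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager.monitoring.svc:9093"]  # hypothetical target
  ruleConfigMapMounts: []   # rules can still be plugged in as files, per #4220
```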
I agree some automation via CLI is useful: some kind of CLI command that outputs the Prometheus ConfigMap necessary to scrape Linkerd proxies. We don't change that scrape config often, but we'll want to ensure that the ConfigMap provided to users matches their Linkerd installation.

Additionally, we should have some documentation on the storage requirements for Linkerd metrics in Prometheus. We work around this today with a 6h retention time, but it could become a significant issue for existing Prometheus installations with longer retention times. That could motivate customizations to the scrape config (like scrape interval changes) to help decrease storage requirements.
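As a sketch, the kind of scrape job such a command might emit, with a longer interval to reduce storage cost (only the structure is shown and relabelings are abbreviated; the authoritative config would be generated to match the installed Linkerd version):

```yaml
# Hypothetical output -- structure only, not the exact generated config.
scrape_configs:
  - job_name: linkerd-proxy
    scrape_interval: 30s            # raised from the default to save storage
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only the linkerd-proxy admin port of meshed pods.
      - source_labels:
          - __meta_kubernetes_pod_container_name
          - __meta_kubernetes_pod_container_port_name
        action: keep
        regex: linkerd-proxy;linkerd-admin
```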
Added this. On the additional documentation part, we definitely have some things in mind around retention times, etc., and more can be gathered as we understand the problems users hit when using their existing Prometheus.
This is really interesting. Would it make sense to have some validation in this case ensuring that remote-write is set up?
Yep, that makes sense. Should we consider doing it in the hot path, i.e. have the public-api perform the check when initializing, or have the `linkerd check` tooling check this?
That can be checked at install/upgrade time by adding a Helm validation, like what we do in `identity.yaml` to ensure a cert is provided. I can't recall if that also gets applied during CLI install/upgrade; if not, that'd be something to consider in the CLI scripts.
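A rough sketch of such a guard, in the style of the cert validation in `identity.yaml` (value names and the error message are hypothetical):

```yaml
{{- /* Hypothetical guard -- all value names here are illustrative. */}}
{{- if .Values.prometheus.enabled }}
{{- if not .Values.prometheus.remoteWrite }}
{{- fail "prometheus.remoteWrite must be set so metrics reach long-term storage before the short local retention expires" }}
{{- end }}
{{- end }}
```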
It would be helpful to describe alternative solutions that we considered and explored, and why they were not suitable. For example:
This RFC addresses that by exposing the `global` field at `prometheus.global`, where most tuning around retention period, etc. can be configured.

All that tuning, while very helpful for making linkerd-prometheus more stable and manageable and for avoiding situations like OOMs when there are sudden spikes in queries, still won't address the users' ask of using their own Prometheus, which we have definitely heard a lot (even though it's not what we'd recommend).
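As a sketch, that tuning surface could look something like this (keys under `global` mirror Prometheus's own global config; the retention entry is illustrative, since retention is actually a CLI flag rather than part of `global`):

```yaml
# Hypothetical values -- for illustration only.
prometheus:
  enabled: true
  global:
    scrape_interval: 30s      # passed through to Prometheus's `global` block
    scrape_timeout: 10s
  retentionTime: 6h           # would map to --storage.tsdb.retention.time
```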
I added this suggestion. Feel free to comment if you would like me to add anything!