docs/philosophy: Initial commit #1054
base: main
Changes from 2 commits
5292ccc
102961b
b61084f
ac67a8c
63ce534
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,78 @@ | ||
--- | ||
title: Goals and Non-Goals | ||
sort_rank: 3 | ||
--- | ||
|
||
# Goals | ||
|
||
## Resilience | ||
|
||
First and foremost, Prometheus must be resilient in operation. | ||
|
||
|
||
## Reliable alerting | ||
|
||
As a monitoring system, Prometheus is being relied upon to alert humans that | ||
they need to take action in order to prevent undesired system state. | ||
|
||
Thus, its most important function is to keep the pipeline of ingestion, rule | ||
evaluation, and alert notifications working. | ||
|
||
The second most important function is to give humans context about these alerts | ||
by allowing access to the most recent data Prometheus ingested | ||
Review comment: Missing full stop. I'd add something around data that's good enough to make useful engineering decisions, rather than 100% perfect data.
||
|
||
### Resulting design decisions and patterns | ||
|
||
Note that this goal might result in widely different design decisions and thus | ||
operational patterns for different parts of our ecosystem: | ||
|
||
For Prometheus itself, this means running every instance as an island of data | ||
completely detached from every other instance. | ||
|
||
For Alertmanager on the other hand, it means the exact opposite: meshing all | ||
instances closely together, sharing knowledge about alerts and their | ||
Review comment: This is too vague. The important point is that it's AP rather than CP
Review comment: Please expand those two.
Review comment: That didn't click in this context, but yes, we can simply refer to Brewer. It might lead to quite a bit more explanation around why we chose which. I would then be tempted to pull this in the design decisions simply link to each other; if linking at all.
||
notifications. | ||
Review comment: I would not say that they are closely meshed, just that they are meshed, as they can lose their mesh by design. I would also then specify what happens when they can't mesh.
||
|
||
## Simple operation | ||
|
||
Operation of Prometheus should be as simple and failure-tolerant as possible. We | ||
try to put required complexity into earlier phases, going through them less | ||
often and ideally still while under the control of a smaller subset of people. | ||
|
||
One example of this would be the preference of statically linked binaries over | ||
Review comment: This is just an artifact of how Go works, it's not a project philosophy.
Review comment: This is still here, this is not a goal of the project.
||
dynamically-built ones. | ||
|
||
## Keep dependencies clear and limited | ||
|
||
Any non-trivial system needs to integrate with other systems. To keep the | ||
resulting complexity low, we will always try to have the fewest interfaces | ||
Review comment: Maybe mention to make debugging/understandability easier?
Review comment: done.
Review comment: I'm not seeing this change.
||
possible and keep their resulting complexity as low as possible. | ||
|
||
## Automation | ||
|
||
Computers are good at doing the same thing over and over again, and quickly. | ||
Humans tend to be better at creative tasks. | ||
|
||
Prometheus will always strive to automate away all tasks whenever possible | ||
Review comment: I don't see how this ties into our philosophy at all. Any automation like this is outside the scope of Prometheus. What we do have is a belief that alerts going to humans should require intelligent human action.
Review comment: I agree. Maybe RichiH was trying to say that SD (because it's mentioned as an example) dynamically discovering targets is more "automatic" than e.g. an old-style static Nagios hosts configuration? But the section also feels a bit unclear / out of place to me.
Review comment: +1. The automation piece fits somewhere as advice - don't do this without CM, etc - maybe this is more about configuration and dynamism?
Review comment: Seems more like a place to link to My Philosophy on Alerting. I'd also caution on putting automation as a key goal for users, elimination is better.
||
through various means; some specific implementations would be service discovery, | ||
label rewriting, and alert generation. | ||
|
||
|
||
# Non-Goals | ||
Review comment: I think this is where we put in stuff about assuming you already have CM/DNS/service database/machine database etc., tying back to laying/doing one thing well.
Review comment: You mean to put in that have CMDB etc explicitly outside our scope? Isn't that obvious?
Review comment: We get requests in this area fairly regularly.
||
|
||
## Event handling | ||
|
||
Prometheus is dealing with metrics. As such, it will never process and store | ||
Review comment: This is incorrect, the pushgateway takes in an event when a batch job ends.
Review comment: I think you need to expand on this - perhaps even define an event from a metric? For example, my immediate thought for a new user was: "Wait, I can decorate a metric with event-like labels? Is it then not an event?"
Review comment: And we get into exemplars
||
events. | ||
|
||
The only exception in our ecosystem is Alertmanager which deals with individual | ||
Review comment: Alerts aren't events. I don't see how this is helping define things.
||
alerts and alert groups. | ||
|
||
For ways to deal with events, see TODO patterns. | ||
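
The distinction between events and metrics that this section relies on can be illustrated with a minimal sketch. It is not taken from the pull request: the event records, the `to_metric` helper, and the `http_requests_total` name are hypothetical and chosen only for illustration. The assumption is that an event keeps every occurrence with its individual details, while a metric keeps only an aggregated value per small, bounded label set.

```python
from collections import Counter

# Hypothetical raw events: one record per request, each with its own details.
events = [
    {"path": "/api", "status": 200, "user": "alice"},
    {"path": "/api", "status": 500, "user": "bob"},
    {"path": "/api", "status": 200, "user": "carol"},
]


def to_metric(events):
    """Aggregate events into a counter keyed by a small, bounded label set.

    Per-event details such as the user are deliberately dropped.
    """
    counts = Counter()
    for event in events:
        counts[(event["path"], event["status"])] += 1
    return counts


http_requests_total = to_metric(events)

# Print one sample line per label combination.
for (path, status), value in sorted(http_requests_total.items()):
    print(f'http_requests_total{{path="{path}",status="{status}"}} {value}')
```

In this sketch the three events collapse into two samples; the individual users cannot be recovered from the metric, which is the trade-off the section describes.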
|
||
## Push-type system | ||
|
||
Prometheus is, and always will be, a pull-type system. We strongly believe that | ||
this makes operational sense in all but the very largest of scales. | ||
Review comment: This is incorrect, we believe it works at all scales. Also both ways work at scale. The more salient point is that a lot of things we do only really work with push.
Review comment: I would argue that Monarch-style mixed systems allow for even more scale. This is beyond this document's scope, though. Also, I think you meant pull. It makes sense to expand on those things, though. will put in a TODO into patterns and link back to there.
Review comment: I disagree, both can be scaled indefinitely. Also Monarch is more pull than push.
Review comment: Yeah, there's no limits to scale with either approach, but there's the usual known pros+cons for both approaches, and we're quite married to the pull approach, as Prometheus is so designed around being in control of pulling and processing data when it wants.
Review comment: I'd just say the first sentence and then link to a blog post (Julius' from a while back perhaps or the FAQ entry. To the vast majority of users this is a non-issue.
Review comment: Yet many users continue to try and make it do push (and tend to be quite unhappy when told their approach won't work and/or be supported), so I think it's worth a few sentences.
Review comment: Changed.
||
|
||
For ways to integrate with push-type systems, see TODO patterns. |
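
As a rough illustration of what "pull-type" means here, the following sketch has a target expose its current state over HTTP while a separate scraper decides when to fetch it. It is a toy under stated assumptions, not Prometheus code: `MetricsHandler`, `scrape`, `REQUESTS_SEEN`, and the `demo_requests_total` metric are hypothetical names, and only the Python standard library is used.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS_SEEN = 0  # hypothetical application counter


class MetricsHandler(BaseHTTPRequestHandler):
    """Target side: only exposes current state, never sends it anywhere."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = f"demo_requests_total {REQUESTS_SEEN}\n".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def scrape(url):
    """Scraper side: decides when to fetch, and fetches on its own schedule."""
    # Ignore any proxy settings from the environment for this local demo.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as resp:
        return resp.read().decode()


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

REQUESTS_SEEN = 7  # the application updates its state independently
sample = scrape(f"http://127.0.0.1:{server.server_port}/metrics")
print(sample, end="")

server.shutdown()
server.server_close()
```

The target never needs to know who is collecting from it; the scraper controls the timing, which matches the reviewers' point that Prometheus is designed around being in control of pulling and processing data when it wants.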
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
--- | ||
title: Philosophy | ||
sort_rank: 99 | ||
nav_icon: flask | ||
--- |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,69 @@ | ||
--- | ||
title: Philosophy | ||
sort_rank: 2 | ||
--- | ||
|
||
# Do one thing well | ||
|
||
We believe in the [Unix | ||
philosophy](https://en.wikipedia.org/wiki/Unix_philosophy), abridged from [Doug | ||
McIlroy's initial version from 1978](http://emulator.pdp-11.org.ru/misc/1978.07_-_Bell_System_Technical_Journal.pdf): | ||
|
||
1. Make each program do one thing well. | ||
While the scope of "one thing" invariably encompasses more and more | ||
elements due to increased overall system and computing complexity, we are | ||
still doing one thing: ingest metric data, do computations on it, and expose | ||
it to other systems. | ||
2. Expect the output of every program to become the input to another, as yet unknown, program. | ||
Today's lingua franca is HTTP endpoints, which are used by Prometheus | ||
Review comment: HTTP endpoints with JSON But this isn't what we use everywhere. We've our own text format, and file_sd and yaml for config files.
Review comment: That's why I didn't put JSON.
Review comment: But we're also doing many things that aren't HTTP.
Review comment: Is there something you would propose instead?
Review comment: I'd remove it.
||
extensively. | ||
In the same vein, Prometheus relies heavily on its own libraries and strict | ||
layering internally. | ||
3. Design and build software, even operating systems, to be tried early, ideally within weeks. Don't hesitate to throw away the clumsy parts and rebuild them. | ||
Prometheus will always be available for free as in beer and as in speech. | ||
We ensure that master always builds, called Continuous Integration these days, | ||
and we are not afraid to replace whole sections of our codebase, e.g. our storage | ||
engine. | ||
4. Use tools in preference to unskilled help to lighten a programming task, even if you have to detour to build the tools and expect to throw some of them out after you've finished using them. | ||
Review comment: I don't understand what you're saying here
Review comment: It's a literal quote. I can shorten it to stop at the first comma, though. Would that make more sense to you? The "unskilled" part is an artefact; back then it was common to assign many tasks to what were basically filing clerks. They had the IT knowledge you would expect them to have.
Review comment: That's a bit out of the scope of the project, as we don't run an operations team. It certainly has nothing to do with the Unix philosophy.
Review comment: Hm yeah, I think some of these points sound a bit too forced in the context of Prometheus. I do agree about the "do one thing well" (which I would rather phrase as "keep pieces as simple as possible") and "open interfaces" points and would focus on those, while abandoning the others. I'm not sure we need to explicitly try to tie things so ceremoniously to the Unix philosophy.
Review comment: Changed.
||
While this is an outdated way of stating the goal, automation where possible | ||
is still one of the core characteristics of any modern philosophy. | ||
|
||
## Embrace cloud-native technologies | ||
Review comment: This is not our philosophy, in fact I'd say this explicitly isn't our philosophy. We're a technology that happens to work well for things that fall into the "cloud native" marketing term - but we also work just as well outside of that.
Review comment: With that structure, I wanted to say that this works well with good operations, but re-reading, I can see how that got lost over time. Would you think it better to remove this or to expand on how cloud-native is one of the many facets of proper operations?
Review comment: I'd have a section saying that we work across many paradigms, but aren't going to do things just for the sake of one system that is weird/missing a feature.
Review comment: +1. Maybe some (more?) ideal use cases? Although that might drift over time and become stale.
Review comment: Rewritten completely & moved down.
||
|
||
In many ways, the cloud-native approach mirrors the Unix philosophy, updating it | ||
for the modern world. | ||
|
||
1. Micro-services are the equivalent of doing one thing well | ||
2. Ubiquitous APIs enable interoperability in the cloud-native world | ||
3. Releasing early, releasing often, and failing quickly is important when | ||
failure is part of expected operations | ||
4. Automation is key, freeing up humans to make better use of their time | ||
|
||
# Be pragmatic | ||
|
||
To not lose focus, we need to be honest to our users and ourselves about what we | ||
can do and not do. | ||
|
||
# Be open | ||
|
||
We will always put as much of our code, discussions, presentations, and other | ||
Review comment: I think you're missing an "as possible" here grammatically
Review comment: done.
||
content into a form and place which is accessible in the long term, free of | ||
charge. | ||
|
||
# Play well with others | ||
|
||
Prometheus is a project of convicted and passionate individuals. As we do not | ||
Review comment: I don't think convicted was the word you meant
Review comment: 🔒
Review comment: ...are you sure...?
||
have a profit motive, nor quarterly projections, or any other requirement to | ||
meet arbitrary business requirements, we can focus on getting things right. This | ||
also means that we are free to suggest other implementations and projects if | ||
they are a better fit for a particular use-case. | ||
|
||
# Be inclusive | ||
Review comment: It seems weird to mix this in with the other technical stuff
Review comment: Is there another place where this is mentioned? I feel like this needs to be somewhere?
Review comment: Only indirectly in our governance: https://prometheus.io/governance/#values Maybe if/once we have a general section for developers and getting into Prometheus contributions, it would be a better fit there?
Review comment: This is very vision statementy. It has little to no bearing on the design decisions we make.
Review comment: Should we keep it in there while we don't have a better place, yet?
Review comment: Put in a TODO for now.
||
|
||
We strongly believe that technology should be accessible to all. As such, we | ||
will always strive to be welcoming to everyone. | ||
|
||
As an example of this, many of us are investing our personal time helping | ||
individuals or communities by educating and helping them to be more productive | ||
in the tech sector, as well as sponsoring diversity efforts, for example paying | ||
for travel and accommodation at [PromCon](https://promcon.io). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
--- | ||
title: Philosophy overview | ||
sort_rank: 1 | ||
--- | ||
|
||
We have structured our considerations for developing and working with Prometheus | ||
into four groups in decreasing order of importance, and likelihood to change. | ||
|
||
1. Philosophy | ||
2. Goals and Non-Goals | ||
3. Design decisions | ||
4. Patterns and Anti-Patterns | ||
|
||
While we are not implying that anything in this section is likely to change at | ||
all, it's still more likely for us to change a particular pattern over time than | ||
the underlying philosophy. |
Review comment: I think we miss an introduction here