docs/philosophy: Initial commit #1054

Open
wants to merge 5 commits into base: main
Changes from 2 commits
78 changes: 78 additions & 0 deletions content/docs/philosophy/goals.md
@@ -0,0 +1,78 @@
---
title: Goals and Non-Goals
sort_rank: 3
---

Member

I think we miss an introduction here

# Goals

## Resilience

First and foremost, Prometheus must be resilient in operation.


## Reliable alerting

As a monitoring system, Prometheus is relied upon to alert humans when
they need to take action to prevent an undesired system state.

Thus, its most important function is to keep the pipeline of ingestion, rule
evaluation, and alert notifications working.

The second most important function is to give humans context about these alerts
by allowing access to the most recent data Prometheus ingested
Contributor

fullstop

I'd add something around data that's good enough to make useful engineering decisions, rather than 100% perfect data.


### Resulting design decisions and patterns

Note that this goal might result in widely different design decisions and thus
operational patterns for different parts of our ecosystem:

For Prometheus itself, this means running every instance as an island of data
completely detached from every other instance.

For Alertmanager on the other hand, it means the exact opposite: meshing all
instances closely together, sharing knowledge about alerts and their
Contributor

This is too vague. The important point is that it's AP rather than CP

Member Author

Please expand those two.

Contributor

Member Author

That didn't click in this context, but yes, we can simply refer to Brewer. It might lead to quite a bit more explanation of why we chose which. I would then be tempted to pull this into the design decisions and simply have the two link to each other, if linking at all.

notifications.
Member

I would not say that they are closely meshed, just that they are meshed, as they can lose their mesh by design. I would also then specify what happens when they can't mesh.


## Simple operation

Operation of Prometheus should be as simple and failure-tolerant as possible. We
try to move required complexity into earlier phases, which are passed through less
often and ideally while still under the control of a smaller group of people.

One example of this would be the preference for statically linked binaries over
Contributor

This is just an artifact of how Go works, it's not a project philosophy.

Contributor

This is still here, this is not a goal of the project.

dynamically-built ones.

## Keep dependencies clear and limited

Any non-trivial system needs to integrate with other systems. To keep the
resulting complexity low, we will always try to have the fewest interfaces
Contributor

Maybe mention to make debugging/understandability easier?

Member Author

done.

Contributor

I'm not seeing this change.

possible and keep their resulting complexity as low as possible.

## Automation

Computers are good at doing the same thing over and over again, and quickly.
Humans tend to be better at creative tasks.

Prometheus will always strive to automate away all tasks whenever possible
Contributor

I don't see how this ties into our philosophy at all. Any automation like this is outside the scope of Prometheus.

What we do have is a belief that alerts going to humans should require intelligent human action.

Member

I agree. Maybe RichiH was trying to say that SD (because it's mentioned as an example) dynamically discovering targets is more "automatic" than e.g. an old-style static Nagios hosts configuration? But the section also feels a bit unclear / out of place to me.

Contributor

+1. The automation piece fits somewhere as advice - don't do this without CM, etc - maybe this is more about configuration and dynamism?

Contributor

@brian-brazil brian-brazil Jun 10, 2018

Seems more like a place to link to My Philosophy on Alerting.

I'd also caution on putting automation as a key goal for users, elimination is better.

through various means; some specific implementations would be service discovery,
label rewriting, and alert generation.


# Non-Goals
Contributor

I think this is where we put in stuff about assuming you already have CM/DNS/service database/machine database etc., tying back to layering/doing one thing well.

Member Author

You mean to state that having a CMDB etc. is explicitly outside our scope? Isn't that obvious?

Contributor

We get requests in this area fairly regularly.


# Event handling

Prometheus deals with metrics. As such, it will never process and store
Contributor

This is incorrect, the pushgateway takes in an event when a batch job ends.

Contributor

I think you need to expand on this - perhaps even define an event from a metric? For example, my immediate thought for a new user was: "Wait, I can decorate a metric with event-like labels? Is it then not an event?"

Member

And we get into exemplars

events.

The only exception in our ecosystem is Alertmanager which deals with individual
Contributor

Alerts aren't events.

I don't see how this is helping define things.

alerts and alert groups.

For ways to deal with events, see TODO patterns.
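
One way to picture the distinction the reviewers ask for: an event is a single occurrence with its own details, while a metric is a number aggregated over many such occurrences. The Python sketch below is an illustration only. The event fields, the `aggregate` and `render` helpers, and the metric name `demo_requests_total` are all hypothetical and not part of any Prometheus library.

```python
from collections import Counter

# Hypothetical event stream: every entry is one individual occurrence
# with its own details (path, status, user).
EVENTS = [
    {"path": "/login", "status": 200, "user": "alice"},
    {"path": "/login", "status": 200, "user": "bob"},
    {"path": "/login", "status": 500, "user": "carol"},
    {"path": "/cart", "status": 200, "user": "alice"},
]


def aggregate(events):
    """Collapse individual events into counts per (path, status) pair."""
    totals = Counter()
    for event in events:
        # Per-event detail such as the user is deliberately dropped;
        # only the aggregate count per label combination survives.
        totals[(event["path"], event["status"])] += 1
    return totals


def render(totals, name="demo_requests_total"):
    """Format the counts as 'name{labels} value' lines, sorted for stable output."""
    lines = []
    for (path, status), value in sorted(totals.items()):
        lines.append(f'{name}{{path="{path}",status="{status}"}} {value}')
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(aggregate(EVENTS)))
```

Four events collapse into three counted series, and the individual users are no longer recoverable. Under this reading, adding event-like labels to a metric still yields counts per label combination rather than a record of each occurrence.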

# Push-type system

Prometheus is, and always will be, a pull-type system. We strongly believe that
this makes operational sense in all but the very largest of scales.
Contributor

This is incorrect, we believe it works at all scales.

Also both ways work at scale. The more salient point is that a lot of things we do only really work with push.

Member Author

I would argue that Monarch-style mixed systems allow for even more scale. This is beyond this document's scope, though.

Also, I think you meant pull. It makes sense to expand on those things, though. I will put a TODO into patterns and link back to it.

Contributor

I disagree, both can be scaled indefinitely. Also Monarch is more pull than push.

Member

Yeah, there's no limits to scale with either approach, but there's the usual known pros+cons for both approaches, and we're quite married to the pull approach, as Prometheus is so designed around being in control of pulling and processing data when it wants.

Contributor

I'd just say the first sentence and then link to a blog post (Julius' from a while back, perhaps) or the FAQ entry. To the vast majority of users this is a non-issue.

Contributor

Yet many users continue to try and make it do push (and tend to be quite unhappy when told their approach won't work and/or be supported), so I think it's worth a few sentences.

Member Author

Changed.


For ways to integrate with push-type systems, see TODO patterns.
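
As a rough illustration of what "pull-type" means in the discussion above, the following Python sketch has a target expose its current state over HTTP while the collector decides when to fetch it. It uses only the standard library and is not Prometheus code. The `Target` class, the `scrape` and `demo` helpers, and the metric name `demo_requests_total` are hypothetical, and the `/metrics` path is assumed here by convention.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class Target:
    """A hypothetical instrumented application holding one counter."""

    def __init__(self):
        self.requests_total = 0

    def render(self):
        # Text output: a TYPE comment followed by a "name value" line.
        return (
            "# TYPE demo_requests_total counter\n"
            f"demo_requests_total {self.requests_total}\n"
        )


def make_handler(target):
    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            body = target.render().encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the example quiet

    return MetricsHandler


def scrape(url):
    """Collector side: fetch the target's current state on the collector's schedule."""
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as response:
        return response.read().decode("utf-8")


def demo():
    target = Target()
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), make_handler(target))
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        target.requests_total += 3  # the application does some work
        return scrape(f"http://127.0.0.1:{server.server_port}/metrics")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

The target never initiates a connection. It only answers when asked, which is the property the reviewers describe as Prometheus being in control of pulling and processing data when it wants.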
5 changes: 5 additions & 0 deletions content/docs/philosophy/index.md
@@ -0,0 +1,5 @@
---
title: Philosophy
sort_rank: 99
nav_icon: flask
---
69 changes: 69 additions & 0 deletions content/docs/philosophy/philosophy.md
@@ -0,0 +1,69 @@
---
title: Philosophy
sort_rank: 2
---

# Do one thing well

We believe in the [Unix
philosophy](https://en.wikipedia.org/wiki/Unix_philosophy), abridged from [Doug
McIlroy's initial version from 1978](http://emulator.pdp-11.org.ru/misc/1978.07_-_Bell_System_Technical_Journal.pdf):

1. Make each program do one thing well.
While the scope of "one thing" invariably encompasses more and more
elements due to increased overall system and computing complexity, we are
still doing one thing: ingest metric data, do computations on it, and expose
it to other systems.
2. Expect the output of every program to become the input to another, as yet unknown, program.
Today's lingua franca is HTTP endpoints, which are used by Prometheus
Contributor

HTTP endpoints with JSON

But this isn't what we use everywhere. We've our own text format, and file_sd and yaml for config files.

Member Author

That's why I didn't put JSON.

Contributor

But we're also doing many things that aren't HTTP.

Member Author

Is there something you would propose instead?

Contributor

I'd remove it.

extensively.
In the same vein, Prometheus relies heavily on its own libraries and strict
layering internally.
3. Design and build software, even operating systems, to be tried early, ideally within weeks. Don't hesitate to throw away the clumsy parts and rebuild them.
Prometheus will always be available for free as in beer and as in speech.
We ensure that master always builds, called Continuous Integration these days,
and we are not afraid to replace whole sections of our codebase, e.g. our storage
engine.
4. Use tools in preference to unskilled help to lighten a programming task, even if you have to detour to build the tools and expect to throw some of them out after you've finished using them.
Contributor

I don't understand what you're saying here

Member Author

It's a literal quote. I can shorten it to stop at the first comma, though. Would that make more sense to you?

The "unskilled" part is an artefact; back then it was common to assign many tasks to what were basically filing clerks. They had the IT knowledge you would expect them to have.

Contributor

That's a bit out of the scope of the project, as we don't run an operations team. It certainly has nothing to do with the Unix philosophy.

Member

@juliusv juliusv Jun 10, 2018

Hm yeah, I think some of these points sound a bit too forced in the context of Prometheus. I do agree about the "do one thing well" (which I would rather phrase as "keep pieces as simple as possible") and "open interfaces" points and would focus on those, while abandoning the others. I'm not sure we need to explicitly try to tie things so ceremoniously to the Unix philosophy.

Member Author

Changed.

While this is an outdated way of stating the goal, automation where possible
is still one of the core characteristics of any modern philosophy.

## Embrace cloud-native technologies
Contributor

This is not our philosophy, in fact I'd say this explicitly isn't our philosophy.

We're a technology that happens to work well for things that fall into the "cloud native" marketing term - but we also work just as well outside of that.

Member Author

With that structure, I wanted to say that this works well with good operations, but re-reading, I can see how that got lost over time.

Would you think it better to remove this or to expand on how cloud-native is one of the many facets of proper operations?

Contributor

I'd have a section saying that we work across many paradigms, but aren't going to do things just for the sake of one system that is weird/missing a feature.

Contributor

@jamtur01 jamtur01 Jun 10, 2018

+1. Maybe some (more?) ideal use cases? Although that might drift over time and become stale.

Member Author

Rewritten completely & moved down.


In many ways, the cloud-native approach mirrors the Unix philosophy, updating it
for the modern world.

1. Micro-services are the equivalent of doing one thing well
2. Ubiquitous APIs enable interoperability in the cloud-native world
3. Releasing early, releasing often, and failing quickly is important when
failure is part of expected operations
4. Automation is key, freeing up humans to make better use of their time

# Be pragmatic

To not lose focus, we need to be honest with our users and ourselves about what we
can and cannot do.

# Be open

We will always put as much of our code, discussions, presentations, and other
Contributor

I think you're missing an "as possible" here grammatically.

Member Author

done.

content as possible into a form and place which is accessible in the long term, free of
charge.

# Play well with others

Prometheus is a project of convicted and passionate individuals. As we do not
Contributor

I don't think convicted was the word you meant

Contributor

🔒

Member Author

...are you sure...?

have a profit motive, quarterly projections, or any other obligation to
meet arbitrary business requirements, we can focus on getting things right. This
also means that we are free to suggest other implementations and projects if
they are a better fit for a particular use-case.

# Be inclusive
Contributor

It seems weird to mix this in with the other technical stuff

Contributor

Is there another place where this is mentioned? I feel like this needs to be somewhere?

Member

Only indirectly in our governance: https://prometheus.io/governance/#values

Maybe if/once we have a general section for developers and getting into Prometheus contributions, it would be a better fit there?

Contributor

This is very vision statementy. It has little to no bearing on the design decisions we make.

Member Author

Should we keep it in there while we don't have a better place, yet?

Member Author

Put in a TODO for now.


We strongly believe that technology should be accessible to all. As such, we
will always strive to be welcoming to everyone.

As an example of this, many of us are investing our personal time helping
individuals or communities by educating and helping them to be more productive
in the tech sector, as well as sponsoring diversity efforts, for example paying
for travel and accommodation at [PromCon](https://promcon.io).
16 changes: 16 additions & 0 deletions content/docs/philosophy/philosophy_overview.md
@@ -0,0 +1,16 @@
---
title: Philosophy overview
sort_rank: 1
---

We have structured our considerations for developing and working with Prometheus
into four groups, in decreasing order of importance and increasing likelihood of change.

1. Philosophy
2. Goals and Non-Goals
3. Design decisions
4. Patterns and Anti-Patterns

While we are not implying that anything in this section is likely to change at
all, it's still more likely for us to change a particular pattern over time than
the underlying philosophy.