metrics-generator: use Prometheus Agent WAL and remote storage #1323
TODO
	val float64
}

func TestGenerator(t *testing.T) {
I've deleted these tests in favour of the e2e tests. These tests have to do a lot of config management/mocking to get the generator running and verify that the emitted metrics are correct. This is exactly what the e2e tests do, but their code is a bit simpler.
If anyone feels strongly about keeping them, I don't mind reinstating them.
Nice! Love these changes. Left just two nits. LGTM
Hmm, |
What this PR does:
Replaces our homegrown remote write implementation with the WAL and remote storage implementation from the Prometheus Agent.
We use two components from Prometheus:
agent.DB: this is the WAL optimised for remote-writing metrics only. It's a Prometheus TSDB without the querying, alerting, and similar capabilities. Whenever we scrape/collect metrics we append them to agent.DB, which stores the samples on disk.
remote.Storage (only the remote write functionality): this tails the WAL and writes data to the configured remote write endpoint(s). It has retry logic and can scale up queues as necessary. It supports a wide range of authorization options and can rewrite labels, see https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write
Multi-tenancy: Prometheus is not multi-tenant, so to support multi-tenancy we create a WAL and remote writer for every tenant sending data. Each WAL will be stored in <WAL path>/<tenant ID>.
Resilience: even though the WAL stores samples on disk, it does not make the metrics-generator resilient to crashes. After a restart the remote writer will start from the end of the WAL, even if older data was not sent yet. See prometheus/prometheus#8809.
It will, however, make the metrics-generator more resilient against an outage of the downstream TSDB. Pending samples are stored on disk, so we do not lose them or risk running out of memory. This should allow the metrics-generator to ride out short outages.
Which issue(s) this PR fixes:
Related to #1303
Checklist
Documentation added: will be done in a later PR
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]