
metrics-generator: use Prometheus Agent WAL and remote storage #1323

Merged

Conversation

yvrhdn
Member

@yvrhdn yvrhdn commented Mar 4, 2022

What this PR does:
Replaces our homegrown remote write implementation with the WAL and remote storage implementation from the Prometheus Agent.

We use two components from Prometheus:

  • agent.DB: this is a WAL optimised for remote-writing metrics only. It is a Prometheus TSDB without the querying, alerting, and other capabilities. Whenever we scrape/collect metrics, we append the samples to agent.DB, which stores them on disk.
  • remote.Storage (only the remote write functionality): this tails the WAL and writes data to the configured remote write endpoint(s). It has retry logic and can scale up queues as necessary. It supports a wide range of authorization options and can rewrite labels; see https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write
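As an illustration of the remote write options mentioned above, a configuration like the following enables basic auth and label filtering. This is a generic Prometheus remote_write sketch, not this PR's config surface; the URL, credentials path, and metric-name regex are placeholders:

```yaml
remote_write:
  - url: https://prometheus.example.com/api/v1/write
    basic_auth:
      username: tempo
      password_file: /etc/secrets/remote-write-password
    write_relabel_configs:
      # Only forward series whose name matches (hypothetical prefix).
      - source_labels: [__name__]
        regex: "traces_spanmetrics_.*"
        action: keep
```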

Multi-tenancy: Prometheus is not multi-tenant; to support multi-tenancy we create a WAL and remote writer for every tenant sending data. Each WAL is stored in <WAL path>/<tenant ID>.
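The per-tenant layout above can be sketched in a few lines of Go. The `walDir` helper name is hypothetical (not from this PR); it only shows how each tenant's agent.DB is rooted under its own directory:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// walDir returns the on-disk WAL directory for one tenant,
// following the <WAL path>/<tenant ID> layout described above.
func walDir(walPath, tenantID string) string {
	return filepath.Join(walPath, tenantID)
}

func main() {
	// Each tenant gets its own WAL and remote writer rooted here.
	fmt.Println(walDir("/var/tempo/wal", "tenant-a"))
	fmt.Println(walDir("/var/tempo/wal", "tenant-b"))
}
```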

Resilience: even though the WAL stores samples on disk, it does not make the metrics-generator resilient to crashes. After a restart the remote writer will start from the end of the WAL, even if older data has not been sent yet. See prometheus/prometheus#8809.
But it does make the metrics-generator more resilient against an outage of the downstream TSDB. Pending samples are stored on disk, so we do not lose them or risk running out of memory. This should allow the metrics-generator to overcome short outages.

Which issue(s) this PR fixes:
Related to #1303

Checklist

  • Tests updated
  • Documentation added (will be done in a later PR)
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@yvrhdn
Member Author

yvrhdn commented Mar 4, 2022

TODO

  • Fix existing tests
  • Add e2e test
  • Verify config works as expected

@yvrhdn yvrhdn force-pushed the kvrhdn/metrics-generator-prometheus-agent-wal branch from ea81280 to afa9c01 Compare March 7, 2022 13:07
```go
	val float64
}

func TestGenerator(t *testing.T) {
```
@yvrhdn
Member Author
I've deleted these tests in favour of the e2e tests. These tests have to do a lot of config management/mocking to get the generator running and to verify that the emitted metrics are correct. This is exactly what the e2e tests do, but their code is a bit simpler.

If anyone feels strongly about keeping them, I don't mind reinstating them.

@yvrhdn yvrhdn marked this pull request as ready for review March 7, 2022 13:45
@yvrhdn yvrhdn mentioned this pull request Mar 7, 2022
26 tasks
Member

@mapno mapno left a comment


Nice! Love these changes. Just left two nits. LGTM

modules/generator/storage/instance.go Outdated Show resolved Hide resolved
@yvrhdn
Member Author

yvrhdn commented Mar 8, 2022

Hmm, TestInstance_multiTenancy seems to be flaky in CI. I'll see if I can make it more robust.

@yvrhdn yvrhdn merged commit 7894f59 into grafana:main Mar 9, 2022
@yvrhdn yvrhdn deleted the kvrhdn/metrics-generator-prometheus-agent-wal branch March 9, 2022 11:56