
metrics-generator: use Prometheus Agent WAL and remote storage #1323

Merged

Conversation

yvrhdn
Member

@yvrhdn yvrhdn commented Mar 4, 2022

What this PR does:
Replaces our homegrown remote write implementation with the WAL and remote storage implementation from the Prometheus Agent.

We use two components from Prometheus:

  • agent.DB: this is a WAL optimised for remote-writing metrics only. It is a Prometheus TSDB without the querying, alerting, and other capabilities. Whenever we scrape/collect metrics, we append the samples to agent.DB, which stores them on disk.
  • remote.Storage (only the remote write functionality): this tails the WAL and writes data to the configured remote write endpoint(s). It has retry logic and can scale up queues as necessary. It supports a wide range of authorization options and can rewrite labels; see https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write
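As an illustration of the remote write options mentioned above, a configuration like the following enables basic auth and label filtering. This is a generic Prometheus remote_write sketch, not this PR's config surface; the URL, credentials path, and metric-name regex are placeholders:

```yaml
remote_write:
  - url: https://prometheus.example.com/api/v1/write
    basic_auth:
      username: tempo
      password_file: /etc/secrets/remote-write-password
    write_relabel_configs:
      # Only forward series whose name matches (hypothetical prefix).
      - source_labels: [__name__]
        regex: "traces_spanmetrics_.*"
        action: keep
```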

Multi-tenancy: Prometheus is not multi-tenant; to support multi-tenancy we create a WAL and remote writer for every tenant sending data. Each WAL is stored in <WAL path>/<tenant ID>.
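The per-tenant layout above can be sketched in a few lines of Go. The `walDir` helper name is hypothetical (not from this PR); it only shows how each tenant's agent.DB is rooted under its own directory:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// walDir returns the on-disk WAL directory for one tenant,
// following the <WAL path>/<tenant ID> layout described above.
func walDir(walPath, tenantID string) string {
	return filepath.Join(walPath, tenantID)
}

func main() {
	// Each tenant gets its own WAL and remote writer rooted here.
	fmt.Println(walDir("/var/tempo/wal", "tenant-a"))
	fmt.Println(walDir("/var/tempo/wal", "tenant-b"))
}
```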

Resilience: even though the WAL stores samples on disk, it does not make the metrics-generator resilient to crashes. After a restart the remote writer will start from the end of the WAL, even if older data has not been sent yet. See prometheus/prometheus#8809.
But it does make the metrics-generator more resilient against an outage of the downstream TSDB. Pending samples are stored on disk, so we do not lose them or risk running out of memory. This should allow the metrics-generator to overcome short outages.

Which issue(s) this PR fixes:
Related to #1303

Checklist

  • Tests updated
  • Documentation added (will be done in a later PR)
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@yvrhdn
Member Author

yvrhdn commented Mar 4, 2022

TODO

  • Fix existing tests
  • Add e2e test
  • Verify config works as expected

@yvrhdn yvrhdn force-pushed the kvrhdn/metrics-generator-prometheus-agent-wal branch from ea81280 to afa9c01 Compare March 7, 2022 13:07
```go
	val float64
}

func TestGenerator(t *testing.T) {
```
@yvrhdn
Member Author
I've deleted these tests in favour of the e2e tests. These tests have to do a lot of config management/mocking to get the generator running and to verify that the emitted metrics are correct. This is exactly what the e2e tests do, but their code is a bit simpler.

If anyone feels strongly about keeping them, I don't mind reinstating them.

@yvrhdn yvrhdn marked this pull request as ready for review March 7, 2022 13:45
@yvrhdn yvrhdn mentioned this pull request Mar 7, 2022
26 tasks
Member

@mapno mapno left a comment


Nice! Love these changes. Just left two nits. LGTM

modules/generator/storage/instance.go Outdated Show resolved Hide resolved
@yvrhdn
Member Author

yvrhdn commented Mar 8, 2022

Hmm, TestInstance_multiTenancy seems to be flaky in CI. I'll see if I can make it more robust.

@yvrhdn yvrhdn merged commit 7894f59 into grafana:main Mar 9, 2022
@yvrhdn yvrhdn deleted the kvrhdn/metrics-generator-prometheus-agent-wal branch March 9, 2022 11:56