Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metrics-generator production readiness #1303

Closed
13 of 26 tasks
mapno opened this issue Feb 23, 2022 · 3 comments
Closed
13 of 26 tasks

Metrics-generator production readiness #1303

mapno opened this issue Feb 23, 2022 · 3 comments

Comments

@mapno
Copy link
Member

mapno commented Feb 23, 2022

A new component metrics-generator has been recently added to Tempo in #1282. But it's not considered production ready yet. This issue aims to list the additional work that is needed before its official release.

The first implementation is based on design document #1206. The design document is, at the time of writing, open and we encourage everyone to take a look.

All feedback is highly appreciated. Feel free to do so in this issue, by pinging us (@kvrhdn and @mapno) at Slack, or by opening PRs with changes and proposals.

TODOs

The following list contains all tasks that we aim to complete before the component's official release. Tasks are divided by topic or area within Tempo.

Distributor

  • Include metrics-generator in distributor tests
  • Improve how we stream data from distributor -> metrics-generator

Trace format

  • Use a custom trace format (so not OTLP), but something more lightweight that contains minimal info. We should benchmark and compare performance.

Metrics generator

  • Make processors configurable per tenant and reload config at run-time (using overrides)
  • We might be ingesting old spans (= took a long time to get to Tempo), we should filter them out or count these correctly

Spanmetrics processor

  • Verify correctness, compare with agent / OTel processor
  • Ensure dimensions/labels map to valid Prometheus labels

Service graphs processor

Metrics registry

  • Deal with inactive/stale series: if a metrics-generator instance is running for multiple days, the amount of samples it will write per request will continuously grow. We should trim this now and then by removing time series that have not been updated recently.
  • Enforce max active series limits per tenant
  • Make collection interval configurable per tenant
  • Support exemplars

Remote write

Operations

  • Add a dedicated operational dashboard to tempo-mixin
  • Add a docker-compose example
  • Update the helm charts
    • tempo single-binary
    • tempo distributed

Documentation

  • Include metrics-generator on configuration page
  • Document metrics-generator ring on ring page
  • Reconcile service graphs and spanmetrics docs between Tempo and the Agent
  • Operational information
    • Describe how to deploy it and what tradeoffs there are
    • Describe what metrics to watch
@yvrhdn
Copy link
Member

yvrhdn commented Mar 7, 2022

Bug 🐛 : collections/remote writes doubles after a time

Observed this in one of our clusters: after a while the amount of collections doubles or even triples. I'm guessing we are somehow spawning extra goroutines that collect metrics. Workaround: restart the instance 🤷🏻

These instances are configured with collection_interval: 15s, so we expect 4 remote writes / minute.
Screenshot 2022-03-06 at 04 44 50

To properly solve this we should probably redesign the registry so we can get collects reliable and independent per tenant.

@yvrhdn
Copy link
Member

yvrhdn commented Mar 7, 2022

Bug 🐛 : single-binary mode crashes due to duplicate metrics registration

The metrics-generator and the compactor use the dskit BasicLifecycler. When run in single-binary modus, both lifecycler run in the same process and register the same metrics twice.
The following calls are causing the issue: ring.NewBasicLifecycler and ring.NewWithStoreClientAndStrategy.

We should tweak the dskit ring to ensure the metrics only get registered once: grafana/dskit#133

panic: duplicate metrics collector registration attempted
goroutine 1 [running]:
github.com/prometheus/client_golang/prometheus.(*wrappingRegisterer).MustRegister(0x4000492570, {0x40003ff9b0, 0x1, 0x1})
/Users/koenraad/Repositories/grafana/tempo/vendor/github.com/prometheus/client_golang/prometheus/wrap.go:104 +0x18c
github.com/prometheus/client_golang/prometheus/promauto.Factory.NewGaugeVec(
{{0x1e19ec0, 0x4000492570}}, {{0x0, 0x0}, {0x0, 0x0}, {0x19590e6, 0x1d}, {0x197fa60, 0x2b}, ...}, ...)
/Users/koenraad/Repositories/grafana/tempo/vendor/github.com/prometheus/client_golang/prometheus/promauto/auto.go:308 +0xd8
github.com/grafana/dskit/ring.NewWithStoreClientAndStrategy({{{0x1930457, 0x8}, {0x1934841, 0xb}, {{{...}, {...}, 0x4a817c800, 0x0, 0x3ff0000000000000, 0x1, ...}, ...}, ...}, ...}, ...)
/Users/koenraad/Repositories/grafana/tempo/vendor/github.com/grafana/dskit/ring/ring.go:228 +0x158
github.com/grafana/dskit/ring.New({{{0x1930457, 0x8}, {0x1934841, 0xb}, {{{...}, {...}, 0x4a817c800, 0x0, 0x3ff0000000000000, 0x1, ...}, ...}, ...}, ...}, ...)
/Users/koenraad/Repositories/grafana/tempo/vendor/github.com/grafana/dskit/ring/ring.go:213 +0x28c
github.com/grafana/tempo/pkg/ring.New({{{0x1930457, 0x8}, {0x1934841, 0xb}, {{{...}, {...}, 0x4a817c800, 0x0, 0x3ff0000000000000, 0x1, ...}, ...}, ...}, ...}, ...)
/Users/koenraad/Repositories/grafana/tempo/pkg/ring/ring.go:23 +0x170
github.com/grafana/tempo/cmd/tempo/app.(*App).initGeneratorRing(0x40006df500)
/Users/koenraad/Repositories/grafana/tempo/cmd/tempo/app/modules.go:93 +0xb0
github.com/grafana/dskit/modules.(*Manager).initModule(0x4000203638, {0x400011edb0, 0x3}, 0x400071d630, 0x4000194c60)
/Users/koenraad/Repositories/grafana/tempo/vendor/github.com/grafana/dskit/modules/modules.go:106 +0x250
github.com/grafana/dskit/modules.(*Manager).InitModuleServices(0x4000203638, {0x400071d7e8, 0x1, 0x1})
/Users/koenraad/Repositories/grafana/tempo/vendor/github.com/grafana/dskit/modules/modules.go:78 +0xe4
github.com/grafana/tempo/cmd/tempo/app.(*App).Run(0x40006df500)
/Users/koenraad/Repositories/grafana/tempo/cmd/tempo/app/app.go:261 +0x1f4
main.main()
/Users/koenraad/Repositories/grafana/tempo/cmd/tempo/main.go:108 +0xa18

@mapno mapno mentioned this issue Apr 25, 2022
3 tasks
@mapno
Copy link
Member Author

mapno commented Dec 14, 2022

I think we can close this issue as we're not following this list. Feel free to re-open or tackle any of the items still left to be completed.

@mapno mapno closed this as completed Dec 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants