Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Many prometheus metrics are duplicated #11106

Closed
richvdh opened this issue Oct 18, 2021 · 5 comments
Closed

Many prometheus metrics are duplicated #11106

richvdh opened this issue Oct 18, 2021 · 5 comments
Assignees
Labels
A-Metrics metrics, measures, stuff we put in Prometheus O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Minor Blocks non-critical functionality, workarounds exist. T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.

Comments

@richvdh
Copy link
Member

richvdh commented Oct 18, 2021

A number of the reported Prometheus metrics are duplicated. This can be a problem for Prometheus instances configured to monitor a large number of Synapse instances.

Examples:

  • we have two copies of each counter, one with and one without a _total suffix
  • synapse_util_caches_cache:hits is duplicated as synapse_util_caches_cache_hits.
@DMRobertson DMRobertson added the T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. label Oct 18, 2021
@MadLittleMods MadLittleMods added A-Metrics metrics, measures, stuff we put in Prometheus z-maintenance labels Nov 10, 2021
@DMRobertson DMRobertson added this to the Server Density milestone Jul 21, 2022
@reivilibre
Copy link
Contributor

Which ones would we want to remove?

We use synapse_util_caches_cache:hits on the dashboard so I'd be in favour of removing synapse_util_caches_cache_hits (N.B. no hits in the dashboard JSON).

Equally, we don't use many _total-suffixed counters, so I'd be in favour of removing those.

Though I will bring to your attention that there are a few where we do:

rei@lithium ~/work/synapse/contrib $ rg _total grafana/synapse.json 
430:          "expr": "rate(process_cpu_seconds_total{instance=\"$instance\",job=~\"$job\",index=~\"$index\"}[$bucket_size])",
797:              "expr": "rate(process_cpu_system_seconds_total{instance=\"$instance\",job=~\"$job\",index=~\"$index\"}[$bucket_size])",
806:              "expr": "rate(process_cpu_user_seconds_total{instance=\"$instance\",job=~\"$job\",index=~\"$index\"}[$bucket_size])",
4668:              "expr": "sum(rate(synapse_federation_soft_failed_events_total{instance=\"$instance\"}[$bucket_size]))",
7436:              "expr": "rate(python_gc_unreachable_total{instance=\"$instance\",job=~\"$job\",index=~\"$index\"}[$bucket_size])/rate(python_gc_time_count{instance=\"$instance\",job=~\"$job\",index=~\"$index\"}[$bucket_size])",
8124:              "expr": "synapse_replication_tcp_resource_total_connections{job=~\"$job\",index=~\"$index\",instance=\"$instance\"}",

A good first step may be to correct our dashboard JSON in preparation for the change.

I also imagine this would want to be communicated as a breaking change.

Suggested steps as follows:

  1. Fix dashboard JSON to not use _total at all
  2. Remove *_total and synapse_util_caches_cache_* from the published metrics.

@richvdh
Copy link
Member Author

richvdh commented Jul 28, 2022

Equally, we don't use many _total-suffixed counters, so I'd be in favour of removing those.

This is a bit complicated. Let me try and give some history:

There is a standard called OpenMetrics, which mandates that the names of counters end in _total (see https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#counter-1). Back in prometheus-client 0.4.0, the Prometheus maintainers changed prometheus-client to automatically add _total to the names of any counters that don't already have it, for compliance with OpenMetrics (see prometheus/client_python#300).

Of course, that broke everything for us (cf prometheus/client_python#317), and we worked around it by writing a custom exposition.py which basically forks a bunch of prometheus-client to expose both the _total and non-_total variants. Our intention was that we would get rid of the non-_total variants soon after, but here we are four years on.

Sooo... where does that leave us? There's no real reason we have to use the OpenMetrics-compliant names (the prometheus server seems happy either way), but it would still be quite nice to get rid of that custom exposition.py (and future versions of prometheus, or other OpenMetrics server implementations, might be less tolerant). So basically: I'd suggest switching to the _total variants if possible.

@richvdh
Copy link
Member Author

richvdh commented Jul 28, 2022

We use synapse_util_caches_cache:hits on the dashboard so I'd be in favour of removing synapse_util_caches_cache_hits (N.B. no hits in the dashboard JSON).

According to https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels, we're not supposed to use colons:

Note: The colons are reserved for user defined recording rules. They should not be used by exporters or direct instrumentation.

@reivilibre
Copy link
Contributor

reivilibre commented Aug 16, 2022

New steps:

Timeline

  • v1.68.0 enable_legacy_metrics: false available
  • v1.71.0 Legacy metrics disabled by default
  • v1.73.0 Legacy metrics removed altogether

@reivilibre reivilibre self-assigned this Aug 17, 2022
@DMRobertson DMRobertson added S-Minor Blocks non-critical functionality, workarounds exist. O-Uncommon Most users are unlikely to come across this or unexpected workflow and removed z-maintenance labels Aug 25, 2022
@reivilibre
Copy link
Contributor

Finished up in #14358 ... as far as I'm aware!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Metrics metrics, measures, stuff we put in Prometheus O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Minor Blocks non-critical functionality, workarounds exist. T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.
Projects
None yet
Development

No branches or pull requests

4 participants