Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metrics are not aggregated when in clustered mode #15

Closed
botimer opened this issue Feb 5, 2021 · 1 comment · Fixed by #16
Closed

Metrics are not aggregated when in clustered mode #15

botimer opened this issue Feb 5, 2021 · 1 comment · Fixed by #16

Comments

@botimer
Copy link
Contributor

botimer commented Feb 5, 2021

TLDR: all of the puma gauges should use aggregation: :most_recent to be accurate.


When using the DirectFileStore to support multi-process mode (aka clustered, multiple workers), the pid labels included by default lead to incorrect and stale information. This is related to #8, yabeda-prometheus#10, and others.

As-is, each process reports info about all processes, giving some multiplicative data, and workers that are terminated are not pruned from the metrics (at least until a total restart cleans all of the bin files). As an example, say you run with 4 workers and 8 threads, and three have crashed/restarted. With the pid labels, these will appear as 8x4 threads for every worker, giving 7x32 (4 live workers, and 3 expired), or 224 threads, rather than the actual 32.

The main issue is that prometheus-client cannot predict the correct type of aggregation for gauge metrics, so they do the safe default of including the metric with a pid label. In the clustered puma scenario, this is not helpful, as each process gets all of the data, yielding some cross-talk. The pid is volatile, as well as irrelevant. What we want is only the index label to identify each worker and to export only the most recent data.

Fortunately, the design of prometheus-client allows these aggregation settings to pass through and no-op for the Synchronized and SingleThreaded stores, and the MOST_RECENT aggregation is appropriate for all 7 of the gauges. This makes the patch simple and unconditional.

@Envek Envek closed this as completed in #16 Feb 5, 2021
Envek pushed a commit that referenced this issue Feb 5, 2021
This change updates all of the gauges to include "most recent" aggregation mode introduced in prometheus-client 2.1.0. This is safe for both single- and multi-process mode because only the DirectFileStore from prometheus-client handles the aggregation option -- there is no aggregation to be done in single-process mode or with other adapters.

Fixes #15
@Envek
Copy link
Member

Envek commented Feb 5, 2021

Wow! Thank you for very detailed explanation!

My bad, I wasn't tracking Prometheus Ruby client development closely and missed addition of new :most_recent aggregation mode in version 2.1.0.

I believe that it should fix #8

Please try version 0.6.0 of this gem that has your PR merged and share your experience.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants