From b03fe38cac4e9de288765eb8a0ed391cec4a89fe Mon Sep 17 00:00:00 2001 From: Richard Towers Date: Wed, 13 Sep 2023 10:19:49 +0100 Subject: [PATCH] Default to Prometheus histograms, not summaries Prometheus histograms / summaries are complex and hard to wrap one's head around. The difference between them is also quite subtle and confusing to the uninitiated. I'm sure I won't do a good job of explaining the difference in this commit message, so I'll just link to the docs[0]. The key difference for us is that summaries can't be aggregates, but histograms can. For example, if I'm looking at http_request_duration_seconds, we might want to know "what's the 95%ile request time for this controller action". We can answer this question for a particular application instance (i.e. a specific pod) with both histograms and summaries. If we want to know the 95%ile request time aggregated across all instances / pods however, we can only do that with histograms. This is because in summaries, quantiles are computed ahead of time by the prometheus client, so they can only see information for one particular app instance. Histograms on the other hand, defer the calculation of quantiles to query time, which means they can be aggregated (but are less precise). prometheus_exporter defaults to Summary[1], but in our case I think it makes more sense to default to Histogram. There may be some apps where we prefer Summary, so I've allowed it to be passed in as a configuration option. From the summary metrics we have at the moment, we can see that some controller actions take significantly longer than the 10 seconds which prometheus_exporter uses as it's default max bucket. Therefore, I've added a few more buckets so we can see the distribution between 10 and 50 seconds. [0] - https://prometheus.io/docs/practices/histograms/#quantiles [1] - https://github.com/discourse/prometheus_exporter#histogram-mode --- lib/govuk_app_config/govuk_prometheus_exporter.rb | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/lib/govuk_app_config/govuk_prometheus_exporter.rb b/lib/govuk_app_config/govuk_prometheus_exporter.rb index a452561..0e2be64 100644 --- a/lib/govuk_app_config/govuk_prometheus_exporter.rb +++ b/lib/govuk_app_config/govuk_prometheus_exporter.rb @@ -11,13 +11,19 @@ def self.should_configure end end - def self.configure(collectors: []) + def self.configure(collectors: [], default_aggregation: Prometheus::Metric::Histogram) return unless should_configure require "prometheus_exporter" require "prometheus_exporter/server" require "prometheus_exporter/middleware" + # PrometheusExporter::Metric::Histogram.DEFAULT_BUCKETS tops out at 10 but + # we have a few controller actions which are slower than this, so we add a + # few extra buckets for slower requests + PrometheusExporter::Metric::Histogram.default_buckets = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 15, 25, 50].freeze + PrometheusExporter::Metric::Base.default_aggregation = default_aggregation + if defined?(Sidekiq) Sidekiq.configure_server do |config| require "sidekiq/api"