From b03fe38cac4e9de288765eb8a0ed391cec4a89fe Mon Sep 17 00:00:00 2001
From: Richard Towers <richard.towers@digital.cabinet-office.gov.uk>
Date: Wed, 13 Sep 2023 10:19:49 +0100
Subject: [PATCH] Default to Prometheus histograms, not summaries

Prometheus histograms / summaries are complex and hard to wrap one's
head around. The difference between them is also quite subtle and
confusing to the uninitiated.  I'm sure I won't do a good job of
explaining the difference in this commit message, so I'll just link to
the docs[0].

The key difference for us is that summaries can't be aggregates, but
histograms can.

For example, if I'm looking at http_request_duration_seconds, we might
want to know "what's the 95%ile request time for this controller
action". We can answer this question for a particular application
instance (i.e. a specific pod) with both histograms and summaries. If we
want to know the 95%ile request time aggregated across all instances /
pods however, we can only do that with histograms. This is because in
summaries, quantiles are computed ahead of time by the prometheus
client, so they can only see information for one particular app
instance. Histograms on the other hand, defer the calculation of
quantiles to query time, which means they can be aggregated (but are
less precise).

prometheus_exporter defaults to Summary[1], but in our case I think it
makes more sense to default to Histogram. There may be some apps where
we prefer Summary, so I've allowed it to be passed in as a configuration
option.

From the summary metrics we have at the moment, we can see that some
controller actions take significantly longer than the 10 seconds which
prometheus_exporter uses as it's default max bucket. Therefore, I've
added a few more buckets so we can see the distribution between 10 and
50 seconds.

[0] - https://prometheus.io/docs/practices/histograms/#quantiles
[1] - https://github.com/discourse/prometheus_exporter#histogram-mode
---
 lib/govuk_app_config/govuk_prometheus_exporter.rb | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/lib/govuk_app_config/govuk_prometheus_exporter.rb b/lib/govuk_app_config/govuk_prometheus_exporter.rb
index a452561..0e2be64 100644
--- a/lib/govuk_app_config/govuk_prometheus_exporter.rb
+++ b/lib/govuk_app_config/govuk_prometheus_exporter.rb
@@ -11,13 +11,19 @@ def self.should_configure
     end
   end
 
-  def self.configure(collectors: [])
+  def self.configure(collectors: [], default_aggregation: Prometheus::Metric::Histogram)
     return unless should_configure
 
     require "prometheus_exporter"
     require "prometheus_exporter/server"
     require "prometheus_exporter/middleware"
 
+    # PrometheusExporter::Metric::Histogram.DEFAULT_BUCKETS tops out at 10 but
+    # we have a few controller actions which are slower than this, so we add a
+    # few extra buckets for slower requests
+    PrometheusExporter::Metric::Histogram.default_buckets = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 15, 25, 50].freeze
+    PrometheusExporter::Metric::Base.default_aggregation = default_aggregation
+
     if defined?(Sidekiq)
       Sidekiq.configure_server do |config|
         require "sidekiq/api"