Reduce cardinality of pg_stat_statements #765

SuperQ · 2023-03-06T08:36:54Z

Make the example queries.yaml pg_stat_statements query safer.

Select the top 10% of queries by total query time.
Only expose the top 100 queries by total query time.
Keep only the most useful metrics.
Comment out the example by default.

Fixes: #549

SuperQ · 2023-03-06T08:37:21Z

👓 @benjimin

This will make the default example much safer for production use.

benjimin · 2023-03-07T01:02:02Z

I do not think this fixes #549 and would like to suggest another approach.

This commit literally increases the cardinality further instead of reducing it:

Not only does it still multiply the number of separate time series by the number of distinct queries, but now on top of that it multiplies it by the number of users.

Including LIMIT 100 does not really help the problem:

It hides the problem a bit, since the number of metrics per scrape will be capped.
But each new scrape will present different labels, so the total number of separate (and recently-active) time-series stored by prometheus will still be astronomical.
This will still risk significant financial impacts for users of services (like Grafana Cloud) which bill by recently active time-series count.
The fact that most of these time series will contain terribly few sample points is merely symptomatic of usage unsuited to prometheus (and violating the admonition of the prometheus documentation not to apply labels with unbounded value sets, already cited in Remove QueryID to fix #549 #761).

@SuperQ if you're intent on using queryid there are ways this might fit better with the prometheus data model, say:

A counter metric that shows select count(distinct queryid). (This could be employed downstream to determine the rate at which new queries are introduced, or the proportion of queries that get repeated without modification.)
A guage metric that shows the most recent query id. (Samples of this gauge time-series could be employed downstream to visualise the frequency distribution of different queries.)

Also, just some general comments:

It's perfectly legitimate to periodically emit measurements with high-cardinality labels (like queryid and userid) in logs, just not in prometheus metrics.
I find it reasonable to expect most users will follow the instructions (particularly README-RDS.md if they are using AWS RDS) and will not read (nor fork) the code for themselves nor second-guess the authors combined expertise in prometheus, postgres, and proper usage of this exporter. If the concern won't be eliminated then it would be helpful to warn these users with some mention in the markdown documentation files.
I think the defaults should be suited for production databases with heavy usage. (A database with only one or two users, that is only ever used to repeat a handful of distinct queries, likely has much less need for metrics instrumentation anyway.)
I don't see harm in commenting the example out completely if it is problematic for many users.

SuperQ · 2023-03-07T08:07:41Z

This configuration is taken from gitlab.com production. It is well tested and reasonable.

If you don't want to use the use the query, you are free to remove it from the config file. Users with artificial constraints on series have to manage this. It's not the job of the exporters to do this for them. The goal of the exporters is to provide useful information.

The whole point of this example is to expose per-query metrics. This is perfectly reasonable to do in many situations and provides valuable insight into applications.

SuperQ · 2023-03-07T08:14:15Z

A database with only one or two users, that is only ever used to repeat a handful of distinct queries, likely has much less need for metrics instrumentation anyway

This is actually the opposite, this is actually the primary use case for this. Production databases with a high rate of queries where you need to track per-queryid efficiency trends over time.

Most production uses have fairly stable queryid patterns, because the queryid is normalized without the parameters to the query. So SELECT * from users where user_id=XXXX is always the same queryid. So they contain a high amount of samples and are very appropriate for metrics use cases.

Make the example queries.yaml `pg_stat_statements` query safer. * Select the top 10% of queries by total query time. * Only expose the top 100 queries by total query time. * Keep only the most useful metrics. * Comment out the example by default. Fixes: #549 Signed-off-by: SuperQ <superq@gmail.com>

SuperQ · 2023-03-30T07:29:36Z

Ping @sysadmind

sysadmind

I think this change is fine. If users want something different this file is free for users to edit in their own deployments.

One thing I notice here is dropping the check for 'rdsadmin'. I think that's safe because we're no longer joining to pg_roles, but you seem more familiar with this code so just double check that this won't break RDS please.

SuperQ · 2023-04-13T05:30:59Z

Yes, I think we simplified things to not need the join.

SuperQ requested a review from sysadmind March 6, 2023 08:36

SuperQ force-pushed the superq/stat_statements branch 2 times, most recently from d01fb17 to 659f474 Compare March 6, 2023 09:28

SuperQ force-pushed the superq/stat_statements branch from 659f474 to e7f58a4 Compare March 7, 2023 08:31

sysadmind approved these changes Apr 13, 2023

View reviewed changes

SuperQ merged commit 994e5cc into master Apr 13, 2023

SuperQ deleted the superq/stat_statements branch April 13, 2023 09:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce cardinality of pg_stat_statements #765

Reduce cardinality of pg_stat_statements #765

SuperQ commented Mar 6, 2023 •

edited

Loading

SuperQ commented Mar 6, 2023 •

edited

Loading

benjimin commented Mar 7, 2023

SuperQ commented Mar 7, 2023

SuperQ commented Mar 7, 2023 •

edited

Loading

SuperQ commented Mar 30, 2023

sysadmind left a comment

SuperQ commented Apr 13, 2023

Reduce cardinality of pg_stat_statements #765

Reduce cardinality of pg_stat_statements #765

Conversation

SuperQ commented Mar 6, 2023 • edited Loading

SuperQ commented Mar 6, 2023 • edited Loading

benjimin commented Mar 7, 2023

SuperQ commented Mar 7, 2023

SuperQ commented Mar 7, 2023 • edited Loading

SuperQ commented Mar 30, 2023

sysadmind left a comment

Choose a reason for hiding this comment

SuperQ commented Apr 13, 2023

SuperQ commented Mar 6, 2023 •

edited

Loading

SuperQ commented Mar 6, 2023 •

edited

Loading

SuperQ commented Mar 7, 2023 •

edited

Loading