Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce cardinality of pg_stat_statements #765

Merged
merged 1 commit into from
Apr 13, 2023
Merged

Conversation

SuperQ
Copy link
Contributor

@SuperQ SuperQ commented Mar 6, 2023

Make the example queries.yaml pg_stat_statements query safer.

  • Select the top 10% of queries by total query time.
  • Only expose the top 100 queries by total query time.
  • Keep only the most useful metrics.
  • Comment out the example by default.

Fixes: #549

@SuperQ SuperQ requested a review from sysadmind March 6, 2023 08:36
@SuperQ
Copy link
Contributor Author

SuperQ commented Mar 6, 2023

👓 @benjimin

This will make the default example much safer for production use.

@SuperQ SuperQ force-pushed the superq/stat_statements branch 2 times, most recently from d01fb17 to 659f474 Compare March 6, 2023 09:28
@benjimin
Copy link

benjimin commented Mar 7, 2023

I do not think this fixes #549 and would like to suggest another approach.

This commit literally increases the cardinality further instead of reducing it:

  • Not only does it still multiply the number of separate time series by the number of distinct queries, but now on top of that it multiplies it by the number of users.

Including LIMIT 100 does not really help the problem:

  • It hides the problem a bit, since the number of metrics per scrape will be capped.
  • But each new scrape will present different labels, so the total number of separate (and recently-active) time-series stored by prometheus will still be astronomical.
    This will still risk significant financial impacts for users of services (like Grafana Cloud) which bill by recently active time-series count.
  • The fact that most of these time series will contain terribly few sample points is merely symptomatic of usage unsuited to prometheus (and violating the admonition of the prometheus documentation not to apply labels with unbounded value sets, already cited in Remove QueryID to fix #549 #761).

@SuperQ if you're intent on using queryid there are ways this might fit better with the prometheus data model, say:

  • A counter metric that shows select count(distinct queryid). (This could be employed downstream to determine the rate at which new queries are introduced, or the proportion of queries that get repeated without modification.)
  • A guage metric that shows the most recent query id. (Samples of this gauge time-series could be employed downstream to visualise the frequency distribution of different queries.)

Also, just some general comments:

  • It's perfectly legitimate to periodically emit measurements with high-cardinality labels (like queryid and userid) in logs, just not in prometheus metrics.
  • I find it reasonable to expect most users will follow the instructions (particularly README-RDS.md if they are using AWS RDS) and will not read (nor fork) the code for themselves nor second-guess the authors combined expertise in prometheus, postgres, and proper usage of this exporter. If the concern won't be eliminated then it would be helpful to warn these users with some mention in the markdown documentation files.
  • I think the defaults should be suited for production databases with heavy usage. (A database with only one or two users, that is only ever used to repeat a handful of distinct queries, likely has much less need for metrics instrumentation anyway.)
  • I don't see harm in commenting the example out completely if it is problematic for many users.

@SuperQ
Copy link
Contributor Author

SuperQ commented Mar 7, 2023

This configuration is taken from gitlab.com production. It is well tested and reasonable.

If you don't want to use the use the query, you are free to remove it from the config file. Users with artificial constraints on series have to manage this. It's not the job of the exporters to do this for them. The goal of the exporters is to provide useful information.

The whole point of this example is to expose per-query metrics. This is perfectly reasonable to do in many situations and provides valuable insight into applications.

@SuperQ
Copy link
Contributor Author

SuperQ commented Mar 7, 2023

A database with only one or two users, that is only ever used to repeat a handful of distinct queries, likely has much less need for metrics instrumentation anyway

This is actually the opposite, this is actually the primary use case for this. Production databases with a high rate of queries where you need to track per-queryid efficiency trends over time.

Most production uses have fairly stable queryid patterns, because the queryid is normalized without the parameters to the query. So SELECT * from users where user_id=XXXX is always the same queryid. So they contain a high amount of samples and are very appropriate for metrics use cases.

Make the example queries.yaml `pg_stat_statements` query safer.
* Select the top 10% of queries by total query time.
* Only expose the top 100 queries by total query time.
* Keep only the most useful metrics.
* Comment out the example by default.

Fixes: #549

Signed-off-by: SuperQ <superq@gmail.com>
@SuperQ SuperQ force-pushed the superq/stat_statements branch from 659f474 to e7f58a4 Compare March 7, 2023 08:31
@SuperQ
Copy link
Contributor Author

SuperQ commented Mar 30, 2023

Ping @sysadmind

Copy link
Contributor

@sysadmind sysadmind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change is fine. If users want something different this file is free for users to edit in their own deployments.

One thing I notice here is dropping the check for 'rdsadmin'. I think that's safe because we're no longer joining to pg_roles, but you seem more familiar with this code so just double check that this won't break RDS please.

@SuperQ
Copy link
Contributor Author

SuperQ commented Apr 13, 2023

Yes, I think we simplified things to not need the join.

@SuperQ SuperQ merged commit 994e5cc into master Apr 13, 2023
@SuperQ SuperQ deleted the superq/stat_statements branch April 13, 2023 09:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pg_stat_statement metrics have unreasonable cardinality
3 participants