PromQL expression QOL #31

Open
kedoodle opened this issue Jan 4, 2022 · 8 comments

kedoodle (Contributor) commented Jan 4, 2022

We're updating our alerts to make use of the metrics exposed by the prometheus-exporter feature of aws-quota-checker. We have a generic expression which aims to alert whenever we've breached 70% of any limit.

The expression is quite unwieldy:

round( 100 *
    label_replace({__name__=~"^awsquota_([3a-z_]+(_count|_instances))$",account=~".+"}, "resource", "$1", "__name__", "^awsquota_([3a-z_]+(_count|_instances))$")
    / on (resource)
    label_replace({__name__=~"^awsquota_([3a-z_]+(_limit))$",account=~".+"}, "resource", "$1", "__name__", "^awsquota_([3a-z_]+)(_limit)$")
) > 70

A couple of suggestions that would aid in crafting PromQL expressions:

  • It would be great if the metrics had an additional label e.g.
    awsquota_rds_instances{resource="rds_instances"}
    awsquota_rds_instances_limit{resource="rds_instances"}
    
  • A bigger change, but what if all quotas were exposed through the same pair of metric names, distinguished only by the additional label as above (a sketch of the resulting alert expression follows this list)? e.g.
    awsquota_usage{resource="rds_instances"}
    awsquota_limit{resource="rds_instances"}
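
Under that hypothetical scheme, the generic alert could collapse to something like this (a sketch only, assuming both metrics carry matching account and resource labels):

round( 100 *
    awsquota_usage
    / on (account, resource)
    awsquota_limit
) > 70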
    

Feel free to disregard if this is too niche or opinionated in a direction you'd rather not take. For those facing similar grievances, recording rules could serve as a workaround.
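
For anyone going the recording-rule route in the meantime, a minimal sketch (the awsquota:usage / awsquota:limit rule names are only placeholders) could precompute the relabelled series:

groups:
  - name: awsquota
    rules:
      - record: awsquota:usage
        expr: |
          label_replace({__name__=~"^awsquota_([3a-z_]+(_count|_instances))$",account=~".+"}, "resource", "$1", "__name__", "^awsquota_([3a-z_]+(_count|_instances))$")
      - record: awsquota:limit
        expr: |
          label_replace({__name__=~"^awsquota_([3a-z_]+(_limit))$",account=~".+"}, "resource", "$1", "__name__", "^awsquota_([3a-z_]+)(_limit)$")

The alert above then reduces to round(100 * awsquota:usage / on (account, resource) awsquota:limit) > 70.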

brennerm (Owner) commented Jan 4, 2022

Hey @kedoodle, thanks for opening this issue. I understand the benefit of switching to the awsquota_usage{resource="rds_instances"} scheme. But what would be the advantage of adding a resource label to the existing metrics?

kedoodle (Contributor, Author) commented Jan 4, 2022

Hey @brennerm, appreciate the response!

I'm thinking of a scenario for "generic" expressions where we want to alert on any and all AWS limits reaching a certain threshold (as opposed to a singular resource).

TL;DR: it saves a label_replace or two.

Existing metrics:

awsquota_s3_bucket_count{account="123456789012"}
awsquota_s3_bucket_count_limit{account="123456789012"}

Existing expression (same as original issue comment):

round( 100 *
    label_replace({__name__=~"^awsquota_([3a-z_]+(_count|_instances))$",account=~".+"}, "resource", "$1", "__name__", "^awsquota_([3a-z_]+(_count|_instances))$")
    /
    label_replace({__name__=~"^awsquota_([3a-z_]+(_limit))$",account=~".+"}, "resource", "$1", "__name__", "^awsquota_([3a-z_]+)(_limit)$")
) > 70

Existing metric names with additional resource label:

awsquota_s3_bucket_count{account="123456789012",resource="s3_bucket_count"}
awsquota_s3_bucket_count_limit{account="123456789012",resource="s3_bucket_count"}

New expression with existing metric names with additional resource label:

round( 100 *
    {__name__=~"^awsquota_([3a-z_]+(_count|_instances))$",account=~".+"}
    / on (resource)
    {__name__=~"^awsquota_([3a-z_]+(_limit))$",account=~".+"}
) > 70

It could also be nice for specific alerts where you want to use the resource as part of the alert details, e.g. the alert could have a description (using metric labels) saying we have reached 70% of the limit on s3_bucket_count in 123456789012. I understand you can get the resource from the metric name - it just requires an extra label_replace for a seemingly common use case.

round( 100 *
    {__name__="awsquota_s3_bucket_count"}
    / on (resource)
    {__name__="awsquota_s3_bucket_count_limit"}
) > 70
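
To make that concrete, here is a hypothetical alerting rule (a sketch assuming the proposed resource label exists on both series, and matching on account as well so that both labels survive into the result):

groups:
  - name: awsquota-alerts
    rules:
      - alert: AwsQuotaUsageHigh
        expr: |
          round( 100 *
              awsquota_s3_bucket_count
              / on (account, resource)
              awsquota_s3_bucket_count_limit
          ) > 70
        annotations:
          description: "Reached {{ $value }}% of the {{ $labels.resource }} limit in account {{ $labels.account }}"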

brennerm (Owner) commented Jan 5, 2022

@kedoodle I agree with your point of view. I added a new label called quota in 585f1b6 that contains the quota name.
Could you provide feedback on that change? If it works for you I'll create a new release.

I'll probably also switch to the proposed awsquota_usage and awsquota_limit scheme at some point in time but that'll be part of a new major release as it's a breaking change.
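
For illustration, a rough sketch of what that future scheme could look like on the exporter side (using prometheus_client directly; this is not the exporter's actual code, and the label names come from the proposal above rather than the quota label added in 585f1b6):

from prometheus_client import Gauge, start_http_server

# Hypothetical metric pair: one usage and one limit gauge, with the quota
# identified by labels instead of being encoded in the metric name.
AWSQUOTA_USAGE = Gauge('awsquota_usage', 'Current usage of an AWS quota', ['account', 'resource'])
AWSQUOTA_LIMIT = Gauge('awsquota_limit', 'Limit of an AWS quota', ['account', 'resource'])

def publish(account: str, resource: str, usage: float, limit: float) -> None:
    AWSQUOTA_USAGE.labels(account=account, resource=resource).set(usage)
    AWSQUOTA_LIMIT.labels(account=account, resource=resource).set(limit)

if __name__ == '__main__':
    start_http_server(8080)
    publish('123456789012', 's3_bucket_count', 42, 100)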

kedoodle (Contributor, Author) commented Jan 5, 2022

Hey @brennerm, I've built and deployed from 585f1b6. The new label looks great!

[screenshots showing the new quota label on the metrics]

I understand that awsquota_usage and awsquota_limit would be a breaking change. Would love to see it in a future release.

brennerm (Owner) commented Jan 6, 2022

That's great to hear. The change has been released with version 1.10.0.

I'll leave the ticket open until I switch to the breaking change scheme.

kedoodle (Contributor, Author) commented Jan 7, 2022

Thanks @brennerm!

I'm in the process of deploying 1.10.0 into a few different k8s clusters. Probably unrelated to #31, but I'm seeing some high spikes in memory usage (~800 MiB) while refreshing current values. I've increased memory limits and will let you know (in another issue?) next week if the spikes persist over the weekend.

Container logs, after which the pod is OOMKilled:

AWS profile: default | AWS region: ap-southeast-2 | Active checks: cf_stack_count,ebs_snapshot_count,rds_instances,s3_bucket_count
07-Jan-22 04:46:33 [INFO] aws_quota.prometheus - starting /metrics endpoint on port 8080
07-Jan-22 04:46:33 [INFO] aws_quota.prometheus - collecting checks
07-Jan-22 04:46:33 [INFO] aws_quota.prometheus - collected 4 checks
07-Jan-22 04:46:33 [INFO] aws_quota.prometheus - refreshing limits
07-Jan-22 04:46:34 [INFO] aws_quota.prometheus - limits refreshed
07-Jan-22 04:46:34 [INFO] aws_quota.prometheus - refreshing current values

EDIT:
Given enough memory, we can see it takes 3 minutes 30 seconds to refresh current values:

07-Jan-22 05:04:06 [INFO] aws_quota.prometheus - refreshing current values
07-Jan-22 05:07:36 [INFO] aws_quota.prometheus - current values refreshed

This particular AWS account has ~35k EBS snapshots. I suspect pagination may be needed to reduce memory usage during any one particular check, e.g. https://github.com/brennerm/aws-quota-checker/blob/1.10.0/aws_quota/check/ebs.py#L13 in my scenario.
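
As an aside, a minimal sketch of how pagination could keep that check's memory bounded (assuming boto3's describe_snapshots paginator; this is only an illustration, not the change in #32):

import boto3

def count_ebs_snapshots(session: boto3.Session) -> int:
    # Count owned snapshots page by page instead of materialising the full list.
    paginator = session.client('ec2').get_paginator('describe_snapshots')
    total = 0
    for page in paginator.paginate(OwnerIds=['self']):
        total += len(page['Snapshots'])
    return total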

EDIT 2:
Did some troubleshooting given that most people probably don't have an AWS account with 35k EBS snapshots handy. PR opened #32.

tpoindessous commented

Hello @kedoodle,

Thanks for your work. Your expression doesn't work with this metric:

awsquota_elb_listeners_per_clb

We are trying to find a new alert rule; we will get back to you!

Thanks!

kedoodle (Contributor, Author) commented

Hopefully you can adapt the expression to something that works for your use case in lieu of the proposed awsquota_usage and awsquota_limit breaking change being implemented.
