Documentation/op-guide: Add rules for Prometheus 2.0 #8848
@@ -0,0 +1,143 @@
groups:
- name: etcd3_alert.rules
  rules:
  - alert: InsufficientMembers
    expr: count(up{job="etcd"} == 0) > (count(up{job="etcd"}) / 2 - 1)
    for: 3m
    labels:
      severity: critical
    annotations:
      description: If one more etcd member goes down the cluster will be unavailable
      summary: etcd cluster insufficient members
  - alert: NoLeader
    expr: etcd_server_has_leader{job="etcd"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      description: etcd member {{ $labels.instance }} has no leader
      summary: etcd member has no leader
  - alert: HighNumberOfLeaderChanges
    expr: increase(etcd_server_leader_changes_seen_total{job="etcd"}[1h]) > 3
    labels:
      severity: warning
    annotations:
      description: etcd instance {{ $labels.instance }} has seen {{ $value }} leader
        changes within the last hour
      summary: a high number of leader changes within the etcd cluster are happening
  - alert: HighNumberOfFailedGRPCRequests
    expr: sum(rate(etcd_grpc_requests_failed_total{job="etcd"}[5m])) BY (grpc_method)
      / sum(rate(etcd_grpc_total{job="etcd"}[5m])) BY (grpc_method) > 0.01
    for: 10m
    labels:
      severity: warning
    annotations:
      description: '{{ $value }}% of requests for {{ $labels.grpc_method }} failed
        on etcd instance {{ $labels.instance }}'
      summary: a high number of gRPC requests are failing
  - alert: HighNumberOfFailedGRPCRequests
    expr: sum(rate(etcd_grpc_requests_failed_total{job="etcd"}[5m])) BY (grpc_method)
      / sum(rate(etcd_grpc_total{job="etcd"}[5m])) BY (grpc_method) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      description: '{{ $value }}% of requests for {{ $labels.grpc_method }} failed
        on etcd instance {{ $labels.instance }}'
      summary: a high number of gRPC requests are failing
  - alert: GRPCRequestsSlow
    expr: histogram_quantile(0.99, rate(etcd_grpc_unary_requests_duration_seconds_bucket[5m]))
      > 0.15
    for: 10m
    labels:
      severity: critical
    annotations:
      description: on etcd instance {{ $labels.instance }} gRPC requests to {{ $labels.grpc_method
        }} are slow
      summary: slow gRPC requests
  - alert: HighNumberOfFailedHTTPRequests
    expr: sum(rate(etcd_http_failed_total{job="etcd"}[5m])) BY (method) / sum(rate(etcd_http_received_total{job="etcd"}[5m]))
      BY (method) > 0.01
    for: 10m
    labels:
      severity: warning
    annotations:
      description: '{{ $value }}% of requests for {{ $labels.method }} failed on etcd
        instance {{ $labels.instance }}'
      summary: a high number of HTTP requests are failing
  - alert: HighNumberOfFailedHTTPRequests
    expr: sum(rate(etcd_http_failed_total{job="etcd"}[5m])) BY (method) / sum(rate(etcd_http_received_total{job="etcd"}[5m]))
      BY (method) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      description: '{{ $value }}% of requests for {{ $labels.method }} failed on etcd
        instance {{ $labels.instance }}'
      summary: a high number of HTTP requests are failing
  - alert: HTTPRequestsSlow
    expr: histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m]))
      > 0.15
    for: 10m
    labels:
      severity: warning
    annotations:
      description: on etcd instance {{ $labels.instance }} HTTP requests to {{ $labels.method
        }} are slow
      summary: slow HTTP requests
  - record: instance:fd_utilization
    expr: process_open_fds / process_max_fds
  - alert: FdExhaustionClose
    expr: predict_linear(instance:fd_utilization[1h], 3600 * 4) > 1
    for: 10m
    labels:
      severity: warning
    annotations:
      description: '{{ $labels.job }} instance {{ $labels.instance }} will exhaust
        its file descriptors soon'
      summary: file descriptors soon exhausted
  - alert: FdExhaustionClose
    expr: predict_linear(instance:fd_utilization[10m], 3600) > 1
    for: 10m
    labels:
      severity: critical
    annotations:
      description: '{{ $labels.job }} instance {{ $labels.instance }} will exhaust
        its file descriptors soon'
      summary: file descriptors soon exhausted
  - alert: EtcdMemberCommunicationSlow
    expr: histogram_quantile(0.99, rate(etcd_network_member_round_trip_time_seconds_bucket[5m]))
      > 0.15
    for: 10m
    labels:
      severity: warning
    annotations:
      description: etcd instance {{ $labels.instance }} member communication with
        {{ $labels.To }} is slow
      summary: etcd member communication is slow
  - alert: HighNumberOfFailedProposals
    expr: increase(etcd_server_proposals_failed_total{job="etcd"}[1h]) > 5
    labels:
      severity: warning
    annotations:
      description: etcd instance {{ $labels.instance }} has seen {{ $value }} proposal
        failures within the last hour
      summary: a high number of proposals within the etcd cluster are failing
  - alert: HighFsyncDurations
    expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
      > 0.5
    for: 10m
    labels:
      severity: warning
    annotations:
      description: etcd instance {{ $labels.instance }} fsync durations are high
      summary: high fsync durations
  - alert: HighCommitDurations
    expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))
      > 0.25
    for: 10m
    labels:
      severity: warning
    annotations:
      description: etcd instance {{ $labels.instance }} commit durations are high
      summary: high commit durations
@@ -100,7 +100,7 @@ Now Prometheus will scrape etcd metrics every 10 seconds.

 ### Alerting

-There is a [set of default alerts for etcd v3 clusters](./etcd3_alert.rules).
+There is a set of default alerts for etcd v3 clusters for [Prometheus 1.x](./etcd3_alert.rules) as well as [Prometheus 2.x](./etcd3_alert.rules).

You're absolutely right. I'll fix it with the follow-up PR!

 > Note: `job` labels may need to be adjusted to fit a particular need. The rules were written to apply to a single cluster so it is recommended to choose labels unique to a cluster.

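To make the note about `job` labels concrete, here is a minimal sketch of how a Prometheus 2.x setup might load the rules for one specific cluster. The job name `etcd-prod-eu`, the rule file path, and the target address are hypothetical placeholders and are not part of this PR; `rule_files`, `scrape_configs`, `job_name`, and `static_configs` are standard Prometheus configuration keys.

```yaml
# prometheus.yml (fragment)
rule_files:
  - /etc/prometheus/rules/etcd3_alert.rules.yml   # hypothetical path to the 2.x rules

scrape_configs:
  - job_name: etcd-prod-eu                        # hypothetical, unique per cluster
    static_configs:
      - targets: ['localhost:2379']               # hypothetical etcd client endpoint
```

Each `job="etcd"` selector in the rules would then be changed to match, for example:

```yaml
groups:
- name: etcd3_alert.rules
  rules:
  - alert: NoLeader
    # Same rule as in this PR, with only the job selector adjusted.
    expr: etcd_server_has_leader{job="etcd-prod-eu"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      description: etcd member {{ $labels.instance }} has no leader
      summary: etcd member has no leader
```

Keeping the job name unique per cluster matters most for aggregate rules such as `InsufficientMembers`, which count all `up` series for the job and would otherwise mix members of different clusters.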
@@ -0,0 +1,91 @@
groups:
- name: etcd_alert.rules
  rules:
  - alert: InsufficientMembers
    expr: count(up{job="etcd"} == 0) > (count(up{job="etcd"}) / 2 - 1)
    for: 3m
    labels:
      severity: critical
    annotations:
      description: If one more etcd member goes down the cluster will be unavailable
      summary: etcd cluster insufficient members
  - alert: HighNumberOfFailedHTTPRequests
    expr: sum(rate(etcd_http_failed_total{code!~"^(?:4[0-9]{2})$",job="etcd"}[5m]))
      BY (method) / sum(rate(etcd_http_received_total{job="etcd"}[5m])) BY (method)
      > 0.01
    for: 10m
    labels:
      severity: warning
    annotations:
      description: '{{ $value }}% of requests for {{ $labels.method }} failed on etcd
        instance {{ $labels.instance }}'
      summary: a high number of HTTP requests are failing
  - alert: HighNumberOfFailedHTTPRequests
    expr: sum(rate(etcd_http_failed_total{code!~"^(?:4[0-9]{2})$",job="etcd"}[5m]))
      BY (method) / sum(rate(etcd_http_received_total{job="etcd"}[5m])) BY (method)
      > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      description: '{{ $value }}% of requests for {{ $labels.method }} failed on etcd
        instance {{ $labels.instance }}'
      summary: a high number of HTTP requests are failing
  - alert: HighNumberOfFailedHTTPRequests
    expr: sum(rate(etcd_http_failed_total{code=~"^(?:4[0-9]{2})$",job="etcd"}[5m]))
      BY (method) / sum(rate(etcd_http_received_total{job="etcd"}[5m])) BY (method)
      > 0.5
    for: 10m
    labels:
      severity: critical
    annotations:
      description: '{{ $value }}% of requests for {{ $labels.method }} failed with
        4xx responses on etcd instance {{ $labels.instance }}'
      summary: a high number of HTTP requests are failing
  - alert: HTTPRequestsSlow
    expr: histogram_quantile(0.99, rate(etcd_http_successful_duration_second_bucket[5m]))
      > 0.15
    for: 10m
    labels:
      severity: warning
    annotations:
      description: on etcd instance {{ $labels.instance }} HTTP requests to {{ $labels.method
        }} are slow
      summary: slow HTTP requests
  - record: instance:fd_utilization
    expr: process_open_fds / process_max_fds
  - alert: FdExhaustionClose
    expr: predict_linear(instance:fd_utilization[1h], 3600 * 4) > 1
    for: 10m
    labels:
      severity: warning
    annotations:
      description: '{{ $labels.job }} instance {{ $labels.instance }} will exhaust
        its file descriptors soon'
      summary: file descriptors soon exhausted
  - alert: FdExhaustionClose
    expr: predict_linear(instance:fd_utilization[10m], 3600) > 1
    for: 10m
    labels:
      severity: critical
    annotations:
      description: '{{ $labels.job }} instance {{ $labels.instance }} will exhaust
        its file descriptors soon'
      summary: file descriptors soon exhausted
  - alert: HighNumberOfFailedProposals
    expr: increase(etcd_server_proposal_failed_total{job="etcd"}[1h]) > 5
    labels:
      severity: warning
    annotations:
      description: etcd instance {{ $labels.instance }} has seen {{ $value }} proposal
        failures within the last hour
      summary: a high number of proposals within the etcd cluster are failing
  - alert: HighFsyncDurations
    expr: histogram_quantile(0.99, rate(etcd_wal_fsync_durations_seconds_bucket[5m]))
      > 0.5
    for: 10m
    labels:
      severity: warning
    annotations:
      description: etcd instance {{ $labels.instance }} fsync durations are high
      summary: high fsync durations
Actually, should we update this to our new go-grpc-prometheus metrics?
Reference: #8802
/cc @xiang90
Yes, I was going to do that in a subsequent PR, happy to do it here as well though.
Ok sounds good.
Can you file another PR?
Thanks!
Will do.
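
For readers following the thread, a rough sketch of what a gRPC failure-rate rule based on the go-grpc-prometheus metrics might look like is given below. It is not the follow-up PR. It assumes etcd exposes the standard go-grpc-prometheus server counter `grpc_server_handled_total` with its `grpc_code`, `grpc_service`, and `grpc_method` labels under `job="etcd"`; the threshold and `for` duration are copied from the existing warning-level rule in this PR, and the annotation wording is illustrative.

```yaml
groups:
- name: etcd3_alert.rules
  rules:
  - alert: HighNumberOfFailedGRPCRequests
    # Ratio of non-OK responses to all handled requests, per service and method.
    # Assumes the go-grpc-prometheus counter grpc_server_handled_total is scraped.
    expr: sum(rate(grpc_server_handled_total{grpc_code!="OK",job="etcd"}[5m])) BY (grpc_service, grpc_method)
      / sum(rate(grpc_server_handled_total{job="etcd"}[5m])) BY (grpc_service, grpc_method)
      > 0.01
    for: 10m
    labels:
      severity: warning
    annotations:
      description: 'gRPC method {{ $labels.grpc_service }}/{{ $labels.grpc_method }}
        is failing at a ratio of {{ $value }}'
      summary: a high number of gRPC requests are failing
```

Because the expression aggregates away `instance`, the description refers only to the labels that survive the `BY` clause, and `{{ $value }}` is reported as a ratio between 0 and 1, not a percentage.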