Change interval to 1m for etcdHighNumberOfFailedGRPCRequests critical #12178
Conversation
We've observed erratic behaviour of this alert during rollouts. On all occasions, we've noticed the alert isn't sustained for more than a few minutes and we flag it as a false positive. I think this is because we're using the 5m rate with a 5m alert duration, so the alert only gets to see one sample.
We have a couple of options: doubling the alert duration (from 5m to 10m), which seems unfeasible given the criticality of this alert, or using the 1m rate for 5m. The latter feels like a minimal change - and it didn't cause the alert to fire (for false positives) on occasions where the current one would.
Apologies for not creating an issue first; given the minimal diff of the change, I thought this would be better discussed inside a pull request.
[Graphs comparing the alert expression with the current interval (5m) and the proposed interval (1m)]
Signed-off-by: gotjosh <josue@grafana.com>
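For context, this is roughly how the full rule in the etcd mixin would read with the proposed 1m window. It is a sketch rather than the exact upstream source: the numerator selector (grpc_code!="OK"), the 100 * scaling, and the labels block are assumptions about how the mixin is typically structured; only the denominator line and the 'for' duration come from the diff discussed below.

  // Sketch only: fields outside the diff approximate the etcd mixin and are not verbatim upstream code.
  {
    alert: 'etcdHighNumberOfFailedGRPCRequests',
    expr: |||
      100 * sum(rate(grpc_server_handled_total{%(etcd_selector)s, grpc_code!="OK"}[1m])) without (grpc_type, grpc_code)
        /
      sum(rate(grpc_server_handled_total{%(etcd_selector)s}[1m])) without (grpc_type, grpc_code)
        > 5
    ||| % $._config,
    'for': '5m',
    labels: {
      severity: 'critical',
    },
  },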
👍
Just an FYI that the CI failure seems unrelated.
@xiang90 apologies for pinging you directly, any chance you could take a look at this? 🙏
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Still valid for us.
        /
-     sum(rate(grpc_server_handled_total{%(etcd_selector)s}[5m])) without (grpc_type, grpc_code)
+     sum(rate(grpc_server_handled_total{%(etcd_selector)s}[1m])) without (grpc_type, grpc_code)
        > 5
    ||| % $._config,
    'for': '5m',
Does it mean that in the worst case the service might be unhealthy 80% of the time and the alert will not trigger?
4 min of failures, 1 min below threshold, 4 min of failures, ...
Shall we decrease it to 2 or 3 min?
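To make that concern concrete, a rough worked timeline (illustrative only, assuming the failure ratio stays above the 5% threshold for the whole of each 4-minute burst):

  minute       1    2    3    4    5      6    7    8    9    10
  failing      yes  yes  yes  yes  no     yes  yes  yes  yes  no
  expr > 5     yes  yes  yes  yes  no     yes  yes  yes  yes  no
  'for' timer  1m   2m   3m   4m   reset  1m   2m   3m   4m   reset

With a 1m window the expression drops back under the threshold within about a minute of the failures pausing, so the 5m 'for' timer resets every cycle and the alert never fires, even though requests are failing roughly 80% of the time.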
[marking as a requested change]
Interesting. Thank you for the contribution.
        /
-     sum(rate(grpc_server_handled_total{%(etcd_selector)s}[5m])) without (grpc_type, grpc_code)
+     sum(rate(grpc_server_handled_total{%(etcd_selector)s}[1m])) without (grpc_type, grpc_code)
        > 5
    ||| % $._config,
    'for': '5m',
[marking as a requested change]
@gotjosh thanks for the PR. FYI, the project docs have moved to https://github.com/etcd-io/website/, so this PR should be closed and reopened in the docs repo. /cc @ptabor @nate-double-u
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.