Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change interval to 1m for etcdHighNumberOfFailedGRPCRequests critical #12178

Closed
wants to merge 1 commit into from

Conversation

gotjosh
Copy link

@gotjosh gotjosh commented Jul 28, 2020

We've observed an erratic behaviour of this alert during rollouts. On all occasions, we've noticed the alert isn't sustained for more than a few minutes and we flag it as a false positive. I think it's because we're using the 5m rate for 5 minutes the alert only gets to see 1 sample.

We have a couple of options, doubling the alert duration (from 5m to 10m) but that seems unfeasible given the criticality of this, or use the 1m rate for 5m. The latter feels like a minimal change - and it didn't cause the alert to fire (for false positives) on occasions where the current one would.

Apologies for not creating an issue first, given the minimal diff of the change I thought this would be better discussed inside of a pull request.

current interval (5m)
Screenshot 2020-07-28 at 11 12 17

proposed interval (1m)

Screenshot 2020-07-28 at 11 24 50

Signed-off-by: gotjosh josue@grafana.com

We've observed an erratic behaviour of this alert during rollouts. On
all occasions we've noticed the alert isn't sustain for more than a few
minutes and we flag it as a false positive.Because we're using the 5m rate for 5 minutes the alert only gets to see
1 sample.

Doubling the alert duration (from 5m to 10m) but that seems unfeasible
given the criticality of this, or use the 1m rate for 5m. The latter
feels like a minimal change - and it didn't cause the alert to fire on
occasions where the current one would.

Signed-off-by: gotjosh <josue@grafana.com>
Copy link

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@gotjosh
Copy link
Author

gotjosh commented Aug 3, 2020

Just an FYI that the CI failure seems unrelated.

@gotjosh
Copy link
Author

gotjosh commented Aug 18, 2020

@xiang90 apologies for ping you directly, any chance to take a look on this? 🙏

@stale
Copy link

stale bot commented Nov 16, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Nov 16, 2020
@pracucci
Copy link

Still valid for us

@stale stale bot removed the stale label Nov 17, 2020
/
sum(rate(grpc_server_handled_total{%(etcd_selector)s}[5m])) without (grpc_type, grpc_code)
sum(rate(grpc_server_handled_total{%(etcd_selector)s}[1m])) without (grpc_type, grpc_code)
> 5
||| % $._config,
'for': '5m',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it mean that in worst case the service might by 80% of time unhealthy and the alert will not trigger ?

4min of failures, 1 min below threshold, 4min of failures, ...

Shell we decrease it to 2 or 3 min ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[marking as a requested change]

@ptabor
Copy link
Contributor

ptabor commented Feb 1, 2021

Interesting. Thank you for the contribution.
I wonder [a little off topic], what are the errors (grpc_code & method) and in what circumstances do they happen ?
Is it a multi-node cluster being upgraded one-by-one ?
I want to make sure there is no bug that we would be hiding by such alerting change.

/
sum(rate(grpc_server_handled_total{%(etcd_selector)s}[5m])) without (grpc_type, grpc_code)
sum(rate(grpc_server_handled_total{%(etcd_selector)s}[1m])) without (grpc_type, grpc_code)
> 5
||| % $._config,
'for': '5m',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[marking as a requested change]

@spzala
Copy link
Member

spzala commented Feb 10, 2021

@gotjosh thanks for the PR, and fyi that the project Doc is moved to https://github.com/etcd-io/website/ so this PR should be closed and need to open it in the doc repo. /cc @ptabor @nate-double-u

Copy link
Member

@spzala spzala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gotjosh thanks for the PR, and fyi that the project Doc is moved to https://github.com/etcd-io/website/ so this PR should be closed and need to open it in the doc repo.

@stale
Copy link

stale bot commented May 12, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label May 12, 2021
@stale stale bot closed this Jun 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

4 participants