Feature request: Add group creation time to group_by hash #3959

Open
ccope opened this issue Aug 16, 2024 · 1 comment

ccope commented Aug 16, 2024

What did you do?

  • Alert fired, started flapping
  • Ops person manually silenced the alert
  • Ops person also acknowledged the alert in Opsgenie

What did you expect to see?

  • A new batch of alerts should not be grouped into an already resolved group

What did you see instead? Under which circumstances?

  • The alert resolved, but Alertmanager did not send a resolved notification to Opsgenie because of issue #226 ("Send resolved notification for silenced alerts")
  • A week later, new hosts started alerting, but they were grouped into the same already-acknowledged incident in Opsgenie
  • A full datacenter outage occurred because the new alerts were missed

Environment

  • Alertmanager version:
alertmanager, version 0.24.0 (branch: HEAD, revision: f484b17fa3c583ed1b2c8bbcec20ba1db2aa5f11)
  build user:       root@265f14f5c6fc
  build date:       20220325-09:31:33
  go version:       go1.17.8
  platform:         linux/amd64

@grobinson-grafana (Contributor) commented

Hi! 👋

I think the main issue here is that Alertmanager cannot close incidents if all alerts in a group are silenced.

When silencing alerts for an active incident, you need to take care and make sure the incident is closed in your IRM (Opsgenie). If you leave the incident open, and new alerts are sent from Alertmanager to the same incident, you may or may not get paged for them.

I also recommend checking your Opsgenie configuration, as it sounds like the incident might have been left open by mistake? This shouldn't happen as you should be paged at regular intervals for active incidents until they are resolved.

To answer some of your questions:

> Add group creation time to group_by hash

This won't work, I'm afraid. Consider the case where the system clocks on two Alertmanager servers are out of sync by 1ns. Each server would record a different group creation time and therefore compute a different group key, creating duplicate incidents in your IRM.

> A new batch of alerts should not be grouped into an already resolved group

Given that it had been a week since the last alert was resolved, and assuming there were no other active alerts in the group during that time, Alertmanager would have created a new group for these new alerts. However, group keys are deterministic: they do not depend on when the group was created, so a group that is "re-opened" reuses the same group key. This is intentional.
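The "re-opened" behaviour can be sketched with a toy dispatcher, again in Go and again under stated assumptions: `dispatcher`, `fire`, `resolve`, and `groupKey` are hypothetical names, and the real Alertmanager dispatcher is considerably more involved. The point is only that a group can be deleted and created afresh while its key stays the same, so a receiver that deduplicates on that key would attach the new alerts to the old incident.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// groupKey is a hypothetical deterministic key built from sorted label pairs.
func groupKey(labels map[string]string) string {
	pairs := make([]string, 0, len(labels))
	for name, value := range labels {
		pairs = append(pairs, fmt.Sprintf("%s=%q", name, value))
	}
	sort.Strings(pairs)
	return "{" + strings.Join(pairs, ",") + "}"
}

// dispatcher is a simplified model that only tracks groups which
// currently hold firing alerts.
type dispatcher struct {
	groups map[string]int // group key -> number of firing alerts
}

// fire adds an alert to its group and reports whether the group was new.
func (d *dispatcher) fire(labels map[string]string) (key string, created bool) {
	key = groupKey(labels)
	_, exists := d.groups[key]
	d.groups[key]++
	return key, !exists
}

// resolve removes an alert and deletes the group once it is empty.
func (d *dispatcher) resolve(labels map[string]string) {
	key := groupKey(labels)
	if d.groups[key] > 0 {
		d.groups[key]--
	}
	if d.groups[key] == 0 {
		delete(d.groups, key)
	}
}

func main() {
	d := &dispatcher{groups: map[string]int{}}
	groupLabels := map[string]string{"alertname": "HostDown", "datacenter": "dc1"}

	first, createdFirst := d.fire(groupLabels)
	d.resolve(groupLabels) // group is now empty and gets deleted

	// A week later: a brand-new group object, but the same key.
	second, createdSecond := d.fire(groupLabels)

	fmt.Println("new group both times:", createdFirst && createdSecond)
	fmt.Println("same key both times:", first == second)
}
```

Both checks come out true in this model: the second group is new from the dispatcher's point of view, yet an incident that was left open under the first key is indistinguishable from it downstream.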
