
Send resolved notification for silenced alerts #226

Open

fabxc opened this issue Jan 10, 2016 · 64 comments

Comments

@fabxc
Contributor

fabxc commented Jan 10, 2016

Alerts should be marked as resolved as they are silenced so a respective resolved notification is sent. This removes the need to manually resolve these in PagerDuty and friends.

We should probably provide information that tells whether it was an actual resolve or a resolve-via-silence.

This adds another dimension to the problem of pre-computing silences (i.e. not silencing at notification time) in a sane way.

@raypettersen

This would be very useful for teams. Hope to see this feature soon.

@lswith

lswith commented Nov 30, 2016

Could this be implemented by sending an acknowledge instead of a resolved notification?

@jkemp101

I think I have a slightly different use case. We often have 3 alert levels: info, warning, and critical. At critical, PagerDuty gets notified. All alert levels send notifications to HipChat. I often silence an alert when I see it's at the warning level so that if it goes critical it doesn't notify PagerDuty. But then I miss the resolved HipChat message, so I have to keep checking the status of the alert manually. I don't really want to silence resolves.

I wouldn't mind a different message going to HipChat letting everyone know I silenced it; that would be useful info. But I wouldn't want a message going to HipChat stating the alert has been resolved when I silence it; that would be misleading.
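For illustration, a routing setup along those lines might look roughly like the sketch below (receiver names, label values, and URLs are assumptions; the chat system is wired up as a generic webhook here):

  route:
    receiver: team-chat
    routes:
      # Critical alerts page; continue so the catch-all chat route below also matches.
      - matchers:
          - severity = "critical"
        receiver: team-pagerduty
        continue: true
      # Every severity level (info, warning, critical) is posted to the chat receiver.
      - matchers:
          - severity =~ "info|warning|critical"
        receiver: team-chat

  receivers:
    - name: team-chat
      webhook_configs:
        - url: https://chat.example.com/alertmanager  # placeholder chat bridge
          send_resolved: true
    - name: team-pagerduty
      pagerduty_configs:
        - service_key: <secret>
          send_resolved: true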

@ivan-kiselev

+1. We use our own status board to notify external users when problems occur in our infrastructure. We use webhooks to create events there, so when a problem occurs we want external users to see it on the status board, but sometimes we don't want a bunch of email (Slack, etc.) notifications since we already know about the problem, so we silence the alert. But because no resolve webhook is sent while the alert is silenced, our status board still shows the problem as present when it has actually been resolved. That causes confusion, and it would be nice if silenced alerts reported the resolved state.

@kamalmarhubi

This would be incredibly useful.

@fabxc

This adds another dimension to the problem of pre-computing silences (i.e. not silencing at notification time) in a sane way.

Is there another issue or something with an explanation of what you mean here?

@kamalmarhubi

@fabxc could you elaborate a bit? I'm willing to put half a day or so into this, as it'd be a great improvement to the Prometheus + PagerDuty combo. The implementation looks fairly straightforward except for the uncertainty from your note above.

@kamalmarhubi

@fabxc ping on the above question: how do you see this interacting with precomputed silences? I would like to work on this, but I'd like to avoid doing something that has to be thrown away in the near future.

@kamalmarhubi

Another ping here. Could any other maintainers route this to the right person?

@PMDubuc

PMDubuc commented Aug 2, 2018

+1 We're using Alerta to manage alerts from Alertmanager through a webhook. There's a plugin that lets us silence alerts in Alertmanager when they are acknowledged in Alerta. When a silenced alert clears, no resolved notification is sent to Alerta, so the alert still looks unresolved there. If the problem recurs and it's still acknowledged in Alerta, it's automatically silenced in Alertmanager, because the silence for the alert remains active in Alertmanager, preventing alerts from being sent to other destinations. I think that if send_resolved is true in the webhook_configs, the notification should be sent even for silenced alerts, and the silence should be expired in Alertmanager.
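For context, a minimal sketch of the relevant receiver configuration, assuming a webhook pointed at Alerta (the URL is a placeholder); today the resolved notification is dropped while a matching silence is active even with send_resolved enabled:

  receivers:
    - name: alerta
      webhook_configs:
        - url: http://alerta.example.com/api/webhooks/prometheus  # placeholder Alerta endpoint
          send_resolved: true  # currently has no effect while a matching silence is active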

@ezraroi

ezraroi commented Dec 26, 2018

Any comment from the maintainers?

@juliusv
Member

juliusv commented Dec 26, 2018

I am not an Alertmanager maintainer, but the semantic meaning of a silence is that it stops notifications; a silence does not indicate that the matched alerts are resolved (on the contrary, oftentimes you still want to see them as unresolved, active alerts on the Alertmanager dashboard; they should just not notify anyone anymore). I'm not sure how to best marry that concept to platforms like Alerta, but I just wanted to provide background information to explain why this would be conceptually problematic.

@PMDubuc

PMDubuc commented Dec 26, 2018

I am not an Alertmanager maintainer, but the semantic meaning of a silence is that it stops notifications; a silence does not indicate that the matched alerts are resolved (on the contrary, oftentimes you still want to see them as unresolved, active alerts on the Alertmanager dashboard; they should just not notify anyone anymore). I'm not sure how to best marry that concept to platforms like Alerta, but I just wanted to provide background information to explain why this would be conceptually problematic.

Yes, but there is no problem with silencing unresolved alerts. It's when they are resolved that silence becomes a problem.

@ezraroi

ezraroi commented Dec 26, 2018

Exactly. If we still get resolved notifications, everything will be fine.

@juliusv
Member

juliusv commented Dec 27, 2018

@PMDubuc @ezraroi I see, yes, that is different from what the initial issue description said: "Alerts should be marked as resolved as they are silenced so a respective resolved notification is sent."... but then the discussion drifted toward sending resolved notifications when silenced alerts are actually resolved, which can make sense. In that case though, what do we do with cases where even a resolved notification pages or bugs someone, and they would be annoyed if a silence still allowed resolved notifications? Sounds like either we'd live with that possibility or introduce yet another option.

@PMDubuc

PMDubuc commented Dec 27, 2018

@PMDubuc @ezraroi I see, yes, that is different from what the initial issue description said: "Alerts should be marked as resolved as they are silenced so a respective resolved notification is sent."... but then the discussion drifted toward sending resolved notifications when silenced alerts are actually resolved, which can make sense. In that case though, what do we do with cases where even a resolved notification pages or bugs someone, and they would be annoyed if a silence still allowed resolved notifications? Sounds like either we'd live with that possibility or introduce yet another option.

Unresolved alerts are silenced because otherwise they are sent repeatedly. Repeated alerts can be annoying while the problem is being resolved or while resolution needs to be delayed. On the other hand, resolved notifications are only sent once. It's nice to know when the monitor sees that the problem is resolved. The behavior I would like is the way problem acknowledgements are handled in Nagios.

@satterly

@juliusv If the webhook payload includes whether or not the alert has been silenced when sending resolved notifications then it could be up to the receiver how to handle the different scenarios you describe.

@PMDubuc

The behavior I would like is the way problem acknowledgements are handled in Nagios.

Can you provide a link or briefly describe how Nagios handles this?

@juliusv
Member

juliusv commented Dec 27, 2018

@PMDubuc

Unresolved alerts are silenced because otherwise they are sent repeatedly.

Systems like PagerDuty handle this by grouping all alerts with the same incident key (= alert grouping labels hash) without notifying about each subsequent one. This is kind of how Alertmanager expects receivers to behave. Or if they can't, then set the repeat_interval really high, so you get fewer (or practically no) repetitions.
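As a rough illustration of that last suggestion (the values and receiver name are arbitrary), the route would simply set a very long repeat_interval:

  route:
    receiver: team-pagerduty
    group_by: [alertname, cluster]
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 7d  # very high: practically no repeat notifications for still-firing groups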

@satterly

If the webhook payload includes whether or not the alert has been silenced when sending resolved notifications then it could be up to the receiver how to handle the different scenarios you describe.

For the webhook that'd be an option. How would that be handled for all the other more specific receiver types though? Sending resolved notifications for silenced alerts on the webhook receiver only would seem inconsistent.

@PMDubuc

PMDubuc commented Dec 27, 2018

@PMDubuc

Unresolved alerts are silenced because otherwise they are sent repeatedly.

Systems like PagerDuty handle this by grouping all alerts with the same incident key (= alert grouping labels hash) without notifying about each subsequent one. This is kind of how Alertmanager expects receivers to behave. Or if they can't, then set the repeat_interval really high, so you get fewer (or practically no) repetitions.

Well, this is fine. Alerta handles this case too, but I don't think it can be expected of all receivers, like email or others that may have a proprietary interface. If repeated notifications stop when they have not been silenced, they can be expired by the receiver. So "silencing" the alert is a form of acknowledgement that there is a problem and someone knows about it and is working on it. But when the problem clears, how are Alertmanager receivers supposed to be notified if the resolved notification is not sent? I explained this problem in my Aug. 2nd comment above. I don't understand why silencing alerts also applies to resolved notifications when send_resolved is true. I would think this would also be a problem for other receivers like PagerDuty. If no notification is sent when a problem is resolved, receivers can't update their status for the problem.

@satterly
The way Nagios handles notifications for problems that are acknowledged is to silence the active problem notifications. When the problem clears, an OK status notification is sent and the acknowledgement is automatically removed. I think Alertmanager should do the same thing with silences, since they are also a form of acknowledgement of an active problem. When the problem clears, the silence should be cleared also; it makes no sense to keep silencing notifications about a resolved problem. A resolved notification should be sent (unless send_resolved is false).

@brian-brazil
Contributor

I would expect that anything stopping the alert from firing would result in the same effect, whether that be the alert resolving, a silence, or an inhibition.

@PMDubuc

PMDubuc commented Dec 27, 2018

I would expect that anything stopping the alert from firing would result in the same effect, whether that be the alert resolving, a silence, or an inhibition.

As I have tried to explain, I think having an exception to this for resolved notifications makes sense, and the lack of this exception presents real problems for receivers. Also, when silences persist after a problem has been resolved, they can prevent a new instance of the problem from being detected.

@juliusv
Member

juliusv commented Dec 27, 2018

@PMDubuc

Also when silences persist after a problem has been resolved, it can prevent a new instance of the problem from being detected.

Silences can cover arbitrary label combinations and do not have to correspond exactly to the alerts in an alert grouping. So you'd have a hard time finding a silence exactly matching the alert group in which all alerts just got resolved. Secondly, silences are frequently used to suppress notifications about flapping alerts, or in maintenance situations where stuff can be going up and down for a while, and you wouldn't want a resolved alert to remove a matching silence there either.

But yeah, maybe resolve notification behavior should be changed.

@ezraroi

ezraroi commented Dec 30, 2018

Maybe we could have a property in the config, similar to send_resolved for the webhook, that sends information about silences of alerts. This would ease the integration of other platforms with Alertmanager.
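Something along these lines, say, where send_silenced is a hypothetical option that does not exist in Alertmanager today (the name and placement are just for illustration):

  receivers:
    - name: status-board
      webhook_configs:
        - url: http://status-board.example.com/hooks/alertmanager  # placeholder
          send_resolved: true
          # Hypothetical, not implemented: also notify the webhook when matching alerts
          # are silenced and when silenced alerts resolve.
          send_silenced: true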

@brian-brazil
Contributor

I don't see this as something that should be configurable, and send_resolved already causes enough implementation problems.

@ezraroi

ezraroi commented Dec 31, 2018

OK, so @brian-brazil, my suggestion is:

  1. When an alert is silenced, send a silenced event to the webhook and stop sending firing events as long as it is silenced.
  2. When an alert is resolved, send a resolved event to the webhook regardless of the silence status.

I think this stays consistent and allows platforms to have all the information they need. What do you say?

@brian-brazil
Contributor

And what if the silence expires and the same alert is still firing? Is another notification for “firing” going to be sent in this case?

Yes, at the next group_interval.

What would the notification look like? (with the original time stamp or treated like another new alert?)

Notifications don't have a timestamp, so I presume you're talking about the alert start time, which is an implementation detail you shouldn't depend on. It might be the same value as in previous notifications, or it might be a different one - same as always.

this pseudo resolution would wrongly tell downstream systems that the alert is resolved when it is potentially not.

I think you might be mixing up definitions of resolved. Resolved for the alertmanager is that the alert does not send firing notifications, it doesn't tell us anything about whether the underlying issue is resolved as that'd require a human to determine. See https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped

@aranair

aranair commented Nov 12, 2020

I think you might be mixing up definitions of resolved. Resolved for the alertmanager is that the alert does not send firing notifications, it doesn't tell us anything about whether the underlying issue is resolved as that'd require a human to determine. See https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped

Hmm, I get where you’re coming from with your note about the underlying issue, and I do agree with that post wholeheartedly (even if the current situation may suggest otherwise, heh).

But I think I should clear up what I meant: “when this new pseudo-resolved notification is sent, the (Prometheus) alert is still firing.” (i.e. I was not referring to the real-world underlying issue)

With this new pseudo resolution, we basically could end up with different states between:

  • Prometheus rule/alert, firing
  • Alertmanager, tells everyone downstream that it’s resolved when it’s really just silenced (but actually knows it’s still firing)
  • Downstream receives a resolution; resolves the incident (again, not in any way implying humans stop looking) - while still having a firing Prometheus alert.

It would probably be okay for downstream receivers in charge of paging, but for historical / alert-recording receivers, I think this could be inaccurate and misleading. If it is the intention for AM to restrict the purpose of webhook receivers to pagers, I think it may be okay, but I don’t think that is the case, right?

(Side note: Your solution does solve my very specific issue of having orphaned unresolved incidents when alerts resolve during silences)

@brian-brazil
Contributor

If it is the intention for AM to restrict the purpose of webhook receivers to pagers, I think it may be okay but I don’t think that is the case, right?

That's not the intention - it's meant to cover anything. I'd personally never send resolved alerts to humans in any case; they're only a distraction from potential firing notifications.

It would probably be okay for downstream receivers in charge of paging, but for historical / alert-recording receivers, I think this could be inaccurate and misleading.

That's a general problem with treating resolved notifications as meaning the alert is resolved, which is already incorrect today due to inhibitions and how group_interval works. If you're trying to get a 100% complete log of firing alerts (as distinct from notifications), the webhook is not a good way to do that.

Your solution does solve my very specific issue of having orphaned unresolved incidents when alerts resolve during silences

That's the basic problem I see here: we're inconsistent. The notification should have happened after the silence was put in.

@PMDubuc

PMDubuc commented Nov 13, 2020

What if it's changed so a silence can optionally be expired when the issue is resolved instead of after a set time? In that case the resolved notification is sent; otherwise it's not sent until the silence expires on its own (or not sent at all when the silence expires on its own, if that's the way it works now). Would that cover each scenario in a consistent way that also avoids confusion?

@aranair

aranair commented Nov 18, 2020

That's a general problem with treating resolved notifications as meaning the alert is resolved, which is already incorrect today due to inhibitions and how group_interval works. If you're trying to get a 100% complete log of firing alerts (as distinct from notifications), the webhook is not a good way to do that.

Do you have any suggestions as to how that (complete log of firing alerts, not notifs) might be achieved? Silences/inhibits can be disabled altogether, but then I think there would be no real way to ignore certain alerts (not notifs), except maybe to build that logic downstream.

@isavcic

isavcic commented Mar 26, 2021

Silence should be tied to the current state of the alert and not to the alert, IMO.

In several alerting platforms, "ack" means "don't alert me on this until the state changes" and that is something, I believe, a majority of people on this issue want to achieve. The only case when this is backfiring is when the alert state flaps, but, as @fabxc pointed out years ago, "frequent state change of an alert can be prevented by using a fitting alerting expression", so that's a non-issue.

@brian-brazil by not at least providing an option to remove a silence on state change, you're essentially forcing your view of how monitoring should be done on a lot of users and other monitoring platform authors who don't necessarily agree with you and/or have long-established, tried-and-true platforms which can't easily integrate with Alertmanager just because of your inflexible stance on this. I mean, just look at the guy who runs two Alertmanagers serially in order to fix his problem...

@jutley

jutley commented Sep 3, 2021

Silence should be tied to the current state of the alert and not to the alert, IMO.

I think this is getting at the core of the issue. My team consistently finds ourselves having to manually resolve alerts that are no longer firing simply because Alertmanager's silences are way too simple. In our case, we would like silences to prevent notifications when new alerts come up, but allow notifications when alerts transition to resolved. If silences were augmented to allow us (either the silence creator or Alertmanager operator) to select specific state transitions, this would completely solve the problem for us.

I cannot count the number of times I have had to explain to our engineers why they are still being alerted on something that has resolved. After the explanation, there is always a shared consensus that Alertmanager is doing it wrong.

@tom0010

tom0010 commented Sep 24, 2021

I actually built a small workaround for this for the new Grafana Alertmanager with a webserver:

grafana/grafana#39615

But I agree with @isavcic: it should be up to the user to decide how the alerts are managed.
And it seems that a lot of people want this feature.

@andrew-phi

Having automatic "end-by-resolve" acks/silences would be a most desirable feature indeed! Also, is there any way of adding a notification at the end of a silence period? Because, for example, some very strange or serious alerts start firing, people panic, and an investigation starts, only to find out it's an expired silence from a month ago.

@tulequ

tulequ commented Oct 11, 2021

My workaround: use a custom webhook server to send alerts and handle silences.

@jutley

jutley commented Nov 8, 2021

@tulequ Would you mind sharing a little more detail about your workaround? Since in this scenario the alert is silenced, I would imagine that your webhook is never called from Alertmanager.

@hw4liu

hw4liu commented Jan 5, 2022

I have an idea: #2811

@sthaha

sthaha commented Feb 10, 2022

Hi Everyone,

This topic has been long-standing and, as @roidelapluie mentioned earlier, the original request has digressed into multiple feature requests. However, I was wondering if we could reach a consensus on a subset of the requirements (excuse my naivety :-), I am new to this community).

What I could gather from all of the discussions above is that most of us seem to agree/wish that Alertmanager should send a resolved notification for alerts that are inhibited.

A production use case is documented in #2754, which demonstrates an issue we commonly face.

Use case

In the flow below, there is a theoretical inhibition rule that says

when alert2 is firing, alert1 should be suppressed

NOTE: Prom = Prometheus | AM = Alertmanager | PD = PagerDuty

  1. Prom: alert1 fires
    - AM gets alert1, routes to PD
    - PD receives alert1

  2. Prom: alert2 fires
    - AM gets alert2
    - AM inhibits alert1
    - AM routes alert2 to PD
    - PD receives alert2
    - PD now has two alerts

    • alert1
    • alert2
  3. Prom: alert1 resolves
    - AM resolves alert1
    - PD receives no notification as the alert is suppressed
    - PD: alert1 becomes orphaned

  4. Prom: alert2 resolves
    - AM resolves alert2
    - PD receives resolved notification for alert2
    - PD resolves alert2
    - PD retains alert1 that is now orphaned

I wonder if the maintainers (@simonpasquier, @w0rm, @roidelapluie) are open to allowing this use case to be addressed by adding a config option such as notify_resolved_for_inhibitted_alerts which, if enabled, would send the resolved notification, thereby preserving the current behaviour as the default.
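For illustration only, the inhibition from the use case and the proposed option might be sketched as follows; the equal label and the per-receiver placement of the (hypothetical, not yet existing) flag are assumptions:

  inhibit_rules:
    - source_matchers:
        - alertname = "alert2"
      target_matchers:
        - alertname = "alert1"
      equal: [cluster]  # assumed common label

  receivers:
    - name: team-pagerduty
      pagerduty_configs:
        - routing_key: <secret>
          send_resolved: true
          # Hypothetical flag as proposed above; not an existing Alertmanager option.
          notify_resolved_for_inhibitted_alerts: true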

@matejzero

I think that is one case; another one is "Nagios-like" alerting, where Alertmanager would send resolved notifications even if an alert is silenced. At the moment, alerts that are silenced and resolved within the silence timeframe don't get resolved in PagerDuty, and you get orphaned alerts.

@sgametrio

Friendly bump for the PR #3034 :)

Is there anyone around that could review/suggest a different approach? It would be SO helpful to us too! 🙏

@typeid

typeid commented Aug 3, 2023

We have two setups and are lacking the ability to send resolved notifications for inhibited alerts in one of them:

Decentralized observability (one prometheus & alertmanager per k8s cluster)

  • Prometheus and alertmanager running on a k8s cluster
  • PagerDuty as receiver for the alerts

To silence a cluster, we are using a custom operator that deletes the Alertmanager's receiver (the PagerDuty service). After the silence is removed, the service is re-created.

Centralized observability (one alertmanager and centralized prometheus for the whole "fleet" of clusters)

  • Each k8s cluster is remote writing its metrics to the centralized prometheus
  • Centralized prometheus metrics are used for centralized alertmanager alerting

To silence a cluster, we create an "inhibition alert", effectively inhibiting all alerts matching the cluster's identifier. This inhibition alert is filtered out in PagerDuty by severity so that it does not page.
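As a rough sketch of that inhibition rule (the alert name and the cluster label are assumptions):

  inhibit_rules:
    # The cluster-wide "inhibition alert" suppresses every other alert that carries
    # the same cluster identifier label.
    - source_matchers:
        - alertname = "ClusterSilenced"   # assumed name of the inhibition alert
      target_matchers:
        - alertname != "ClusterSilenced"
      equal: [cluster_id]                 # assumed cluster identifier label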

Our issue with this is that alerts created before the inhibition are still present in PagerDuty, causing confusion and clutter.

@tulequ Would you mind sharing a little more detail about your workaround? Since in this scenario the alert is silenced, I would imagine that your webhook is never called from Alertmanager.

I believe we are currently using a similar setup to get rid of alerts created in PagerDuty before the inhibition: we are using a PagerDuty webhook that triggers when an "inhibition alert" is received. This webhook makes an API call to one of our services to resolve all active alerts that would be inhibited. Without the feature requested here, it seems like adding your own automation to clean up after Alertmanager is necessary to deal with the problem.

@uberspot

Friendly bump for the PR #3034 :)

Is there anyone around that could review/suggest a different approach? It would be SO helpful to us too! 🙏

Could someone review this PR for a potential fix? 🙏 We're running into the same issue, which is fixed there.
