
Send resolved notification for silenced alerts #226

Open

fabxc opened this issue Jan 10, 2016 · 64 comments

Comments

@fabxc
Contributor

fabxc commented Jan 10, 2016

Alerts should be marked as resolved as they are silenced so a respective resolved notification is sent. This removes the need to manually resolve these in PagerDuty and friends.

We should probably provide information that tells whether it was an actual resolve or a resolve-via-silence.

This adds another dimension to the problem of pre-computing silences (i.e. not silencing at notification time) in a sane way.

@raypettersen

This would be very useful for teams. Hope to see this feature soon.

@lswith

lswith commented Nov 30, 2016

Could this be implemented by sending an acknowledge instead of a resolved notification?

@jkemp101

I think I have a slightly different use case. We often have 3 alert levels: info, warning, and critical. At critical, PagerDuty gets notified. All alert levels send notifications to HipChat. I often silence an alert when I see it's at the warning level so that if it goes critical it doesn't notify PagerDuty. But then I miss the resolved HipChat message, so I have to keep checking the status of the alert manually. I don't really want to silence resolves.

I wouldn't mind a different message going to HipChat letting everyone know I silenced it; that would be useful info. But I wouldn't want a message going to HipChat stating the alert has been resolved when I silence it; that would be misleading.
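For illustration, a routing setup along those lines might look roughly like the sketch below (receiver names, label values, and URLs are assumptions; the chat system is wired up as a generic webhook here):

  route:
    receiver: team-chat
    routes:
      # Critical alerts page; continue so the catch-all chat route below also matches.
      - matchers:
          - severity = "critical"
        receiver: team-pagerduty
        continue: true
      # Every severity level (info, warning, critical) is posted to the chat receiver.
      - matchers:
          - severity =~ "info|warning|critical"
        receiver: team-chat

  receivers:
    - name: team-chat
      webhook_configs:
        - url: https://chat.example.com/alertmanager  # placeholder chat bridge
          send_resolved: true
    - name: team-pagerduty
      pagerduty_configs:
        - service_key: <secret>
          send_resolved: true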

@ivan-kiselev

+1. We use our own status board to notify external users when problems occur in our infrastructure. We use webhooks to create events there, so when a problem occurs we want external users to see it on the status board, but sometimes we don't want a bunch of email (Slack, etc.) notifications since we already know about the problem, so we silence the alert. But because no resolve webhook is sent while the alert is silenced, our status board still shows the problem as present when it has actually been resolved. That causes confusion, and it would be nice if silenced alerts reported the resolved state.

@kamalmarhubi

This would be incredibly useful.

@fabxc

This adds another dimension to the problem of pre-computing silences (i.e. not silencing at notification time) in a sane way.

Is there another issue or something with an explanation of what you mean here?

@kamalmarhubi

@fabxc could you elaborate a bit? I'm willing to put half a day or so into this, as it'd be a great improvement to the Prometheus + PagerDuty combo. The implementation looks fairly straightforward except for the uncertainty from your note above.

@kamalmarhubi

@fabxc ping on the above question: how do you see this interacting with precomputed silences? I would like to work on this, but I'd like to avoid doing something that has to be thrown away in the near future.

@kamalmarhubi

Another ping here. Could any other maintainers route this to the right person?

@PMDubuc

PMDubuc commented Aug 2, 2018

+1 We're using Alerta to manage alerts from Alertmanager through a webhook. There's a plugin that lets us silence alerts in Alertmanager when they are acknowledged in Alerta. When a silenced alert clears, no resolved notification is sent to Alerta, so the alert still looks unresolved there. If the problem recurs and it's still acknowledged in Alerta, it's automatically silenced in Alertmanager, because the silence for the alert remains active in Alertmanager, preventing alerts from being sent to other destinations. I think that if send_resolved is true in the webhook_configs, the notification should be sent even for silenced alerts, and the silence should be expired in Alertmanager.
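For context, a minimal sketch of the relevant receiver configuration, assuming a webhook pointed at Alerta (the URL is a placeholder); today the resolved notification is dropped while a matching silence is active even with send_resolved enabled:

  receivers:
    - name: alerta
      webhook_configs:
        - url: http://alerta.example.com/api/webhooks/prometheus  # placeholder Alerta endpoint
          send_resolved: true  # currently has no effect while a matching silence is active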

@ezraroi

ezraroi commented Dec 26, 2018

Any comment from the maintainers?

@juliusv
Member

juliusv commented Dec 26, 2018

I am not an Alertmanager maintainer, but the semantic meaning of a silence is that it stops notifications; a silence does not indicate that the matched alerts are resolved (on the contrary, oftentimes you still want to see them as unresolved, active alerts on the Alertmanager dashboard; they should just not notify anyone anymore). I'm not sure how to best marry that concept to platforms like Alerta, but I just wanted to provide background information to explain why this would be conceptually problematic.

@PMDubuc

PMDubuc commented Dec 26, 2018

I am not an Alertmanager maintainer, but the semantic meaning of a silence is that it stops notifications; a silence does not indicate that the matched alerts are resolved (on the contrary, oftentimes you still want to see them as unresolved, active alerts on the Alertmanager dashboard; they should just not notify anyone anymore). I'm not sure how to best marry that concept to platforms like Alerta, but I just wanted to provide background information to explain why this would be conceptually problematic.

Yes, but there is no problem with silencing unresolved alerts. It's when they are resolved that silence becomes a problem.

@ezraroi

ezraroi commented Dec 26, 2018

Exactly. If we still get resolved notifications, everything will be fine.

@juliusv
Member

juliusv commented Dec 27, 2018

@PMDubuc @ezraroi I see, yes, that is different from what the initial issue description said: "Alerts should be marked as resolved as they are silenced so a respective resolved notification is sent."... but then the discussion drifted toward sending resolved notifications when silenced alerts are actually resolved, which can make sense. In that case though, what do we do with cases where even a resolved notification pages or bugs someone, and they would be annoyed if a silence still allowed resolved notifications? Sounds like either we'd live with that possibility or introduce yet another option.

@PMDubuc

PMDubuc commented Dec 27, 2018

@PMDubuc @ezraroi I see, yes, that is different from what the initial issue description said: "Alerts should be marked as resolved as they are silenced so a respective resolved notification is sent."... but then the discussion drifted toward sending resolved notifications when silenced alerts are actually resolved, which can make sense. In that case though, what do we do with cases where even a resolved notification pages or bugs someone, and they would be annoyed if a silence still allowed resolved notifications? Sounds like either we'd live with that possibility or introduce yet another option.

Unresolved alerts are silenced because otherwise they are sent repeatedly. Repeated alerts can be annoying while the problem is being resolved or while resolution needs to be delayed. On the other hand, resolved notifications are only sent once. It's nice to know when the monitor sees that the problem is resolved. The behavior I would like is the way problem acknowledgements are handled in Nagios.

@satterly

@juliusv If the webhook payload includes whether or not the alert has been silenced when sending resolved notifications then it could be up to the receiver how to handle the different scenarios you describe.

@PMDubuc

The behavior I would like is the way problem acknowledgements are handled in Nagios.

Can you provide a link or briefly describe how Nagios handles this?

@juliusv
Member

juliusv commented Dec 27, 2018

@PMDubuc

Unresolved alerts are silenced because otherwise they are sent repeatedly.

Systems like PagerDuty handle this by grouping all alerts with the same incident key (= alert grouping labels hash) without notifying about each subsequent one. This is kind of how Alertmanager expects receivers to behave. Or if they can't, then set the repeat_interval really high, so you get fewer (or practically no) repetitions.
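As a rough illustration of that last suggestion (the values and receiver name are arbitrary), the route would simply set a very long repeat_interval:

  route:
    receiver: team-pagerduty
    group_by: [alertname, cluster]
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 7d  # very high: practically no repeat notifications for still-firing groups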

@satterly

If the webhook payload includes whether or not the alert has been silenced when sending resolved notifications then it could be up to the receiver how to handle the different scenarios you describe.

For the webhook that'd be an option. How would that be handled for all the other more specific receiver types though? Sending resolved notifications for silenced alerts on the webhook receiver only would seem inconsistent.

@PMDubuc

PMDubuc commented Dec 27, 2018

@PMDubuc

Unresolved alerts are silenced because otherwise they are sent repeatedly.

Systems like PagerDuty handle this by grouping all alerts with the same incident key (= alert grouping labels hash) without notifying about each subsequent one. This is kind of how Alertmanager expects receivers to behave. Or if they can't, then set the repeat_interval really high, so you get fewer (or practically no) repetitions.

Well, this is fine. Alerta handles this case too, but I don't think it can be expected of all receivers, like email or others that may have a proprietary interface. If repeated notifications stop when they have not been silenced, they can be expired by the receiver. So "silencing" the alert is a form of acknowledgement that there is a problem and someone knows about it and is working on it. But when the problem clears, how are Alertmanager receivers supposed to be notified if the resolved notification is not sent? I explained this problem in my Aug. 2nd comment above. I don't understand why silencing alerts also applies to resolved notifications when send_resolved is true. I would think this would also be a problem for other receivers like PagerDuty. If no notification is sent when a problem is resolved, receivers can't update their status for the problem.

@satterly
The way Nagios handles notifications for problems that are acknowledged is to silence the active problem notifications. When the problem clears, an OK status notification is sent and the acknowledgement is automatically removed. I think Alertmanager should do the same thing with silences, since they are also a form of acknowledgement of an active problem. When the problem clears, the silence should be cleared also; it makes no sense to keep silencing notifications about a resolved problem. A resolved notification should be sent (unless send_resolved is false).

@brian-brazil
Contributor

I would expect that anything stopping the alert from firing would result in the same effect, whether that be the alert resolving, a silence, or an inhibition.

@PMDubuc

PMDubuc commented Dec 27, 2018

I would expect that anything stopping the alert from firing would result in the same effect, whether that be the alert resolving, a silence, or an inhibition.

As I have tried to explain, I think having an exception to this for resolved notifications makes sense, and the lack of this exception presents real problems for receivers. Also, when silences persist after a problem has been resolved, they can prevent a new instance of the problem from being detected.

@juliusv
Member

juliusv commented Dec 27, 2018

@PMDubuc

Also when silences persist after a problem has been resolved, it can prevent a new instance of the problem from being detected.

Silences can cover arbitrary label combinations and do not have to correspond exactly to the alerts in an alert grouping. So you'd have a hard time finding a silence exactly matching the alert group in which all alerts just got resolved. Secondly, silences are frequently used to suppress notifications about flapping alerts, or in maintenance situations where stuff can be going up and down for a while, and you wouldn't want a resolved alert to remove a matching silence there either.

But yeah, maybe resolve notification behavior should be changed.

@ezraroi

ezraroi commented Dec 30, 2018

Maybe we could have a property in the config, similar to send_resolved for the webhook, that sends information about silences of alerts. This would ease the integration of other platforms with Alertmanager.
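Something along these lines, say, where send_silenced is a hypothetical option that does not exist in Alertmanager today (the name and placement are just for illustration):

  receivers:
    - name: status-board
      webhook_configs:
        - url: http://status-board.example.com/hooks/alertmanager  # placeholder
          send_resolved: true
          # Hypothetical, not implemented: also notify the webhook when matching alerts
          # are silenced and when silenced alerts resolve.
          send_silenced: true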

@brian-brazil
Contributor

I don't see this as something that should be configurable, and send_resolved already causes enough implementation problems.

@ezraroi

ezraroi commented Dec 31, 2018

OK, so @brian-brazil, my suggestion is:

  1. When an alert is silenced, send a silenced event to the webhook and stop sending firing events as long as it is silenced.
  2. When an alert is resolved, send a resolved event to the webhook regardless of the silence status.

I think this stays consistent and allows platforms to have all the information they need. What do you say?

@brian-brazil
Contributor

And what if the silence expires and the same alert is still firing? Is another notification for “firing” going to be sent in this case?

Yes, at the next group_interval.

What would the notification look like? (with the original time stamp or treated like another new alert?)

Notifications don't have a timestamp, so I presume you're talking about the alert start time, which is an implementation detail you shouldn't depend on. It might be the same value as in previous notifications, or it might be a different one - same as always.

this pseudo resolution would wrongly tell downstream systems that the alert is resolved when it is potentially not.

I think you might be mixing up definitions of resolved. Resolved for the alertmanager is that the alert does not send firing notifications, it doesn't tell us anything about whether the underlying issue is resolved as that'd require a human to determine. See https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped

@aranair

aranair commented Nov 12, 2020

I think you might be mixing up definitions of resolved. Resolved for the alertmanager is that the alert does not send firing notifications, it doesn't tell us anything about whether the underlying issue is resolved as that'd require a human to determine. See https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped

Hmm, I get where you’re coming from with your note about the underlying issue, and I do agree with that post wholeheartedly (even if the current situation may suggest otherwise, heh).

But I think I should clear up what I meant: “when this new pseudo-resolved notification is sent, the (Prometheus) alert is still firing.” (i.e. I was not referring to the real-world underlying issue)

With this new pseudo resolution, we basically could end up with different states between:

  • Prometheus rule/alert, firing
  • Alertmanager, tells everyone downstream that it’s resolved when it’s really just silenced (but actually knows it’s still firing)
  • Downstream receives a resolution; resolves the incident (again, not in any way implying humans stop looking) - while still having a firing Prometheus alert.

It would probably be okay for downstream receivers in charge of paging, but for historical / alert-recording receivers, I think this could be inaccurate and misleading. If it is the intention for AM to restrict the purpose of webhook receivers to pagers, I think it may be okay, but I don’t think that is the case, right?

(Side note: Your solution does solve my very specific issue of having orphaned unresolved incidents when alerts resolve during silences)

@brian-brazil
Contributor

If it is the intention for AM to restrict the purpose of webhook receivers to pagers, I think it may be okay but I don’t think that is the case, right?

That's not the intention - it's meant to cover anything. I'd personally never send resolved alerts to humans in any case; they're only a distraction from potential firing notifications.

It would probably be okay for downstream receivers in charge of paging, but for historical / alert-recording receivers, I think this could be inaccurate and misleading.

That's a general problem with treating resolved notifications as meaning the alert is resolved, which is already incorrect today due to inhibitions and how group_interval works. If you're trying to get a 100% complete log of firing alerts (as distinct from notifications), the webhook is not a good way to do that.

Your solution does solve my very specific issue of having orphaned unresolved incidents when alerts resolve during silences

That's the basic problem I see here: we're inconsistent. The notification should have happened after the silence was put in.

@PMDubuc

PMDubuc commented Nov 13, 2020

What if it's changed so a silence can optionally be expired when the issue is resolved instead of after a set time? In that case the resolved notification is sent; otherwise it's not sent until the silence expires on its own (or not sent at all when the silence expires on its own, if that's the way it works now). Would that cover each scenario in a consistent way that also avoids confusion?

@aranair

aranair commented Nov 18, 2020

That's a general problem with treating resolved notifications as meaning the alert is resolved, which is already incorrect today due to inhibitions and how group_interval works. If you're trying to get a 100% complete log of firing alerts (as distinct from notifications), the webhook is not a good way to do that.

Do you have any suggestions as to how that (complete log of firing alerts, not notifs) might be achieved? Silences/inhibits can be disabled altogether, but then I think there would be no real way to ignore certain alerts (not notifs), except maybe to build that logic downstream.

@isavcic

isavcic commented Mar 26, 2021

Silence should be tied to the current state of the alert and not to the alert, IMO.

In several alerting platforms, "ack" means "don't alert me on this until the state changes" and that is something, I believe, a majority of people on this issue want to achieve. The only case when this is backfiring is when the alert state flaps, but, as @fabxc pointed out years ago, "frequent state change of an alert can be prevented by using a fitting alerting expression", so that's a non-issue.

@brian-brazil by not at least providing an option to remove a silence on state change, you're essentially forcing your view of how monitoring should be done on a lot of users and other monitoring platform authors who don't necessarily agree with you and/or have long-established, tried-and-true platforms which can't easily integrate with Alertmanager just because of your inflexible stance on this. I mean, just look at the guy who runs two Alertmanagers serially in order to fix his problem...

@jutley

jutley commented Sep 3, 2021

Silence should be tied to the current state of the alert and not to the alert, IMO.

I think this is getting at the core of the issue. My team consistently finds ourselves having to manually resolve alerts that are no longer firing simply because Alertmanager's silences are way too simple. In our case, we would like silences to prevent notifications when new alerts come up, but allow notifications when alerts transition to resolved. If silences were augmented to allow us (either the silence creator or Alertmanager operator) to select specific state transitions, this would completely solve the problem for us.

I cannot count the number of times I have had to explain to our engineers why they are still being alerted on something that has resolved. After the explanation, there is always a shared consensus that Alertmanager is doing it wrong.

@tom0010

tom0010 commented Sep 24, 2021

I actually built a small workaround for this for the new Grafana Alertmanager with a webserver:

grafana/grafana#39615

But I agree with @isavcic: it should be up to the user to decide how the alerts are managed.
And it seems that a lot of people want this feature.

@andrew-phi

Having automatic "end-by-resolve" acks/silences would be a most desirable feature indeed! Also, is there any way of adding a notification at the end of a silence period? Because, for example, some very strange or serious alerts start firing, people panic, and an investigation starts, only to find out it's an expired silence from a month ago.

@tulequ

tulequ commented Oct 11, 2021

My workaround: use a custom webhook server to send alerts and handle silences.

@jutley

jutley commented Nov 8, 2021

@tulequ Would you mind sharing a little more detail about your workaround? Since in this scenario the alert is silenced, I would imagine that your webhook is never called from Alertmanager.

@hw4liu

hw4liu commented Jan 5, 2022

I have an idea: #2811

@sthaha

sthaha commented Feb 10, 2022

Hi Everyone,

This topic has been long-standing and, as @roidelapluie mentioned earlier, the original request has digressed into multiple feature requests. However, I was wondering if we could reach a consensus on a subset of the requirements (excuse my naivety :-), I am new to this community).

What I could gather from all of the discussions above is that most of us seem to agree/wish that Alertmanager should send a resolved notification for alerts that are inhibited.

A production use case is documented in #2754, which demonstrates an issue we commonly face.

Use case

In the flow below, there is a theoretical inhibition rule that says

when alert2 is firing, alert1 should be suppressed

NOTE: Prom = Prometheus | AM = Alertmanager | PD = PagerDuty

  1. Prom: alert1 fires
    - AM gets alert1, routes to PD
    - PD receives alert1

  2. Prom: alert2 fires
    - AM gets alert2
    - AM inhibits alert1
    - AM routes alert2 to PD
    - PD receives alert2
    - PD now has two alerts

    • alert1
    • alert2
  3. Prom: alert1 resolves
    - AM resolves alert1
    - PD receives no notification as the alert is suppressed
    - PD: alert1 becomes orphaned

  4. Prom: alert2 resolves
    - AM resolves alert2
    - PD receives resolved notification for alert2
    - PD resolves alert2
    - PD retains alert1 that is now orphaned

I wonder if the maintainers (@simonpasquier, @w0rm, @roidelapluie) are open to allowing this use case to be addressed by adding a config option such as notify_resolved_for_inhibitted_alerts which, if enabled, would send the resolved notification, thereby preserving the current behaviour as the default.
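For illustration only, the inhibition from the use case and the proposed option might be sketched as follows; the equal label and the per-receiver placement of the (hypothetical, not yet existing) flag are assumptions:

  inhibit_rules:
    - source_matchers:
        - alertname = "alert2"
      target_matchers:
        - alertname = "alert1"
      equal: [cluster]  # assumed common label

  receivers:
    - name: team-pagerduty
      pagerduty_configs:
        - routing_key: <secret>
          send_resolved: true
          # Hypothetical flag as proposed above; not an existing Alertmanager option.
          notify_resolved_for_inhibitted_alerts: true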

@matejzero

I think that is one case; another one is "Nagios-like" alerting, where Alertmanager would send resolved notifications even if an alert is silenced. At the moment, alerts that are silenced and resolved within the silence timeframe don't get resolved in PagerDuty, and you get orphaned alerts.

@sgametrio

Friendly bump for the PR #3034 :)

Is there anyone around that could review/suggest a different approach? It would be SO helpful to us too! 🙏

@typeid

typeid commented Aug 3, 2023

We have two setups and are lacking the ability to send resolved notifications for inhibited alerts in one of them:

Decentralized observability (one prometheus & alertmanager per k8s cluster)

  • Prometheus and alertmanager running on a k8s cluster
  • PagerDuty as receiver for the alerts

To silence a cluster, we are using a custom operator that deletes the Alertmanager's receiver (the PagerDuty service). After the silence is removed, the service is re-created.

Centralized observability (one alertmanager and centralized prometheus for the whole "fleet" of clusters)

  • Each k8s cluster is remote writing its metrics to the centralized prometheus
  • Centralized prometheus metrics are used for centralized alertmanager alerting

To silence a cluster, we create an "inhibition alert", effectively inhibiting all alerts matching the cluster's identifier. This inhibition alert is filtered out in PagerDuty by severity so that it does not page.
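As a rough sketch of that inhibition rule (the alert name and the cluster label are assumptions):

  inhibit_rules:
    # The cluster-wide "inhibition alert" suppresses every other alert that carries
    # the same cluster identifier label.
    - source_matchers:
        - alertname = "ClusterSilenced"   # assumed name of the inhibition alert
      target_matchers:
        - alertname != "ClusterSilenced"
      equal: [cluster_id]                 # assumed cluster identifier label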

Our issue with this is that alerts created before the inhibition are still present in PagerDuty, causing confusion and clutter.

@tulequ Would you mind sharing a little more detail about your workaround? Since in this scenario the alert is silenced, I would imagine that your webhook is never called from Alertmanager.

I believe we are currently using a similar setup to get rid of alerts created in PagerDuty before the inhibition: we are using a PagerDuty webhook that triggers when an "inhibition alert" is received. This webhook makes an API call to one of our services to resolve all active alerts that would be inhibited. Without the feature requested here, it seems like adding your own automation to clean up after Alertmanager is necessary to deal with the problem.

@uberspot

Friendly bump for the PR #3034 :)

Is there anyone around that could review/suggest a different approach? It would be SO helpful to us too! 🙏

Could someone review this PR for a potential fix? 🙏 We're running into the same issue, which is fixed there.
