Send resolved notification for silenced alerts #226
Comments
This would be very useful for teams. Hope to see this feature soon. |
You could implement this by sending an acknowledge notification instead of a resolved notification? |
I think I have a slightly different use case. We often have 3 levels of alerts: info, warning, critical. At critical, PagerDuty gets notified. All alert levels send notifications to HipChat. I often silence an alert when I see it's at the warning level so that if it goes critical it doesn't notify PagerDuty. But then I miss the resolved HipChat message, so I have to keep checking the status of the alert manually. I don't really want to silence resolves. I wouldn't mind a different message going to HipChat letting everyone know I silenced it; that would be useful info. But I wouldn't want a message going to HipChat stating the alert has been resolved when I silence it, that would be misleading. |
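A minimal sketch of the routing this comment describes, assuming a `severity` label; receiver names, room IDs, and keys are illustrative placeholders (and `hipchat_configs` only exists in the Alertmanager versions of that era), not the commenter's actual config:

```yaml
# Sketch: everything goes to chat, only severity=critical additionally pages.
route:
  receiver: team-chat              # default: info/warning/critical all reach chat
  routes:
    - match:
        severity: critical
      receiver: team-pager         # critical additionally notifies PagerDuty
      continue: true               # keep matching so the catch-all below also fires
    - receiver: team-chat          # explicit catch-all so critical still reaches chat

receivers:
  - name: team-chat
    hipchat_configs:
      - room_id: "1234"
        auth_token: <token>
        send_resolved: true
  - name: team-pager
    pagerduty_configs:
      - service_key: <integration-key>
```

Silencing the alert at warning level then suppresses everything for that alert, including the later resolved message to chat, which is the gap described above.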
+1. We use our own status board to notify external users if any problems occur in our infrastructure. We use webhooks to create events there, so when a problem occurs we want external users to see it on the status board, but sometimes we don't want to get a bunch of email (Slack, etc.) notifications since we already know about the problem, so we silence the alert. But because no resolve webhook is sent while the alert is silenced, the problem still shows as present on our status board when it's actually resolved. That causes confusion, and it would be nice if silenced alerts would still report the resolved state. |
This would be incredibly useful.
Is there another issue or something with an explanation of what you mean here? |
@fabxc could you elaborate a bit? I'm willing to put half a day or so into this, as it'd be a great improvement to the Prometheus + PagerDuty combo. The implementation looks fairly straightforward except for the uncertainty from your note above. |
@fabxc ping on the above question: how do you see this interact with precomputed silences? I would like to work on this, but would like to avoid doing something that has to be thrown away in the near future. |
Another ping here. Any other maintainers, could you route this to the right person? |
+1 We're using Alerta to manage alerts from Alertmanager through a webhook. There's a plugin that lets us silence alerts in Alertmanager when they are acknowledged in Alerta. When a silenced alert clears, no resolved notification is sent to Alerta, so the alert still looks unresolved. If the problem recurs and it's still acknowledged in Alerta, it's automatically silenced in Alertmanager, because the silence for the alert remains active in Alertmanager, preventing alerts from being sent to other destinations. I think that if send_resolved is true in the webhook_configs, the notification should be sent even for silenced alerts and the silence should be expired in Alertmanager. |
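For reference, a sketch of the kind of receiver being described, with a placeholder Alerta URL; the request in this issue is that `send_resolved: true` be honoured even for alerts that were silenced when they cleared:

```yaml
# Sketch only; the URL is a placeholder for an Alerta webhook endpoint.
receivers:
  - name: alerta
    webhook_configs:
      - url: http://alerta.example.com/api/webhooks/prometheus
        send_resolved: true   # the ask: send the resolved payload even if the alert was silenced
```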
any comment from the maintainers? |
I am not an Alertmanager maintainer, but the semantic meaning of a silence is that they stop notifications, but a silence does not indicate that the matched alerts are resolved (on the contrary, often times you still want to see them as unresolved, active alerts on the Alertmanager dashboard, they should just not notify anyone anymore). I'm not sure how to best marry that concept to platforms like Alerta, but just wanted to provide background information to explain why this would be conceptually problematic. |
Yes, but there is no problem with silencing unresolved alerts. It's when they are resolved that silence becomes a problem. |
Exactly, if we still get resolved notifications everything will be fine. |
@PMDubuc @ezraroi I see, yes, that is different from what the initial issue description said: "Alerts should be marked as resolved as they are silenced so a respective resolved notification is sent."... but then the discussion drifted toward sending resolved notifications when silenced alerts are actually resolved, which can make sense. In that case though, what do we do with cases where even a resolved notification pages or bugs someone, and they would be annoyed if a silence still allowed resolved notifications? Sounds like either we'd live with that possibility or introduce yet another option. |
Unresolved alerts are silenced because otherwise they are sent repeatedly. Repeated alerts can be annoying while the problem is being resolved or resolution needs to be delayed. On the other hand, resolved notifications are only sent once. It's nice to know when the monitor sees the problem is resolved. The behavior I would like is the way problem acknowledgements are handled in Nagios. |
@juliusv If the webhook payload includes whether or not the alert has been silenced when sending resolved notifications, then it could be up to the receiver how to handle the different scenarios you describe. @PMDubuc Can you provide a link or describe briefly how Nagios handles this? |
Systems like PagerDuty handle this by grouping all alerts with the same incident key (= alert grouping labels hash) without notifying about each subsequent one. This is kind of how Alertmanager expects receivers to behave. Or if they can't, then set the
For the webhook that'd be an option. How would that be handled for all the other more specific receiver types though? Sending resolved notifications for silenced alerts on the webhook receiver only would seem inconsistent. |
Well, this is fine. Alerta handles this case too, but I don't think it can be expected of all receivers, like email or others that may have a proprietary interface. If repeated notifications stop when they have not been silenced, they can be expired by the receiver. So "silencing" the alert is a form of acknowledgement that there is a problem and someone knows about it and is working on it. But when the problem clears, how are Alertmanager receivers supposed to be notified if the resolved notification is not sent? I explained this problem in my Aug. 2nd comment above. I don't understand why silencing alerts also applies to resolved notifications when @satterly |
I would expect that anything stopping the alert from firing would result in the same effect, whether that be the alert resolving, a silence, or an inhibition. |
As I have tried to explain, I think having an exception to this for resolved notifications makes sense and the lack of this exception presents real problems for receivers. Also when silences persist after a problem has been resolved, it can prevent a new instance of the problem from being detected. |
Silences can cover arbitrary label combinations and do not have to correspond exactly to the alerts in an alert grouping. So you'd have a hard time finding a silence exactly matching the alert group in which all alerts just got resolved. Secondly, silences are frequently used to suppress notifications about flapping alerts, or in maintenance situations where stuff can be going up and down for a while, and you wouldn't want a resolved alert to remove a matching silence there either. But yeah, maybe the resolve notification behavior should be changed. |
maybe we can have a property in the config, the same as |
I don't see this as something that should be configurable, and send_resolved already causes enough implementation problems. |
OK, so @brian-brazil my suggestion is:
I think this stays consistent and allows platforms to have all the information they need. What do you say? |
Yes, at the next group_interval.
Notifications don't have a timestamp, so I presume you're talking about alert start time which is an implementation detail you shouldn't depend on. It might be the same value as in previous notifications, it might be a different one - same as always.
I think you might be mixing up definitions of resolved. Resolved for the Alertmanager means that the alert does not send firing notifications; it doesn't tell us anything about whether the underlying issue is resolved, as that would require a human to determine. See https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped |
hmm, I get where you’re coming from with your note about the underlying issue, and I do agree with that post wholeheartedly (even if the current situation may suggest otherwise, heh). But I think I should clear up what I meant: “when this new pseudo-resolved notification is sent, the (Prometheus) alert is still firing.” (i.e. I was not referring to the real-world underlying issue) With this new pseudo-resolution, we basically could end up with different states between:
It would probably be okay for downstream receivers in charge of paging, but for historical / alert-recording receivers, I think this could be inaccurate and misleading. If it is the intention for AM to restrict the purpose of webhook receivers to pagers, I think it may be okay, but I don’t think that is the case, right? (Side note: your solution does solve my very specific issue of having orphaned unresolved incidents when alerts resolve during silences) |
That's not the intention - it's meant to cover anything. I'd personally never send resolved alerts to humans in any case, they're only a distraction from potential firing notifications.
That's a general problem with treating resolved notifications as meaning the alert is resolved, which is already incorrect today due to inhibitions and how group_interval works. If you're trying to get a 100% complete log of firing alerts (as distinct from notifications), the webhook is not a good way to do that.
That's the basic problem I see here, we're inconsistent. The notification should have happened after the silence was put in. |
What if it's changed so a silence can optionally be expired when the issue is resolved instead of after a set time? In this case the resolved notification is sent, otherwise it's not sent until the silence expires on its own (or it's not sent at all when the silence expires on its own if that's the way it works now). Would that cover each scenario in a consistent way that also avoids confusion? |
Do you have any suggestions as to how that (complete log of firing alerts, not notifs) might be achieved? Silences/inhibits can be disabled altogether, but then I think there would be no real way to ignore certain alerts (not notifs), except maybe to build that logic downstream. |
Silence should be tied to the current state of the alert and not to the alert, IMO. In several alerting platforms, "ack" means "don't alert me on this until the state changes", and that is something, I believe, a majority of people on this issue want to achieve. The only case where this backfires is when the alert state flaps, but, as @fabxc pointed out years ago, "frequent state change of an alert can be prevented by using a fitting alerting expression", so that's a non-issue. @brian-brazil, by not at least providing an option to remove a silence on state change, you're essentially forcing your view of how monitoring should be done onto a lot of users and other monitoring platform authors who don't necessarily agree with you and/or have long-existing, tried-and-true platforms which can't easily integrate with Alertmanager just because of your inflexible stance on this. I mean, just look at the guy who runs two Alertmanagers serially in order to fix his problem... |
I think this is getting at the core of the issue. My team consistently finds ourselves having to manually resolve alerts that are no longer firing simply because Alertmanager's silences are way too simple. In our case, we would like silences to prevent notifications when new alerts come up, but allow notifications when alerts transition to resolved. If silences were augmented to allow us (either the silence creator or Alertmanager operator) to select specific state transitions, this would completely solve the problem for us. I cannot count the number of times I have had to explain to our engineers why they are still being alerted on something that has resolved. After the explanation, there is always a shared consensus that Alertmanager is doing it wrong. |
I actually built a small workaround for this for the new Grafana Alertmanager with a webserver: But I agree with @isavcic, it should be the user's decision how the alerts are managed. |
Having automatic "end-by-resolve" acks/silences would be a most desirable feature indeed! Also, is there any way of adding a notification at the end of a silence period? Because, for example, some very strange or serious alerts start firing, people are panicking, and an investigation starts, only to find out it's a silence from a month ago that just ended. |
My workaround: use a custom webhook server to send alerts and handle silencing. |
@tulequ Would you mind sharing a little more detail about your workaround? Since in this scenario the alert is silenced, I would imagine that your webhook is never called from Alertmanager. |
I have an idea: #2811 |
Hi everyone, this topic has been long-standing and, like @roidelapluie mentioned earlier, the original request has digressed into multiple feature requests. However, I was wondering if we could reach a consensus on a subset of the requirements (excuse my naivety :-), I am new to this community). What I could gather from all of the discussions above is that most of us seem to agree/wish that the Alertmanager should send a
A production use case is documented in #2754, which demonstrates an issue we face commonly.
Use case: In the below flow, there is a theoretical inhibition that says
I wonder if the maintainers (@simonpasquier, @w0rm, @roidelapluie) are open to allowing this use case to be addressed by adding a config |
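Purely to make the shape of that proposal concrete, a hypothetical sketch of such a per-receiver switch; the option name below is invented for illustration and does not exist in Alertmanager:

```yaml
# HYPOTHETICAL: "send_resolved_when_suppressed" is not a real Alertmanager
# option; it only stands in for whatever name the maintainers might choose.
receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: <integration-key>
        send_resolved: true
        send_resolved_when_suppressed: true   # also send resolved if the alert was silenced/inhibited
```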
I think that is one case; another one is "Nagios-like" alerting, where Alertmanager would send a resolved notification even if an alert is silenced. At the moment, alerts that are silenced and resolved within the silence timeframe don't get resolved in PagerDuty, and you get orphaned alerts. |
Friendly bump for the PR #3034 :) Is there anyone around that could review/suggest a different approach? It would be SO helpful to us too! 🙏 |
We have two setups and are lacking the feature of resolving existing alerts for inhibited alerts in one of them:
Decentralized observability (one Prometheus & Alertmanager per k8s cluster): To silence a cluster, we are using a custom operator that deletes the Alertmanager's receiver (the PagerDuty service). After the silence is removed, the service is re-created.
Centralized observability (one Alertmanager and centralized Prometheus for the whole "fleet" of clusters): To silence a cluster, we create an "inhibition alert", effectively inhibiting all alerts matching the cluster's identifier. This inhibition alert is filtered out in PagerDuty by severity so it does not page. Our issue with this is that alerts created before the inhibition are still present in PagerDuty, causing confusion and clutter.
I believe we are currently using a similar setup to get rid of alerts created on PagerDuty before the inhibition: we are using a PagerDuty webhook triggering when an "inhibition alert" is received. This webhook makes an API call to one of our services to resolve all active alerts that would be inhibited. Without the feature requested here, it seems like adding your own automation to clean up after alertmanager is necessary to deal with the problem. |
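For readers unfamiliar with the "inhibition alert" pattern described above, a minimal sketch using the newer matchers syntax available in recent Alertmanager versions; the alert name `ClusterSilenced` and the `cluster` label are made-up names for illustration, not the actual setup:

```yaml
# Sketch of the inhibition-alert pattern: a manually raised "inhibition alert"
# suppresses every other alert that shares the same cluster label.
inhibit_rules:
  - source_matchers:
      - alertname = ClusterSilenced    # the manually raised "inhibition alert"
    target_matchers:
      - alertname != ClusterSilenced   # suppress every other alert...
    equal:
      - cluster                        # ...from the same cluster
```

As noted above, this suppresses new notifications but does nothing about alerts that already reached PagerDuty before the inhibition, hence the extra automation.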
Could someone review this PR for a potential fix? 🙏 We're running into the same issue, fixed there. |
Alerts should be marked as resolved as they are silenced so a respective resolved notification is sent. This removes the need to manually resolve these in PagerDuty and friends.
We should probably provide information that tells whether it was an actual resolve or a resolve-via-silence.
This adds another dimension to the problem of pre-computing silences (i.e. not silencing at notification time) in a sane way.