Ability to fire actions when an alert instance is resolved #49405
Comments
Pinging @elastic/kibana-stack-services (Team:Stack Services) |
Seems like the easiest thing to do is to handle it as a "built-in" action group. An alert executor wouldn't have to do anything explicit. When an instanceId which had scheduled actions on its previous turn does not schedule actions on its current turn, we would treat this instanceId as resolved, and schedule the actions for the "resolved" action group. Every alert would have this action group available, for free. Raises the following questions:
|
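The rule described in the comment above (an instanceId that scheduled actions on its previous turn but not on its current turn is treated as resolved) amounts to a set difference. A minimal sketch, assuming instance IDs are plain strings; `findResolvedInstances` is a hypothetical name for illustration, not part of the alerting plugin:

```typescript
// Hypothetical helper, not the Kibana alerting API.
// Returns the instance IDs that scheduled actions on the previous run
// but did not schedule actions on the current run.
function findResolvedInstances(
  previouslyActive: readonly string[],
  currentlyActive: readonly string[]
): string[] {
  const stillActive = new Set(currentlyActive);
  return previouslyActive.filter((id) => !stillActive.has(id));
}

// Usage: "host-a" and "host-c" stopped scheduling actions, so they would
// be the instances that get the "resolved" action group.
const resolvedIds = findResolvedInstances(
  ['host-a', 'host-b', 'host-c'],
  ['host-b', 'host-d']
);
```

Because the comparison only needs the two lists of IDs, the alert executor itself would not have to change under this approach.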
Stack Monitoring is doing this right now, but it's not the solution we want long term. We are detecting two separate scenarios and firing actions for both: We call Ideally, we don't have to worry about this and the alerting solution provides this for us. It feels like something every solution will want, as it helps close the loop on fixing the underlying issue causing the alert. |
@chrisronline Did y'all figure out how to handle recovery alerts when a renotify period is specified? I'm trying to implement these for Metrics right now, and I can only get it to work without a renotify period. When I do specify one:
I'm not sure how to get around this without full support for multiple action groups, which I believe is tracked in #64077 |
Spoke to @chrisronline and apparently they don't have a solution. Right now a single action group has to do double-duty for both Fired and Recovered alerts, and the alerts can get swallowed if throttling is turned on. We definitely need core support from the plugin for this. |
Thanks for the input @Zacqary. We're aiming to have a built-in solution in 7.9, stay tuned! |
Any update on this? Still have 7.9 work in Metrics UI blocked by this so I'm wondering if y'all have a timetable |
@Zacqary The best update I have is aiming for late 7.9 or 7.10 at this point. |
We have received lots of requests for this from Uptime alert users; I appreciate this being prioritised. |
Any update on this issue? Did it make 7.9? Or will it be 7.10? Or later? |
Hi @justinfiore, this is planned to be part of the 7.11 release and to be worked on starting soon. Stay tuned! This issue will rely on the UI work done in #64077 and add a "resolved" group to all types of alerts. |
I'm wondering if some types of alerts would like to opt-out of this functionality if ever it doesn't make sense to them? Example: cc @YulNaumenko |
response from some questions I asked here #49405 (comment), as it relates to an initial implementation:
I think, no, for now. I think we could prevent this, but if we can't, that's fine for now as well.
Not now. And a response to a question from @mikecote in #49405 (comment)
I raised a question in PR #82645 (comment), regarding what events should be generated when an alert "switches" action groups while it's active. One option, is that it might "resolve" with the old group and then new-instance on the new group. I think maybe we need to answer that before wondering whether opt-out is something to consider. But in any case, it would be opt-out at the alert type, you're thinking, right? So seems like it would be easy to add this later if we needed to. |
@YulNaumenko and I were just chatting about what "context" variables would be available for the resolved action group. Answer (today): none (I think). I hope we can make all the other variables available, except the kibana/x-pack/plugins/alerts/server/task_runner/transform_action_params.ts Lines 45 to 56 in 26f79a6
I'd hate to have to do more GETs to get the variable values for the bits listed there (including the In theory, we could probably arrange for the alert executors to make the context variables available in such a way that we could get them for the resolved action execution, but it would mean a change to the way the context is set in the alert type, so I'm hesitant to make a change like that now. Eg, a new API in the |
That's correct and agree it's something we can handle later. Looks like we can discuss it in #82792 🙂
I think I'm +1 on skipping |
I have a Draft PR where I introduced the solution without the possibility to use the alert execution |
Yeah, I'm fine with this approach for now, don't see how we can get the context vars in there. In terms of the UI, we don't actually have anything that prevents people from entering invalid variable names now, right? But we do have the list of variables that pops up, that can be inserted via selecting them. Can we have the UI identify that it's editing an action in the "resolved" group, and not list those variables in the pop-up? Absolute worst-case, since I think the plan is to somehow provide a link to doc for the available mustache variables per alert, we could note in that doc that the
The geo-tracking alert? My understanding is that it will NEVER be in a resolved state, it would always be in an "inside" or "outside" action group. But they could perhaps change the design so that the customer selects if they want to run actions on "inside" or "outside", and when the alert "crosses the line" from the customer selected "side" to the other one, we'd fire the resolved group. I'm not seeing any issues with this alert regarding the resolved action group, but maybe I'm missing something ... |
@arisonl To auto-close a PD incident when an alert is resolved, do users need to do something in the rule configuration? If not, great. If so, is that documented? |
@mukeshelastic users will need to do something in the rule configuration in order to make PD incidents close when an alert recovers. We've made the steps a few less clicks with this PR where the user just needs to add another PD action that "Runs When: Recovered" and the form settings will be populated with default values that resolve the incident. The PR has a |
@mukeshelastic As Mike mentions, users will need to specify on the UI that they want the PD incident to be resolved when the alert recovers. This will be documented but we also plan to go beyond documentation by releasing additional "how-to" content around actions and integrations. |
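For context on what a "Recovered" PagerDuty action ultimately has to send: in the public PagerDuty Events API v2, an incident is closed by an event whose `event_action` is `resolve` and whose `dedup_key` matches the one used when the incident was triggered. The sketch below only builds that request body. The helper names and the key format are assumptions for illustration, and how the Kibana connector maps its form fields onto these values is not shown in this thread.

```typescript
// Body of a PagerDuty Events API v2 "resolve" event.
interface PagerDutyResolveEvent {
  routing_key: string; // integration key of the PagerDuty service
  event_action: 'resolve';
  dedup_key: string; // must match the key sent with the "trigger" event
}

// Hypothetical key format: any value works as long as the trigger action
// and the recovered action derive the same key for the same instance.
function dedupKeyFor(alertId: string, instanceId: string): string {
  return `${alertId}:${instanceId}`;
}

// Hypothetical helper that builds the resolve event body (no network call).
function buildResolveEvent(routingKey: string, dedupKey: string): PagerDutyResolveEvent {
  if (dedupKey.length === 0) {
    throw new Error('A dedup_key is required to resolve an existing incident');
  }
  return { routing_key: routingKey, event_action: 'resolve', dedup_key: dedupKey };
}

const resolveBody = buildResolveEvent('example-routing-key', dedupKeyFor('alert-1', 'host-a'));
```

The important design point is the shared key: if the action that opens the incident and the action that runs on recovery compute different keys, the resolve event will not match the open incident.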
There have been a lot of requests from users wanting to be notified whenever an alert instance is no longer considered "active". The determination of this is done whenever the event log logs a resolved-instance event here.
This issue should add a static action group called "resolved" that is available to all alert types and allows alert actions to attach themselves to it. This action group should fire using the same logic used to log resolved-instance and should leverage the UI work done in #64077 to add a "Resolved" option to the group dropdown.
NOTE: Some UI work will be blocked until #64077 is finished; the server side work can be done in parallel.
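One way to picture a static action group that is available to all alert types is to append it when an alert type is registered. This is a minimal sketch under that assumption; the types and the `withResolvedGroup` function are hypothetical and are not the alerting plugin's actual registration code.

```typescript
// Simplified, hypothetical shapes for illustration only.
interface ActionGroup {
  id: string;
  name: string;
}

interface AlertTypeDefinition {
  id: string;
  actionGroups: ActionGroup[];
}

// The built-in group every alert type would receive.
const RESOLVED_GROUP: ActionGroup = { id: 'resolved', name: 'Resolved' };

// Returns a copy of the alert type with the built-in group appended.
// Rejects alert types that already define a group with the reserved id.
function withResolvedGroup(alertType: AlertTypeDefinition): AlertTypeDefinition {
  if (alertType.actionGroups.some((group) => group.id === RESOLVED_GROUP.id)) {
    throw new Error(
      `Alert type "${alertType.id}" cannot define the reserved action group "${RESOLVED_GROUP.id}"`
    );
  }
  return { ...alertType, actionGroups: [...alertType.actionGroups, RESOLVED_GROUP] };
}

const registered = withResolvedGroup({
  id: 'example.threshold',
  actionGroups: [{ id: 'default', name: 'Default' }],
});
```

Reserving the group id at registration keeps alert types from accidentally shadowing the built-in group, and the group dropdown can then list "Resolved" for every alert type without per-type changes.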
Steps:
Original description
One option is to create some sort of "resolved" action group. Alert actions can assign themselves to this new action group. The new action group would fire when an alert instance stops firing (considered resolved when nothing to alert on anymore).
Another option may be to give the alert type executor the tools it needs to manually mark alert instances as resolved.
Note for docs
User guides will need to be updated for this new feature. Maybe the PagerDuty connector docs should also be updated to indicate how to "resolve" a PagerDuty incident when the alert has recovered.