[Alerting] Index threshold: Actions not fired as expected #84335
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
Hi @felix-lessoer, I have a couple of questions to help debug the issue:
Check every is set to 1h
Notify every is set to 1h
The Alert Details page showed that there was a new alert instance at about 13:xx, and the duration at that time was only 10 min, so it has created a new instance. But the condition was always true.
The user is only running 1 Kibana alert, but several Watchers.
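For concreteness, the setup described above corresponds to something like the sketch below: an index threshold alert with "Check every" and "Notify every" both set to 1h, created through the 7.9.x alerts HTTP API. The endpoint, param names, and index pattern are from memory of that API and the `.index-threshold` alert type, so treat them as illustrative rather than authoritative.

```ts
// Minimal sketch of the setup described above. Field names (schedule,
// throttle, params.*) follow my recollection of the 7.9.x alerts HTTP API;
// KIBANA_URL and the index pattern are placeholders.
const KIBANA_URL = 'http://localhost:5601';

const alertBody = {
  name: 'always-firing threshold alert',
  alertTypeId: '.index-threshold',
  consumer: 'alerts',
  schedule: { interval: '1h' }, // "Check every" 1h
  throttle: '1h',               // "Notify every" 1h
  params: {
    index: ['metricbeat-*'],    // hypothetical source index (data loaded by Metricbeat)
    timeField: '@timestamp',
    aggType: 'count',
    groupBy: 'all',
    timeWindowSize: 24,         // "FOR THE LAST" 24 hours
    timeWindowUnit: 'h',
    thresholdComparator: '>',
    threshold: [0],             // condition intended to always be true
  },
  actions: [],                  // actions omitted for brevity
  tags: [],
};

async function createAlert(): Promise<void> {
  const res = await fetch(`${KIBANA_URL}/api/alerts/alert`, {
    method: 'POST',
    headers: { 'kbn-xsrf': 'true', 'Content-Type': 'application/json' },
    body: JSON.stringify(alertBody),
  });
  console.log('create alert status:', res.status);
}

createAlert().catch(console.error);
```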
We've merged a fix in 7.10 to display the true time the alert instance was created (PR: #68437). Prior to 7.10, it only showed the time from the previous execution.
It does seem strange that the logs are not 1 hr apart. Do the queries take a while to run (say, longer than 5 minutes)? Are there any logs in Kibana that could help find out the reason for this (either alerting or task manager related)?
7.9.x still has the "zombie" tasks issue, but I'm guessing this wouldn't be a problem with only a single alert in the system - though I don't know how many other tasks we have running - I think it's still under 10. If it were over 10, it could potentially be a zombie task case.
That doesn't seem right, with the "FOR THE LAST" set to 24 hours - lots of overlap with previous runs. Sorry for the late request, but any chance we could get more of the event log records besides the ones already shared? It's possible that it's considered "new" because the event log search didn't go back far enough, but I thought that search would go back by some multiplier on the alert interval (so at least two hours). Also interesting that the execution times seem to be 1 hour + some multiple (0-3) of five minutes; 5 minutes is our error retry, isn't it? I believe errors encountered before the alert execution starts will NOT be logged in the event log until 7.11, but there could be something in the Kibana log about these ...
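As a side note for anyone pulling those records: the alerting event log lives in a system index (`.kibana-event-log-*` in this version range), so a query along the lines of the sketch below should return the execute events for a given alert. The field names (`event.provider`, `event.action`, `kibana.saved_objects.*`) are assumptions based on the 7.x event log schema, and `ALERT_ID` is a placeholder.

```ts
import { Client } from '@elastic/elasticsearch';

// Sketch: fetch recent "execute" event log documents for one alert.
// Field names are assumptions from the 7.x event log schema; ALERT_ID
// is a placeholder for the alert's saved-object id.
const es = new Client({ node: 'http://localhost:9200' });

async function fetchExecuteEvents(alertId: string) {
  const { body } = await es.search({
    index: '.kibana-event-log-*',
    size: 100,
    body: {
      sort: [{ '@timestamp': 'desc' }],
      query: {
        bool: {
          filter: [
            { term: { 'event.provider': 'alerting' } },
            { term: { 'event.action': 'execute' } },
            {
              nested: {
                path: 'kibana.saved_objects',
                query: { term: { 'kibana.saved_objects.id': alertId } },
              },
            },
          ],
        },
      },
    },
  });
  return body.hits.hits.map((hit: any) => hit._source);
}

fetchExecuteEvents('ALERT_ID').then(console.log).catch(console.error);
```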
Thanks. I will try to collect more info.
This is the full event log. There was no other entry.
Good call @pmuellr. After some local investigation with 7.9.3, it looks like an executor failure within the alert will be handled and logged appropriately in the event log, and the alert will reschedule for its next scheduled interval. That doesn't look like what's happening here. The alert will be retried in 5 minutes when there is an error decrypting attributes, fetching an API key, or getting services with user permissions. Unfortunately, it doesn't look like we're doing any logging to narrow down which one of these functions is failing. This is the PR that added the fallback. This is the relevant function: kibana/x-pack/plugins/alerts/server/task_runner/task_runner.ts, lines 295 to 323 at 941c66f.
Do you know if any of these functions could potentially fail once or twice and then succeed?
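For readers without the source open, the shape of that fallback is roughly the following. This is a simplified sketch of the behaviour described above, not the actual task_runner.ts code; the helper names and types are placeholders.

```ts
// Simplified sketch of the fallback described above (not the real
// task_runner.ts). Setup failures before the executor runs are retried in
// 5 minutes with no logging of which step failed; executor failures are
// written to the event log and the normal interval is kept.
type AlertAttributes = { schedule: { interval: string } };

declare function decryptAlertAttributes(alertId: string): Promise<AlertAttributes>;
declare function getApiKey(attrs: AlertAttributes): Promise<string>;
declare function getServicesWithUserPermissions(apiKey: string): Promise<unknown>;
declare function executeAlert(services: unknown, attrs: AlertAttributes): Promise<void>;
declare function writeEventLogFailure(alertId: string, err: unknown): Promise<void>;
declare function nextScheduledRun(attrs: AlertAttributes): Date;

const RETRY_DELAY_MS = 5 * 60 * 1000;

export async function runAlertTask(alertId: string): Promise<{ runAt: Date }> {
  let attrs: AlertAttributes;
  let services: unknown;
  try {
    // Pre-execution setup: decrypt attributes, fetch the API key, build
    // user-scoped services. In 7.9.x a failure here isn't event-logged.
    attrs = await decryptAlertAttributes(alertId);
    const apiKey = await getApiKey(attrs);
    services = await getServicesWithUserPermissions(apiKey);
  } catch (err) {
    // Fallback: retry in 5 minutes, without recording which step failed.
    return { runAt: new Date(Date.now() + RETRY_DELAY_MS) };
  }

  try {
    await executeAlert(services, attrs);
  } catch (err) {
    // Executor failures are handled and logged in the event log.
    await writeEventLogFailure(alertId, err);
  }
  // Either way, reschedule for the next normal interval.
  return { runAt: nextScheduledRun(attrs) };
}
```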
We will be logging these in the event log anyway, in 7.11, via #82401. I wonder if we should log these in the Kibana log as well ...
Getting 502/504s from Kibana + ES requests, from cloud, is pretty common. Presumably a retry could work. I suspect if we look at the sum total of all the requests taking place during alert execution, there are going to be a lot of them. Running with APM instrumented is one way to see them. Anyhoo, I'm wondering if we're hitting some of these during the "inner" requests being made, and there's no retry logic happening - I'm not sure if there's supposed to be or not, TBH. I think the "new" ES client library has built-in support for retries, but I'm not sure what it uses to determine whether something is retry-able - I would hope a 502/504 would be. In this particular case, it's not cloud, but on prem, so I'd think the chances of 502/504s are going to be close to zero. However, it's still possible that we could be overloading something and causing similar sorts of effects. @felix-lessoer did you have a repro of this that we could try on the main branch (what we'll be shipping in 7.11)?
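On the retry question specifically, a hedged note: the newer @elastic/elasticsearch client does expose a maxRetries option, roughly as below. As far as I know its built-in retries cover connection errors and timeouts; whether a 502/504 response from a proxy is retried is exactly the open question, so treat this as illustrative only.

```ts
import { Client } from '@elastic/elasticsearch';

// Illustrative only: maxRetries and requestTimeout are real client options,
// but which failures are considered retry-able by default is the open
// question discussed above.
const client = new Client({
  node: 'http://localhost:9200',
  maxRetries: 3,         // retry transient transport failures a few times
  requestTimeout: 30000, // ms
});

async function checkCluster(): Promise<void> {
  // A call whose transport-level failures would be subject to the client's
  // retry handling.
  const { body } = await client.ping();
  console.log('Elasticsearch reachable:', body);
}

checkCluster().catch(console.error);
```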
Well, the test was: 1.) Loading data with Beats, e.g. Metricbeat
Thanks @felix-lessoer. @mikecote another thing to add to our perf tests is to capture error counts. If the sort of thing Felix is seeing is happening at that relatively slow scale, it seems like the perf tests might also see it (I'd expect even more, actually, due to the sheer number of requests taking place). I don't believe we are tracking any errors from the event log today, in the test runner.
👌 Good point, I've added an item in #40264 to capture this information. |
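A sketch of what capturing those error counts could look like after a perf run, assuming the 7.x event log schema (`event.provider`, `event.action`, `event.outcome`); the index pattern and field names are assumptions and may need adjusting.

```ts
import { Client } from '@elastic/elasticsearch';

// Sketch: count event log failures per action after a perf run.
// Field names assume the 7.x event log schema.
const es = new Client({ node: 'http://localhost:9200' });

async function errorCounts(): Promise<void> {
  const { body } = await es.search({
    index: '.kibana-event-log-*',
    size: 0,
    body: {
      query: { term: { 'event.provider': 'alerting' } },
      aggs: {
        by_action: {
          terms: { field: 'event.action' },
          aggs: {
            failures: { filter: { term: { 'event.outcome': 'failure' } } },
          },
        },
      },
    },
  });
  console.log(JSON.stringify(body.aggregations, null, 2));
}

errorCounts().catch(console.error);
```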
@felix-lessoer Any luck getting the event and Kibana logs for this? |
I asked the customer, but it looks like after upgrading to 7.10.1 the issue does not exist anymore. So we also don't have fresh logs.
Are we good to close this issue then? If it crops up again, we can open a new issue? |
Yes, let's close it.
Kibana version:
7.9.3
Elasticsearch version:
7.9.3
Original install method (e.g. download page, yum, from source, etc.):
On-prem
Describe the bug:
The user has set up an alert that should always fire. But the re-notifying is not consistent, and the alert has also been marked as resolved even though there was no resolution.
Steps to reproduce:
The timing is not correct. Sometimes it takes longer than 1 h to get re-notified.
Expected behavior:
The re-notify happens every hour at the same time.
Screenshots (if relevant):