-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[searchService] Dedupe shard error toasts #131776
Conversation
Pinging @elastic/kibana-app-services (Team:AppServicesSv) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this came up a couple times (the issue in the description has a wrong link i think) but i think this should be solved on the notifications service level and not per specific errors (like shard error in this case) cc @lukasolson
It also seems we'll start to go in the direction of actual consumers handling this errors. They are the only ones who can be smarter about the error that is shown to the user. cc @flash1293
Given how often this came up already and then the discussion got stalled I'm very much in favor of a "quick fix" that at least mitigates the issue while we figure out to solve the bigger problem - depending on app and screen size used this can be very distracting (and even make certain apps impossible to use effectively) |
That would require agreement with core which we haven't made much progress on. I have mixed feelings about that route. It would certainly be nice to fix this at the service level but I'm not completely certain that we want to debounce the same way in all cases. An intermediate step could be creating a wrapper around the notification service which also debounces. It would require a bit more effort but it wouldn't be stuck on getting agreement from the service owner.
I agree thats a better design but its also a much larger project. We'd need a transition plan and careful accounting of consumer apps to make sure errors aren't swallowed. Overall I agree with @ppisljar but I'd like to deliver something for our customers now while we resolve our internal alignment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I'm wondering if we can be a little more intelligent about how we dedupe these error messages instead of only debouncing every 5 seconds.
First of all, we should only be deduping if the error messages themselves are the same error message (this should be inside response.rawResponse
, which is passed to the ShardFailureOpenModalButton
).
Second, we really should only be deduping on each "load" or "refresh" of the data. Perhaps we can reuse sessionId
inside ISearchOptions
to accomplish this (or generate a random uuid if search sessions are disabled). We will have to pass the options
object back to handleResponse
inside SearchSource
(and wherever else handleResponse
is used) since it isn't currently passed to this utility.
@lukasolson Sounds good. I think I've gotten ahead of things with some poor assumptions. I'd like to slow down and make sure this is done correctly. I could use more info on the potential types of shard errors. I'm familiar with the scripting case but nothing else. I'm also wondering if titling toasts with 'x of y shards failed' is useful. It might be useful to surface more specific info, such as 'scripting error'. I have a similar PR for the data view layer. I think it avoids the specifics we're discussing here - #131779 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little confused as to the debouncing here, but the deduping looks good. Just looked at the code so far, haven't actually tested.
@@ -21,33 +23,56 @@ import { | |||
} from '../types'; | |||
import { extractWarnings } from './extract_warnings'; | |||
|
|||
const getDebouncedWarning = () => { | |||
const addWarning = () => debounce(getNotifications().toasts.addWarning, 10000, { leading: true }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that regardless of the deduping, we'll still only show a max of one warning every 10 seconds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The debounce is applied for a given debounceKey
. Different debounceKeys within 10s would result in multiple warnings. The same debounceKey within 10s will result in only one warning, past 10s an additional warning is shown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a much needed improvement, thank you! Played around with it locally with different types of warnings/errors and things seemed to dedupe correctly while still showing errors of different types.
My only suggestion would be to increase the 10s debounce to something like 30s or 1m. I've already run into the situation multiple times where that 10s wasn't enough. Maybe we could also consider adding it to the config, but not totally necessary for this PR.
💚 Build Succeeded
Metrics [docs]Page load bundle
History
To update your PR or re-run it, just comment with: |
* dedupe shard error toasts Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Summary
Shard error toasts can appear in overwhelming numbers if several queries are performed at the same time and break in the same way. Displaying many shard errors doesn't help the user resolve the problem so lets only display the first one for a given sessionId. All errors without a sessionId will be treated as having the same sessionId.
The easiest way I found to reproduce this problem was by modifying a functional test -
x-pack/test/functional/apps/discover/async_scripted_fields.js:70
- add an extra zero to theretryForTime
value.5 of 3
- so the check breaks.Part of #116686
Release note -
Shard failure notifications have been reduced when many queries fail at the same time.