Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[searchService] Dedupe shard error toasts #131776

Merged
merged 31 commits into from
Oct 18, 2022

Conversation

mattkime
Copy link
Contributor

@mattkime mattkime commented May 7, 2022

Summary

Shard error toasts can appear in overwhelming numbers if several queries are performed at the same time and break in the same way. Displaying many shard errors doesn't help the user resolve the problem so lets only display the first one for a given sessionId. All errors without a sessionId will be treated as having the same sessionId.

The easiest way I found to reproduce this problem was by modifying a functional test -

  • x-pack/test/functional/apps/discover/async_scripted_fields.js:70 - add an extra zero to the retryForTime value.
  • line 74 - check for 5 of 3 - so the check breaks.
  • when the test runs, you'll see a shard error toast. On main, if you set the timestamp date back a year you should see a second shard error toast. On this branch you can see exactly the same thing but only get one toast.

Part of #116686

Release note -

Shard failure notifications have been reduced when many queries fail at the same time.

@mattkime mattkime added v8.3.0 Feature:Data Views Data Views code and UI - index patterns before 8.0 Team:AppServicesSv release_note:fix Feature:Search Querying infrastructure in Kibana and removed Feature:Data Views Data Views code and UI - index patterns before 8.0 labels May 8, 2022
@mattkime mattkime changed the title dedupe shard error toasts [searchService] Dedupe shard error toasts May 8, 2022
@mattkime mattkime marked this pull request as ready for review May 9, 2022 21:15
@mattkime mattkime requested a review from a team as a code owner May 9, 2022 21:15
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-app-services (Team:AppServicesSv)

@mattkime mattkime requested a review from sixstringcode May 9, 2022 22:43
Copy link
Member

@ppisljar ppisljar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this came up a couple times (the issue in the description has a wrong link i think) but i think this should be solved on the notifications service level and not per specific errors (like shard error in this case) cc @lukasolson

It also seems we'll start to go in the direction of actual consumers handling this errors. They are the only ones who can be smarter about the error that is shown to the user. cc @flash1293

@flash1293
Copy link
Contributor

Given how often this came up already and then the discussion got stalled I'm very much in favor of a "quick fix" that at least mitigates the issue while we figure out to solve the bigger problem - depending on app and screen size used this can be very distracting (and even make certain apps impossible to use effectively)

@mattkime
Copy link
Contributor Author

@ppisljar

but i think this should be solved on the notifications service level and not per specific errors

That would require agreement with core which we haven't made much progress on. I have mixed feelings about that route. It would certainly be nice to fix this at the service level but I'm not completely certain that we want to debounce the same way in all cases. An intermediate step could be creating a wrapper around the notification service which also debounces. It would require a bit more effort but it wouldn't be stuck on getting agreement from the service owner.

It also seems we'll start to go in the direction of actual consumers handling this errors.

I agree thats a better design but its also a much larger project. We'd need a transition plan and careful accounting of consumer apps to make sure errors aren't swallowed.

Overall I agree with @ppisljar but I'd like to deliver something for our customers now while we resolve our internal alignment.

Copy link
Member

@lukasolson lukasolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm wondering if we can be a little more intelligent about how we dedupe these error messages instead of only debouncing every 5 seconds.

First of all, we should only be deduping if the error messages themselves are the same error message (this should be inside response.rawResponse, which is passed to the ShardFailureOpenModalButton).

Second, we really should only be deduping on each "load" or "refresh" of the data. Perhaps we can reuse sessionId inside ISearchOptions to accomplish this (or generate a random uuid if search sessions are disabled). We will have to pass the options object back to handleResponse inside SearchSource (and wherever else handleResponse is used) since it isn't currently passed to this utility.

@mattkime
Copy link
Contributor Author

mattkime commented May 11, 2022

@lukasolson Sounds good. I think I've gotten ahead of things with some poor assumptions. I'd like to slow down and make sure this is done correctly.

I could use more info on the potential types of shard errors. I'm familiar with the scripting case but nothing else.

I'm also wondering if titling toasts with 'x of y shards failed' is useful. It might be useful to surface more specific info, such as 'scripting error'.


I have a similar PR for the data view layer. I think it avoids the specifics we're discussing here - #131779

@mattkime mattkime added v8.6.0 and removed v8.3.0 labels Oct 11, 2022
Copy link
Member

@lukasolson lukasolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused as to the debouncing here, but the deduping looks good. Just looked at the code so far, haven't actually tested.

@@ -21,33 +23,56 @@ import {
} from '../types';
import { extractWarnings } from './extract_warnings';

const getDebouncedWarning = () => {
const addWarning = () => debounce(getNotifications().toasts.addWarning, 10000, { leading: true });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that regardless of the deduping, we'll still only show a max of one warning every 10 seconds?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The debounce is applied for a given debounceKey. Different debounceKeys within 10s would result in multiple warnings. The same debounceKey within 10s will result in only one warning, past 10s an additional warning is shown.

Copy link
Member

@lukasolson lukasolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a much needed improvement, thank you! Played around with it locally with different types of warnings/errors and things seemed to dedupe correctly while still showing errors of different types.

My only suggestion would be to increase the 10s debounce to something like 30s or 1m. I've already run into the situation multiple times where that 10s wasn't enough. Maybe we could also consider adding it to the config, but not totally necessary for this PR.

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
data 437.1KB 437.4KB +239.0B

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@mattkime mattkime merged commit 4d301f7 into elastic:main Oct 18, 2022
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Oct 18, 2022
kc13greiner pushed a commit to kc13greiner/kibana that referenced this pull request Oct 18, 2022
* dedupe shard error toasts

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting Feature:Search Querying infrastructure in Kibana release_note:fix v8.6.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants