[Fleet] Have finer-grained error handling for errors during /api/fleet/setup #91864
Comments
Pinging @elastic/fleet (Feature:Fleet)
Pinging @elastic/fleet (Team:Fleet)
I think (1) will lead to more user-facing errors and new failure modes. EPM has always had the requirement that all side effects from startup were successfully completed. Each service assumes/requires that everything from startup is in place. If we start with that, it should get some thorough testing, even if just manual.

In addition to more fine-grained errors (possibly before them) we'll want finer-grained status tracking as well. I mentioned something about this in #70008 and started #70333 to track it.

My current thinking is to track each of the items from kibana/x-pack/plugins/fleet/server/services/setup.ts (lines 65 to 69 in 9870ade) and allow services to declare/wait on the parts they need, e.g. ensure the default packages have been installed, or the default agent policy exists, etc.
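The declare/wait idea could be sketched roughly as follows. This is a minimal illustration, not Fleet's actual code: the `SetupStatusTracker` class, its methods, and the step names are all hypothetical, and the real list of steps would come from `setup.ts`.

```typescript
// Hypothetical step names, chosen for illustration only.
type SetupStep = 'defaultPackages' | 'defaultAgentPolicy' | 'defaultOutput';

type StepState =
  | { status: 'pending' }
  | { status: 'done' }
  | { status: 'failed'; error: Error };

class SetupStatusTracker {
  private states = new Map<SetupStep, StepState>();
  private waiters = new Map<SetupStep, Array<(state: StepState) => void>>();

  markDone(step: SetupStep): void {
    this.settle(step, { status: 'done' });
  }

  markFailed(step: SetupStep, error: Error): void {
    this.settle(step, { status: 'failed', error });
  }

  // Steps that were never reported count as pending.
  getState(step: SetupStep): StepState {
    return this.states.get(step) ?? { status: 'pending' };
  }

  // Resolves once every requested step is done; rejects if any of them failed.
  async waitFor(...steps: SetupStep[]): Promise<void> {
    const results = await Promise.all(steps.map((step) => this.whenSettled(step)));
    const failed = results.find((result) => result.status === 'failed');
    if (failed !== undefined && failed.status === 'failed') {
      throw failed.error;
    }
  }

  // Record the outcome of a step and wake up everything waiting on it.
  private settle(step: SetupStep, state: StepState): void {
    this.states.set(step, state);
    for (const resolve of this.waiters.get(step) ?? []) {
      resolve(state);
    }
    this.waiters.delete(step);
  }

  private whenSettled(step: SetupStep): Promise<StepState> {
    const current = this.getState(step);
    if (current.status !== 'pending') {
      return Promise.resolve(current);
    }
    return new Promise<StepState>((resolve) => {
      const list = this.waiters.get(step) ?? [];
      list.push(resolve);
      this.waiters.set(step, list);
    });
  }
}
```

With something like this, a service that only needs the default agent policy would call `waitFor('defaultAgentPolicy')` and would not be affected by a failure in an unrelated step.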
IIRC this is where most of the reports of a blocked Fleet UI have been sighted. I would like to suggest that we start diving into this specific problem of failed default package upgrades first, before discussing the right error handling mechanism for everything.

IMO failed upgrades of default packages should not block any parts of the UI. The older package version was already installed, had been working, and upgrading it does not actually apply it to any existing agent policy, so there's no reason to treat it as a fatal error.

Then the question is, what scenarios have we encountered so far where a default package upgrade failed? Here are some that come to mind, though I'm sure we've encountered others:
During my investigation for #89436, I recall that the underlying problem is that when Fleet encounters an error on upgrading a default package, it will roll back to the previous version of the package, but […]

Even nastier, the rollback itself could fail 😬 (such was the case in #89436). It could be that the upgrade was partially applied, encountered an error, backed off to rollback, but then encountered a conflict applying the older version of some installable on top of the new version that got installed from the partial upgrade. Then we have a package in limbo: no version is reported as fully installed and […]

Here is the code for handling package installation failures, you can see that for […]

I see these being our next steps:
One clarification about (3). We have a distinction between default and required packages: kibana/x-pack/plugins/fleet/common/constants/epm.ts (lines 15 to 22 in fe35e0d).

Default packages are installed on setup, and required packages cannot be deleted. I'll try to find the discussion from when we added required to block removal so we have that context.
I remember the discussion. Since then there were multiple occasions where Fleet was unusable because of a package update problem on […]

Thought of this ticket or my comment about permissions while reading #93051 (comment)

Done by #97404 (thank you Sonja)
Currently, the `/api/fleet/setup` endpoint is called whenever the Fleet UI is opened, and it attempts to update the mandatory packages `endpoint` and `system`. If an error occurs during these updates, the Fleet UI is blocked and can't be used at all.

This issue is to change this, and instead show a very prominent warning when an error occurred during setup, but still also show the standard Fleet UI so that users can continue exploring their Fleet setup.
I would propose the following changes:

- […] `initializationError` here, or only have an `initializationError` for really blocking failures (that would need to be defined): https://github.com/elastic/kibana/blob/master/x-pack/plugins/fleet/public/applications/fleet/app.tsx#L153
- […] `/api/fleet/setup` return non-blocking errors as part of its normal response and only return an error on really blocking errors (see above)
- […] `FleetStatusProvider`, or a yet-to-create status provider to the UI

(1) A first implementation step could be to treat all errors that currently block the UI as non-blocking, to make the Fleet UI more friendly to users.

(2) A next step would then be to classify the errors, disable parts of the UI conditionally if certain errors are present (if necessary), and add more specific help text based on the kind of error. Example: the `transform` errors described in #91570 have a very specific cause that can (IIUC) only be fixed by changing the cluster configuration.

(3) Another step would be to identify errors that still should block the Fleet UI completely, because no sensible actions are possible anyway, or no data can be shown. (This would be similar to how we currently block the UI when the fleet user role hasn't been created yet.)

I think (1) could be implemented fairly quickly, but I don't know if it improves the situation at all if we haven't thought about (2). Showing the Fleet UI is not helpful if it is broken because of a missing setup step. How likely is this to happen?
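
To make the blocking/non-blocking split in steps (1) to (3) more concrete, here is a rough sketch of how setup issues might be classified and turned into a UI decision. Everything in it is an assumption for illustration: the issue kinds, the `summarizeSetup` and `decideUi` helpers, and the choice of which kind blocks the UI are hypothetical and do not reflect the current `/api/fleet/setup` response.

```typescript
// Hypothetical issue kinds, loosely based on the examples in this issue.
type SetupIssueKind =
  | 'package_upgrade_failed'
  | 'transform_not_permitted'
  | 'fleet_user_role_missing';

interface SetupIssue {
  kind: SetupIssueKind;
  message: string;
}

// Either setup is usable (possibly with warnings) or it is blocked.
type SetupResult =
  | { ok: true; nonFatalErrors: SetupIssue[] }
  | { ok: false; blockingErrors: SetupIssue[] };

// Assumption: only a missing fleet user role blocks the UI entirely.
const BLOCKING_KINDS = new Set<SetupIssueKind>(['fleet_user_role_missing']);

// Server side: report non-blocking issues in the normal result and
// reserve the error result for issues that really block Fleet.
function summarizeSetup(issues: SetupIssue[]): SetupResult {
  const blockingErrors = issues.filter((issue) => BLOCKING_KINDS.has(issue.kind));
  if (blockingErrors.length > 0) {
    return { ok: false, blockingErrors };
  }
  return { ok: true, nonFatalErrors: issues };
}

interface UiDecision {
  showFleetUi: boolean;
  warnings: string[];
}

// Client side: keep showing the Fleet UI unless setup is blocked,
// and surface every message as a prominent warning.
function decideUi(result: SetupResult): UiDecision {
  if (!result.ok) {
    return { showFleetUi: false, warnings: result.blockingErrors.map((e) => e.message) };
  }
  return { showFleetUi: true, warnings: result.nonFatalErrors.map((e) => e.message) };
}
```

Under these assumptions, a failed `system` upgrade would produce a warning while the UI stays usable, whereas a missing fleet user role would still block it. Step (1) corresponds to an empty set of blocking kinds, and steps (2) and (3) to refining that set and the per-kind help text.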
@jen-huang @kevinlog @nchaulet I would be most interested in your thoughts, please comment!