
[Fleet] Have finer-grained error handling for errors during /api/fleet/setup #91864

Closed
skh opened this issue Feb 18, 2021 · 10 comments
@skh
Contributor

skh commented Feb 18, 2021

Currently, the /api/fleet/setup endpoint is called whenever the Fleet UI is opened, and it attempts to update the mandatory packages, endpoint and system. If an error occurs during these updates, the Fleet UI is blocked and can't be used at all.

This issue is to change this and instead show a very prominent warning when an error occurred during setup, while still showing the standard Fleet UI so that users can continue exploring their Fleet setup.

I would propose the following changes:

  • do not check for initializationError here, or reserve initializationError for truly blocking failures (which would need to be defined): https://github.com/elastic/kibana/blob/master/x-pack/plugins/fleet/public/applications/fleet/app.tsx#L153
  • have /api/fleet/setup return non-blocking errors as part of its normal response and only return an error for truly blocking failures (see above; a possible response shape is sketched after this list)
  • make the non-blocking errors available to the UI through FleetStatusProvider, or a yet-to-be-created status provider
  • show the non-blocking errors very prominently on every page in the Fleet UI, along with some helpful text on how to fix them
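
To make the proposal concrete, here is a minimal sketch of what such a response could look like. All names here (FleetSetupResponse, nonFatalErrors, formatSetupWarnings) are hypothetical illustrations, not the actual Kibana types:

// Hypothetical response shape for /api/fleet/setup: the request succeeds
// even when some setup steps failed, and the failures travel in the body
// instead of turning into a blocking HTTP error.
interface FleetSetupResponse {
  isInitialized: boolean;
  nonFatalErrors: Array<{
    step: 'package_upgrade' | 'default_output' | 'agent_policy' | 'settings';
    message: string;
  }>;
}

// Hypothetical consumer: a status provider could expose these errors so
// every Fleet page can render a prominent warning while staying usable.
function formatSetupWarnings(status: FleetSetupResponse): string[] {
  return status.nonFatalErrors.map(
    (e) => `Setup step "${e.step}" failed: ${e.message}`
  );
}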

(1) A first implementation step could be to treat all errors that currently block the UI as non-blocking, to make the Fleet UI more friendly to users.

(2) A next step would then be to classify the errors, disable parts of the UI conditionally if certain errors are present (if necessary), and add more specific help text based on the kind of error. Example: the transform errors described in #91570 have a very specific cause that can (IIUC) only be fixed by changing the cluster configuration.

(3) Another step would be to identify errors that still should block the Fleet UI completely, because no sensible actions are possible anyway, or no data can be shown. (This would be similar to how we currently block the UI when the fleet user role hasn't been created yet.)


I think (1) could be implemented fairly quickly, but I don't know if it improves the situation at all if we haven't thought about (2). Showing the Fleet UI is not helpful if it is broken because of a missing setup step. How likely is this to happen?

@jen-huang @kevinlog @nchaulet I would be most interested in your thoughts, please comment!

@skh skh added Feature:Fleet Fleet team's agent central management project Team:Fleet Team label for Observability Data Collection Fleet team labels Feb 18, 2021
@skh skh self-assigned this Feb 18, 2021
@elasticmachine
Contributor

Pinging @elastic/fleet (Feature:Fleet)

@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@jfsiii
Contributor

jfsiii commented Feb 18, 2021

I think (1) will lead to more user-facing errors and new failure modes. EPM has always required that all side effects from startup complete successfully, and each service assumes/requires that everything from startup is in place. If we start with (1), it should get some thorough testing, even if just manual.

In addition to more fine-grained errors (possibly before them), we'll want finer-grained status tracking as well.

I mentioned something about this in #70008

We likely want to eventually split the status up into more granular settings like "hasDefaultPackages", "hasDefaultOutput", "hasDefaultConfig", etc but I think this works

And started #70333 to track it. My current thinking is to track each of the items from createSetupSideEffects

ensureInstalledDefaultPackages(soClient, callCluster),
outputService.ensureDefaultOutput(soClient),
agentPolicyService.ensureDefaultAgentPolicy(soClient, esClient),
updateFleetRoleIfExists(callCluster),
settingsService.getSettings(soClient).catch((e: any) => {

and allow services to declare/wait on the parts they need, e.g. ensure the default packages have been installed, or the default agent policy exists, etc.
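
A minimal sketch of what that per-step tracking could look like, with hypothetical names (SetupStep, startSetupStep, awaitSetupStep) rather than the actual Fleet code:

// Hypothetical per-step setup tracking: each side effect from
// createSetupSideEffects registers under a name, and services await only
// the steps they actually depend on, instead of one all-or-nothing flag.
type SetupStep =
  | 'defaultPackages'
  | 'defaultOutput'
  | 'defaultAgentPolicy'
  | 'fleetRole'
  | 'settings';

const setupSteps = new Map<SetupStep, Promise<void>>();

function startSetupStep(name: SetupStep, task: () => Promise<void>): void {
  setupSteps.set(name, task());
}

// e.g. a service that only needs the default packages waits on just that
// step; the awaited promise rejects if that particular side effect failed.
async function awaitSetupStep(name: SetupStep): Promise<void> {
  const step = setupSteps.get(name);
  if (!step) {
    throw new Error(`Setup step "${name}" was never started`);
  }
  await step;
}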

@jen-huang
Contributor

jen-huang commented Feb 19, 2021

it attempts to update the mandatory packages, endpoint and system. If an error occurs during these updates, the Fleet UI is blocked and can't be used at all.

IIRC this is where most of the reports of a blocked Fleet UI have come from. I would like to suggest that we start by diving into this specific problem of failed default package upgrades, before discussing the right error handling mechanism for everything /setup does. I'm not saying we shouldn't (I like both @skh's and @jfsiii's proposals), but package upgrade failure is where we've seen the highest-severity errors, and the devil is in the details in this area, so let's dive into what's going on there first 👀

IMO failed upgrades of default packages should not block any parts of the UI. The older package version was already installed and had been working, and upgrading it does not actually apply the new version to any existing agent policy, so there's no reason to treat the failure as fatal.

Then the question is, what scenarios have we encountered so far where a default package upgrade failed? Here are some that come to mind, though I'm sure we've encountered others:

During my investigation for #89436, I recall that the underlying problem is that when Fleet encounters an error upgrading a default package, it rolls back to the previous version of the package, but /setup will always try to upgrade the package again, effectively getting stuck in an upgrade->rollback->upgrade loop.

Even nastier, the rollback itself could fail 😬 (such was the case in #89436). It could be that the upgrade was partially applied, encountered an error, and fell back to a rollback, which then hit a conflict applying the older version of some installable on top of the new version left behind by the partial upgrade. Then we have a package in limbo: no version is reported as fully installed and /setup times out.

Here is the code for handling package installation failures; you can see that for installs or reinstalls, we simply attempt to wipe the installation if it fails. In the case of update (upgrade) failures, we trigger a rollback, which actually appears to be just a simple installation of the previous version, though this code path is probably a bit hairier than that. I wonder if this goes into a loop (or some other long-running process) at some point if the rollback itself fails, causing the /setup timeout? Our error handling changes will be for naught if we can't even get a proper response back! 😉
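
To make the failure mode concrete, here is a simplified sketch of the loop as described above. All helpers here are hypothetical stand-ins, not the actual install code:

// Hypothetical helpers standing in for the real registry/installer calls.
declare function fetchLatestVersion(name: string): Promise<string>;
declare function upgradePackage(name: string, version: string): Promise<void>;
declare function installPackage(name: string, version: string): Promise<void>;

// Simplified illustration of the loop: /setup always tries to move default
// packages to the latest version, and a failed upgrade rolls back, so every
// /setup call repeats the same cycle.
async function ensureLatestVersion(pkg: {
  name: string;
  installedVersion: string;
}): Promise<void> {
  const latest = await fetchLatestVersion(pkg.name);
  if (pkg.installedVersion === latest) return;

  try {
    await upgradePackage(pkg.name, latest);
  } catch (upgradeError) {
    // Rollback is effectively a plain install of the previous version, so it
    // can conflict with assets left behind by the partial upgrade and fail
    // too, leaving no version fully installed.
    await installPackage(pkg.name, pkg.installedVersion);
    // Nothing records the failure, so the next /setup call attempts the same
    // upgrade again and the cycle repeats.
    throw upgradeError;
  }
}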

I see these being our next steps:

  1. Assuming rollback is successful, go with @skh's proposal to return non-blocking error information to the UI, to let the user know that we weren't able to upgrade X package(s)
    • Should we add some sort of persistent flag (maybe even just in-memory?) to record which package versions failed to upgrade, so that we don't retry the upgrade on every /setup call? If we don't, this will cause long loading times for the user on every Fleet load until when and if the package ever gets upgraded successfully. (A minimal sketch follows this list.)
  2. Assuming rollback is not successful, well… ideally, this shouldn't happen 🙃 Last known working version should be just that. More investigation is needed into what happens today if rollback fails, to see what we can do to make sure that doesn't happen (do we need to "freeze" the old version better? or wipe the partially upgraded version better? can we do a "dry run" upgrade attempt instead?).
    • Frankly, I'm not sure what error handling we should do if the rollback itself ends up failing. Show non-blocking UI like above and introduce a new package installation status akin to "something went terribly wrong here"??
  3. We're currently very strict on not allowing default packages to be deleted from the UI or API. Let's at least modify the delete API to accept a {force: true} param (similar to the param that forces installation of older versions) to allow default packages to be deleted. This will give us a debugging tool and an escape hatch for live deployments.
  4. Add tests that mock upgrade/rollback failures and provide coverage of what we expect /setup to return for those scenarios.
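
For the retry guard in step 1, a minimal in-memory sketch (names are hypothetical, not the actual Fleet code):

// Hypothetical in-memory guard: remember package versions whose upgrade
// already failed so /setup doesn't retry them on every Fleet page load.
const failedUpgrades = new Map<string, string>(); // package name -> target version

function shouldAttemptUpgrade(name: string, targetVersion: string): boolean {
  // A new target version clears the block, so the upgrade is retried as
  // soon as the registry publishes a newer (hopefully fixed) version.
  return failedUpgrades.get(name) !== targetVersion;
}

function recordFailedUpgrade(name: string, targetVersion: string): void {
  failedUpgrades.set(name, targetVersion);
}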

@jfsiii
Contributor

jfsiii commented Feb 19, 2021

One clarification about (3). We have a distinction between default and required packages:

export const requiredPackages = {
System: 'system',
Endpoint: 'endpoint',
ElasticAgent: 'elastic_agent',
} as const;
// these are currently identical. we can separate if they later diverge
export const defaultPackages = requiredPackages;

Default packages are installed on setup, and required packages cannot be deleted.
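
For reference, a hedged sketch of how a removal guard could honor the force flag proposed in step 3 above (names here are illustrative, not the actual Fleet code):

// Illustrative guard: required packages normally cannot be removed, but a
// force option provides an escape hatch for debugging broken installations.
const REQUIRED_PACKAGES = new Set(['system', 'endpoint', 'elastic_agent']);

function assertRemovable(pkgName: string, options: { force?: boolean } = {}): void {
  if (REQUIRED_PACKAGES.has(pkgName) && !options.force) {
    throw new Error(
      `${pkgName} is a required package and can only be removed with force: true`
    );
  }
}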

I'll try to find the discussion from when we introduced required packages to block removal, so we have that context.

@skh
Contributor Author

skh commented Feb 22, 2021

Default packages are installed on setup, and required packages cannot be deleted.

I'll try to find the discussion from when we introduced required packages to block removal, so we have that context.

I remember the discussion. Since then, there have been multiple occasions where Fleet was unusable because of a package update problem on system or endpoint, and a forced uninstall and manual reinstall would have been a welcome option in the troubleshooting process. So I agree with @jen-huang that we should reassess that decision.

@ph
Contributor

ph commented Feb 25, 2021

@skh @jfsiii @nchaulet @ruflin Do we have a consensus around the discussion? What are we missing to make progress?

@jfsii

jfsii commented Feb 25, 2021

@skh @jfsii @nchaulet @ruflin Do we have a consensus around the discussion? What are we missing to make progress?

@jfsiii ^ one more i

@jfsiii
Contributor

jfsiii commented Mar 5, 2021

I thought of this ticket and my comment about permissions while reading #93051 (comment).

@jen-huang
Contributor

jen-huang commented Apr 28, 2021

Done by #97404 (thank you Sonja ☺️).
