
[Fleet] Have finer-grained error handling for errors during /api/fleet/setup #91864

Closed
skh opened this issue Feb 18, 2021 · 10 comments
@skh
Contributor

skh commented Feb 18, 2021

Currently, the /api/fleet/setup endpoint is called whenever the Fleet UI is opened, and it attempts to update the mandatory packages, endpoint and system. If an error occurs during these updates, the Fleet UI is blocked and can't be used at all.

This issue is to change this and instead show a very prominent warning when an error occurred during setup, while still showing the standard Fleet UI so that users can continue exploring their Fleet setup.

I would propose the following changes:

  • do not check for initializationError here, or reserve initializationError for truly blocking failures (which would need to be defined): https://github.com/elastic/kibana/blob/master/x-pack/plugins/fleet/public/applications/fleet/app.tsx#L153
  • have /api/fleet/setup return non-blocking errors as part of its normal response and only return an error for truly blocking failures (see above; a possible response shape is sketched after this list)
  • make the non-blocking errors available to the UI through FleetStatusProvider, or a yet-to-be-created status provider
  • show the non-blocking errors very prominently on every page in the Fleet UI, along with some helpful text on how to fix them
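
To make the proposal concrete, here is a minimal sketch of what such a response could look like. All names here (FleetSetupResponse, nonFatalErrors, formatSetupWarnings) are hypothetical illustrations, not the actual Kibana types:

// Hypothetical response shape for /api/fleet/setup: the request succeeds
// even when some setup steps failed, and the failures travel in the body
// instead of turning into a blocking HTTP error.
interface FleetSetupResponse {
  isInitialized: boolean;
  nonFatalErrors: Array<{
    step: 'package_upgrade' | 'default_output' | 'agent_policy' | 'settings';
    message: string;
  }>;
}

// Hypothetical consumer: a status provider could expose these errors so
// every Fleet page can render a prominent warning while staying usable.
function formatSetupWarnings(status: FleetSetupResponse): string[] {
  return status.nonFatalErrors.map(
    (e) => `Setup step "${e.step}" failed: ${e.message}`
  );
}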

(1) A first implementation step could be to treat all errors that currently block the UI as non-blocking, to make the Fleet UI more friendly to users.

(2) A next step would then be to classify the errors, disable parts of the UI conditionally if certain errors are present (if necessary), and add more specific help text based on the kind of error. Example: the transform errors described in #91570 have a very specific cause that can (IIUC) only be fixed by changing the cluster configuration.

(3) Another step would be to identify errors that still should block the Fleet UI completely, because no sensible actions are possible anyway, or no data can be shown. (This would be similar to how we currently block the UI when the fleet user role hasn't been created yet.)


I think (1) could be implemented fairly quickly, but I don't know if it improves the situation at all if we haven't thought about (2). Showing the Fleet UI is not helpful if it is broken because of a missing setup step. How likely is this to happen?

@jen-huang @kevinlog @nchaulet I would be most interested in your thoughts, please comment!

@skh skh added Feature:Fleet Fleet team's agent central management project Team:Fleet Team label for Observability Data Collection Fleet team labels Feb 18, 2021
@skh skh self-assigned this Feb 18, 2021
@elasticmachine
Contributor

Pinging @elastic/fleet (Feature:Fleet)

@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@jfsiii
Contributor

jfsiii commented Feb 18, 2021

I think (1) will lead to more user-facing errors and new failure modes. EPM has always required that all side effects from startup complete successfully, and each service assumes/requires that everything from startup is in place. If we start with (1), it should get some thorough testing, even if just manual.

In addition to more fine-grained errors (possibly before them), we'll want finer-grained status tracking as well.

I mentioned something about this in #70008

We likely want to eventually split the status up into more granular settings like "hasDefaultPackages", "hasDefaultOutput", "hasDefaultConfig", etc but I think this works

And started #70333 to track it. My current thinking is to track each of the items from createSetupSideEffects

ensureInstalledDefaultPackages(soClient, callCluster),
outputService.ensureDefaultOutput(soClient),
agentPolicyService.ensureDefaultAgentPolicy(soClient, esClient),
updateFleetRoleIfExists(callCluster),
settingsService.getSettings(soClient).catch((e: any) => {

and allow services to declare/wait on the parts they need, e.g. ensure the default packages have been installed, or the default agent policy exists, etc.
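
A minimal sketch of what that per-step tracking could look like, with hypothetical names (SetupStep, startSetupStep, awaitSetupStep) rather than the actual Fleet code:

// Hypothetical per-step setup tracking: each side effect from
// createSetupSideEffects registers under a name, and services await only
// the steps they actually depend on, instead of one all-or-nothing flag.
type SetupStep =
  | 'defaultPackages'
  | 'defaultOutput'
  | 'defaultAgentPolicy'
  | 'fleetRole'
  | 'settings';

const setupSteps = new Map<SetupStep, Promise<void>>();

function startSetupStep(name: SetupStep, task: () => Promise<void>): void {
  setupSteps.set(name, task());
}

// e.g. a service that only needs the default packages waits on just that
// step; the awaited promise rejects if that particular side effect failed.
async function awaitSetupStep(name: SetupStep): Promise<void> {
  const step = setupSteps.get(name);
  if (!step) {
    throw new Error(`Setup step "${name}" was never started`);
  }
  await step;
}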

@jen-huang
Contributor

jen-huang commented Feb 19, 2021

it attempts to update the mandatory packages, endpoint and system. If an error occurs during these updates, the Fleet UI is blocked and can't be used at all.

IIRC this is where most of the reports of a blocked Fleet UI have come from. I would like to suggest that we start by diving into this specific problem of failed default package upgrades, before discussing the right error handling mechanism for everything /setup does. I'm not saying we shouldn't (I like both @skh's and @jfsiii's proposals), but package upgrade failure is where we've seen the highest-severity errors, and the devil is in the details in this area, so let's dive into what's going on there first 👀

IMO failed upgrades of default packages should not block any parts of the UI. The older package version was already installed and had been working, and upgrading it does not actually apply the new version to any existing agent policy, so there's no reason to treat the failure as fatal.

Then the question is, what scenarios have we encountered so far where a default package upgrade failed? Here are some that come to mind, though I'm sure we've encountered others:

During my investigation for #89436, I recall that the underlying problem is that when Fleet encounters an error upgrading a default package, it rolls back to the previous version of the package, but /setup will always try to upgrade the package again, effectively getting stuck in an upgrade->rollback->upgrade loop.

Even nastier, the rollback itself could fail 😬 (such was the case in #89436). It could be that the upgrade was partially applied, encountered an error, and fell back to a rollback, which then hit a conflict applying the older version of some installable on top of the new version left behind by the partial upgrade. Then we have a package in limbo: no version is reported as fully installed and /setup times out.

Here is the code for handling package installation failures; you can see that for installs or reinstalls, we simply attempt to wipe the installation if it fails. In the case of update (upgrade) failures, we trigger a rollback, which actually appears to be just a simple installation of the previous version, though this code path is probably a bit hairier than that. I wonder if this goes into a loop (or some other long-running process) at some point if the rollback itself fails, causing the /setup timeout? Our error handling changes will be for naught if we can't even get a proper response back! 😉
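
To make the failure mode concrete, here is a simplified sketch of the loop as described above. All helpers here are hypothetical stand-ins, not the actual install code:

// Hypothetical helpers standing in for the real registry/installer calls.
declare function fetchLatestVersion(name: string): Promise<string>;
declare function upgradePackage(name: string, version: string): Promise<void>;
declare function installPackage(name: string, version: string): Promise<void>;

// Simplified illustration of the loop: /setup always tries to move default
// packages to the latest version, and a failed upgrade rolls back, so every
// /setup call repeats the same cycle.
async function ensureLatestVersion(pkg: {
  name: string;
  installedVersion: string;
}): Promise<void> {
  const latest = await fetchLatestVersion(pkg.name);
  if (pkg.installedVersion === latest) return;

  try {
    await upgradePackage(pkg.name, latest);
  } catch (upgradeError) {
    // Rollback is effectively a plain install of the previous version, so it
    // can conflict with assets left behind by the partial upgrade and fail
    // too, leaving no version fully installed.
    await installPackage(pkg.name, pkg.installedVersion);
    // Nothing records the failure, so the next /setup call attempts the same
    // upgrade again and the cycle repeats.
    throw upgradeError;
  }
}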

I see these being our next steps:

  1. Assuming rollback is successful, go with @skh's proposal to return non-blocking error information to the UI, to let the user know that we weren't able to upgrade X package(s)
    • Should we add some sort of persistent flag (maybe even just in-memory?) to record which package versions failed to upgrade, so that we don't retry the upgrade on every /setup call? If we don't, this will cause long loading times for the user on every Fleet load until when and if the package ever gets upgraded successfully. (A minimal sketch follows this list.)
  2. Assuming rollback is not successful, well… ideally, this shouldn't happen 🙃 Last known working version should be just that. More investigation is needed into what happens today if rollback fails, to see what we can do to make sure that doesn't happen (do we need to "freeze" the old version better? or wipe the partially upgraded version better? can we do a "dry run" upgrade attempt instead?).
    • Frankly, I'm not sure what error handling we should do if the rollback itself ends up failing. Show non-blocking UI like above and introduce a new package installation status akin to "something went terribly wrong here"??
  3. We're currently very strict on not allowing default packages to be deleted from the UI or API. Let's at least modify the delete API to accept a {force: true} param (similar to the param that forces installation of older versions) to allow default packages to be deleted. This will give us a debugging tool and an escape hatch for live deployments.
  4. Add tests that mock upgrade/rollback failures and provide coverage of what we expect /setup to return for those scenarios.
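
For the retry guard in step 1, a minimal in-memory sketch (names are hypothetical, not the actual Fleet code):

// Hypothetical in-memory guard: remember package versions whose upgrade
// already failed so /setup doesn't retry them on every Fleet page load.
const failedUpgrades = new Map<string, string>(); // package name -> target version

function shouldAttemptUpgrade(name: string, targetVersion: string): boolean {
  // A new target version clears the block, so the upgrade is retried as
  // soon as the registry publishes a newer (hopefully fixed) version.
  return failedUpgrades.get(name) !== targetVersion;
}

function recordFailedUpgrade(name: string, targetVersion: string): void {
  failedUpgrades.set(name, targetVersion);
}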

@jfsiii
Contributor

jfsiii commented Feb 19, 2021

One clarification about (3). We have a distinction between default and required packages:

export const requiredPackages = {
System: 'system',
Endpoint: 'endpoint',
ElasticAgent: 'elastic_agent',
} as const;
// these are currently identical. we can separate if they later diverge
export const defaultPackages = requiredPackages;

Default packages are installed on setup, and required packages cannot be deleted.
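
For reference, a hedged sketch of how a removal guard could honor the force flag proposed in step 3 above (names here are illustrative, not the actual Fleet code):

// Illustrative guard: required packages normally cannot be removed, but a
// force option provides an escape hatch for debugging broken installations.
const REQUIRED_PACKAGES = new Set(['system', 'endpoint', 'elastic_agent']);

function assertRemovable(pkgName: string, options: { force?: boolean } = {}): void {
  if (REQUIRED_PACKAGES.has(pkgName) && !options.force) {
    throw new Error(
      `${pkgName} is a required package and can only be removed with force: true`
    );
  }
}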

I'll try to find the discussion from when we introduced required packages to block removal, so we have that context.

@skh
Contributor Author

skh commented Feb 22, 2021

Default packages are installed on setup, and required packages cannot be deleted.

I'll try to find the discussion from when we introduced required packages to block removal, so we have that context.

I remember the discussion. Since then, there have been multiple occasions where Fleet was unusable because of a package update problem on system or endpoint, and a forced uninstall and manual reinstall would have been a welcome option in the troubleshooting process. So I agree with @jen-huang that we should reassess that decision.

@ph
Contributor

ph commented Feb 25, 2021

@skh @jfsiii @nchaulet @ruflin Do we have a consensus around the discussion? What are we missing to make progress?

@jfsii

jfsii commented Feb 25, 2021

@skh @jfsii @nchaulet @ruflin Do we have a consensus around the discussion? What are we missing to make progress?

@jfsiii ^ one more i

@jfsiii
Contributor

jfsiii commented Mar 5, 2021

I thought of this ticket and my comment about permissions while reading #93051 (comment).

@jen-huang
Contributor

jen-huang commented Apr 28, 2021

Done by #97404 (thank you Sonja ☺️).
