[Fleet] Block Kibana startup for Fleet setup completion #120616
Comments
Pinging @elastic/fleet (Team:Fleet)
cc @elastic/kibana-core @kobelb Please let us know if there are any other items you'd like to see before we start blocking Kibana boot.
@joshdover LGTM!
Does this mean setup will only run once during Kibana startup?
How much risk is it to schedule this task in the same iteration as the prerequisite in kibana-core? Also, what is the impact on downstream dependencies like fleet-server and e2e-testing? Do they have to change anything? E.g. they are calling the Fleet setup API to check that Fleet is ready. What happens for users who start with a Kibana installation without Fleet and enable Fleet later? Is Fleet setup going to run then? We should test this in cloud/on-prem.
Kibana Core has this scheduled in their current sprint which ends in ~1.5 weeks and a rough plan has been agreed upon. I suspect they won't be a blocker for much longer.
This is designed to not break the setup API. Though calling that API won't be necessary any longer (just verifying Kibana is healthy is good enough now), we won't be removing the setup API at this point (but we should consider deprecating it in a later release).
In 8.0 we started running Fleet setup when Kibana starts up, regardless of whether or not the user is using Fleet. The change in this issue only impacts one aspect, which is to block Kibana's HTTP server from serving traffic until Fleet setup has completed. This is necessary to ensure that users and orchestration layers do not upgrade Elastic Agent instances until 1st-party Stack-aligned Fleet packages (e.g. APM, Synthetics) are upgraded. If Elastic Agent or Fleet Server is upgraded before these packages are upgraded, new fields may be ingested with incorrect mappings, breaking the related application UIs. In summary, this change doesn't change what we're setting up or how we decide to do it; it only improves the robustness of the ingestion layers in the Stack. Also worth noting that since we removed the default policies, Fleet setup does not install any packages or create agent policies in the default self-managed configuration.
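To illustrate the point in the comment above about downstream dependencies: once Kibana blocks startup on Fleet setup, a component such as fleet-server or e2e-testing can simply wait for Kibana to respond instead of polling the Fleet setup API. The sketch below is only an illustration, not code from the linked PRs; it assumes Node 18+ (global `fetch`) and that a 200 from `/api/status` means Kibana's HTTP server is fully up (secured deployments may require credentials on that endpoint).

```ts
// Illustration only, not code from this issue's PRs. Assumes Node 18+
// (global fetch) and that a 200 from /api/status means Kibana's HTTP
// server is fully up (on secured deployments this endpoint may require
// credentials; add them via the `headers` option as needed).
async function waitForKibana(baseUrl: string, timeoutMs = 5 * 60_000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      const res = await fetch(`${baseUrl}/api/status`);
      if (res.ok) {
        // With the change discussed in this issue, reaching this point also
        // implies that Fleet setup has completed.
        return;
      }
    } catch {
      // Connection refused or similar: Kibana is not listening yet.
    }
    await new Promise((resolve) => setTimeout(resolve, 5_000));
  }
  throw new Error(`Kibana at ${baseUrl} was not ready within ${timeoutMs}ms`);
}
```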
It will run every time a Kibana node is restarted, so it's not exactly only once, but it won't be triggered multiple times in a single run of Kibana unless the user manually calls the setup API. I think since it will still run on each startup, #124004 is still relevant?
Sounds right, so we might have to ask users to restart Kibana to refresh/fix the preconfigured policies.
Yes, or use the API manually. I also think we may see this issue go away or resolve itself with the addition of bundled packages used for preconfiguration.
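For reference, "use the API manually" here means re-running Fleet setup via the existing /api/fleet/setup endpoint. A minimal sketch, assuming Node 18+ (global `fetch`) and basic authentication; exact requirements may differ by deployment:

```ts
// Minimal sketch of re-running Fleet setup by hand instead of restarting
// Kibana. Assumes Node 18+ (global fetch) and basic authentication; the
// kbn-xsrf header is required by Kibana for state-changing API requests.
async function rerunFleetSetup(baseUrl: string, username: string, password: string): Promise<void> {
  const auth = Buffer.from(`${username}:${password}`).toString('base64');
  const res = await fetch(`${baseUrl}/api/fleet/setup`, {
    method: 'POST',
    headers: {
      'kbn-xsrf': 'true',
      Authorization: `Basic ${auth}`,
    },
  });
  if (!res.ok) {
    throw new Error(`Fleet setup request failed: ${res.status} ${await res.text()}`);
  }
}
```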
I've been thinking about this more and I think we should consider delaying this change until 8.3 or later. We've had a lot of changes in Fleet's setup logic in the past several releases and I think we could benefit from having additional time to make improvements in testing and getting feedback from customers, support & Cloud before we move forward with blocking Kibana's startup. Blocking Kibana startup carries a high weight of responsibility since anything that goes wrong in our setup code will make the entire UI unusable.
The main motivation for this change is to improve reliability of Stack upgrades by making it more obvious to sysadmins to not upgrade ingest components (specifically, Fleet Server and Elastic Agent) until after Kibana has upgraded the necessary ingest assets, to avoid breaking ingest for Elastic Agent monitoring, Synthetics, and APM. Endpoint is not affected since Agent downloads the version that corresponds to the integration package version. I also believe Synthetics is not affected in practice because Heartbeat has not traditionally had any breaking changes in schema.
This change will not actually enforce that sysadmins do not upgrade Fleet Server or Elastic Agent before Kibana has upgraded the integration packages, so it is not a guaranteed improvement, but a probabilistic one. The window of time between Kibana's UI being available and Fleet setup completing is quite small (we're talking ~20s in the average case, possibly 5 minutes in the worst case), so the window of time a user could hit this scenario is narrow.
It's also worth noting that this change will affect all users of the Stack, whether or not they're using Fleet yet. I think the risks of breaking Kibana for this large population is higher than the risk we're currently taking on by having this small window of time open where it's not obvious to a sysadmin whether or not they can start upgrading the ingest components. I'd like some feedback from affected parties, namely:
@joshdover +1 to delay it; the changes in the setup logic have not been easy, and we will improve.
@joshdover +1 from me to delay as well, and revisit the need for it later. In general we'd like to avoid opening up a mechanism like this from core unless it is a last resort. As you mention, this is something which would affect all Stack users... so I'd prefer to exhaust all other possible options before taking this step.
APM Server has implemented a check for
@simitt To be clear, do we not ever need this, or is it just not as pressing? My understanding was that it's possible APM could get backed up and start dropping traces if there's a delay in completing Fleet setup for some reason. By blocking, we'd be able to more easily enable orchestration layers like Cloud and ECK to delay upgrading Agents or APM Server until Fleet setup has completed in Kibana. Alternatively, we could start publishing a degraded or unavailable health status from Kibana's status API. However, today Cloud does not use this endpoint to decide when to start upgrading the other components, so we'd need to make some changes there.
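On the alternative mentioned above: Kibana core exposes a status service that plugins can use to publish their own health status, which feeds into the overall status reported by GET /api/status. The sketch below is only a rough illustration of that idea, not Fleet's actual implementation; import paths and type names vary between Kibana versions.

```ts
import { BehaviorSubject } from 'rxjs';
// Import paths are version-dependent; older Kibana versions exported these
// from 'src/core/server', newer ones from '@kbn/core/server'.
import type { CoreSetup, Plugin, ServiceStatus } from '@kbn/core/server';
import { ServiceStatusLevels } from '@kbn/core/server';

// Rough illustration: report "degraded" from this plugin until Fleet setup
// completes, instead of blocking the HTTP server. Not the approach that was
// ultimately implemented for this issue.
export class FleetSetupStatusPlugin implements Plugin {
  private readonly status$ = new BehaviorSubject<ServiceStatus>({
    level: ServiceStatusLevels.degraded,
    summary: 'Fleet setup has not completed yet',
  });

  public setup(core: CoreSetup) {
    // core.status.set() lets a plugin publish a custom status observable.
    core.status.set(this.status$);
  }

  public start() {
    // Simplified for the sketch: mark the plugin available once Fleet setup
    // has finished elsewhere in the plugin.
    this.status$.next({
      level: ServiceStatusLevels.available,
      summary: 'Fleet setup complete',
    });
  }

  public stop() {}
}
```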
Silvia is out sick, so I'll take a stab at answering. The check that Silvia mentioned in #120616 (comment) does mean that ingestion will be blocked and start dropping data if installing/upgrading the APM package doesn't happen in a timely manner. I think we would still like this eventually, but not as urgently.
In order to provide a smoother upgrade experience for users of Fleet, we would like to block Kibana startup to install the necessary ingest assets into Elasticsearch. The purpose of blocking is to give admins a clear point in time at which it's safe to start upgrading other components that depend on Fleet ingest assets being upgraded, such as Fleet Server and Elastic Agent.
In order to make this change as non-disruptive as possible, we want to ensure:
To that end, before we start blocking Kibana startup we need to complete the following tasks:
The scope of this issue is to make the following changes in a single PR:
- Add a new `xpack.fleet.setup.max_retries` config that defaults to 5
- Retry Fleet setup up to `max_retries` attempts, then throw an exception and crash Kibana (see the sketch below)
- Remove the call to the `/api/fleet/setup` API from UI code

Optionally, we'd like to improve package install and upgrade performance to minimize the impact of blocking Kibana startup. While this is likely not to be considered a blocker, slow installs may make upgrades more painful or confusing to users. The primary bottleneck in this process is in Elasticsearch, which elastic/elasticsearch#77505 may solve.
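A minimal sketch of the retry-then-crash behavior described in the list above. `runFleetSetup` is a hypothetical stand-in for Fleet's real setup routine, and the linear delay between attempts is an assumption; the issue does not specify a backoff strategy.

```ts
// Minimal sketch, not the shipped implementation. `runFleetSetup` is a
// hypothetical stand-in for Fleet's real setup routine; `maxRetries`
// corresponds to the proposed xpack.fleet.setup.max_retries setting
// (default 5).
async function setupWithRetries(
  runFleetSetup: () => Promise<void>,
  maxRetries = 5
): Promise<void> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      await runFleetSetup();
      return; // Setup succeeded; allow Kibana startup to proceed.
    } catch (error) {
      lastError = error;
      // Simple linear delay between attempts (assumed for the sketch).
      await new Promise((resolve) => setTimeout(resolve, 1_000 * attempt));
    }
  }
  // Rethrowing here is what would crash Kibana on startup.
  throw lastError;
}
```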