[Fleet] Block Kibana startup for Fleet setup completion #120616
Comments
Pinging @elastic/fleet (Team:Fleet)
cc @elastic/kibana-core @kobelb Please let us know if there are any other items you'd like to see before we start blocking Kibana boot.
@joshdover LGTM!
Does this mean setup will only run once during Kibana startup?
How much risk is it to schedule this task in the same iteration as the prerequisite in kibana-core? Also, what is the impact on downstream dependencies like fleet-server and e2e-testing? Do they have to change anything? E.g. they are calling the Fleet setup API to check that Fleet is ready. What happens for users who start with a Kibana installation without Fleet and enable Fleet later? Is Fleet setup going to run then? We should test this in cloud/on-prem.
Kibana Core has this scheduled in their current sprint which ends in ~1.5 weeks and a rough plan has been agreed upon. I suspect they won't be a blocker for much longer.
This is designed to not break the setup API. Though calling that API won't be necessary any longer (just verifying Kibana is healthy is good enough now), we won't be removing the setup API at this point (but we should consider deprecating it in a later release).
In 8.0 we started running Fleet setup when Kibana starts up, regardless of whether or not the user is using Fleet. The change in this issue only impacts one aspect, which is to block Kibana's HTTP server from serving traffic until Fleet setup has completed. This is necessary to ensure that users and orchestration layers do not upgrade Elastic Agent instances until 1st-party Stack-aligned Fleet packages (e.g. APM, Synthetics) are upgraded. If Elastic Agent or Fleet Server is upgraded before these packages are upgraded, new fields may be ingested with incorrect mappings, breaking the related application UIs. In summary, this change doesn't change what we're setting up or how we decide to do it; it only improves the robustness of the ingestion layers in the Stack. Also worth noting that since we removed the default policies, Fleet setup does not install any packages or create agent policies in the default self-managed configuration.
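To illustrate the point in the comment above about downstream dependencies: once Kibana blocks startup on Fleet setup, a component such as fleet-server or e2e-testing can simply wait for Kibana to respond instead of polling the Fleet setup API. The sketch below is only an illustration, not code from the linked PRs; it assumes Node 18+ (global `fetch`) and that a 200 from `/api/status` means Kibana's HTTP server is fully up (secured deployments may require credentials on that endpoint).

```ts
// Illustration only, not code from this issue's PRs. Assumes Node 18+
// (global fetch) and that a 200 from /api/status means Kibana's HTTP
// server is fully up (on secured deployments this endpoint may require
// credentials; add them via the `headers` option as needed).
async function waitForKibana(baseUrl: string, timeoutMs = 5 * 60_000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      const res = await fetch(`${baseUrl}/api/status`);
      if (res.ok) {
        // With the change discussed in this issue, reaching this point also
        // implies that Fleet setup has completed.
        return;
      }
    } catch {
      // Connection refused or similar: Kibana is not listening yet.
    }
    await new Promise((resolve) => setTimeout(resolve, 5_000));
  }
  throw new Error(`Kibana at ${baseUrl} was not ready within ${timeoutMs}ms`);
}
```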
It will run every time a Kibana node is restarted, so it's not exactly only once, but it won't be triggered multiple times in a single run of Kibana unless the user manually calls the setup API. I think since it will still run on each startup, #124004 is still relevant?
Sounds right, so we might have to ask users to restart Kibana to refresh/fix the preconfigured policies.
Yes, or use the API manually. I also think we may see this issue go away or resolve itself with the addition of bundled packages used for preconfiguration.
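For reference, "use the API manually" here means re-running Fleet setup via the existing /api/fleet/setup endpoint. A minimal sketch, assuming Node 18+ (global `fetch`) and basic authentication; exact requirements may differ by deployment:

```ts
// Minimal sketch of re-running Fleet setup by hand instead of restarting
// Kibana. Assumes Node 18+ (global fetch) and basic authentication; the
// kbn-xsrf header is required by Kibana for state-changing API requests.
async function rerunFleetSetup(baseUrl: string, username: string, password: string): Promise<void> {
  const auth = Buffer.from(`${username}:${password}`).toString('base64');
  const res = await fetch(`${baseUrl}/api/fleet/setup`, {
    method: 'POST',
    headers: {
      'kbn-xsrf': 'true',
      Authorization: `Basic ${auth}`,
    },
  });
  if (!res.ok) {
    throw new Error(`Fleet setup request failed: ${res.status} ${await res.text()}`);
  }
}
```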
I've been thinking about this more and I think we should consider delaying this change until 8.3 or later. We've had a lot of changes in Fleet's setup logic in the past several releases and I think we could benefit from having additional time to make improvements in testing and getting feedback from customers, support & Cloud before we move forward with blocking Kibana's startup. Blocking Kibana startup carries a high weight of responsibility since anything that goes wrong in our setup code will make the entire UI unusable.
The main motivation for this change is to improve reliability of Stack upgrades by making it more obvious to sysadmins to not upgrade ingest components (specifically, Fleet Server and Elastic Agent) until after Kibana has upgraded the necessary ingest assets, to avoid breaking ingest for Elastic Agent monitoring, Synthetics, and APM. Endpoint is not affected since Agent downloads the version that corresponds to the integration package version. I also believe Synthetics is not affected in practice because Heartbeat has not traditionally had any breaking changes in schema.
This change will not actually enforce that sysadmins do not upgrade Fleet Server or Elastic Agent before Kibana has upgraded the integration packages, so it is not a guaranteed improvement, but a probabilistic one. The window of time between Kibana's UI being available and Fleet setup completing is quite small (we're talking ~20s in the average case, possibly 5 minutes in the worst case), so the window of time a user could hit this scenario is narrow.
It's also worth noting that this change will affect all users of the Stack, whether or not they're using Fleet yet. I think the risks of breaking Kibana for this large population is higher than the risk we're currently taking on by having this small window of time open where it's not obvious to a sysadmin whether or not they can start upgrading the ingest components. I'd like some feedback from affected parties, namely:
@joshdover +1 to delay it; the changes in the setup logic have not been easy, and we will improve.
@joshdover +1 from me to delay as well, and revisit the need for it later. In general we'd like to avoid opening up a mechanism like this from core unless it is a last resort. As you mention, this is something which would affect all Stack users... so I'd prefer to exhaust all other possible options before taking this step.
APM Server has implemented a check for
@simitt To be clear, do we not ever need this, or is it just not as pressing? My understanding was that it's possible APM could get backed up and start dropping traces if there's a delay in completing Fleet setup for some reason. By blocking, we'd be able to more easily enable orchestration layers like Cloud and ECK to delay upgrading Agents or APM Server until Fleet setup has completed in Kibana. Alternatively, we could start publishing a degraded or unavailable health status from Kibana's status API. However, today Cloud does not use this endpoint to decide when to start upgrading the other components, so we'd need to make some changes there.
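On the alternative mentioned above: Kibana core exposes a status service that plugins can use to publish their own health status, which feeds into the overall status reported by GET /api/status. The sketch below is only a rough illustration of that idea, not Fleet's actual implementation; import paths and type names vary between Kibana versions.

```ts
import { BehaviorSubject } from 'rxjs';
// Import paths are version-dependent; older Kibana versions exported these
// from 'src/core/server', newer ones from '@kbn/core/server'.
import type { CoreSetup, Plugin, ServiceStatus } from '@kbn/core/server';
import { ServiceStatusLevels } from '@kbn/core/server';

// Rough illustration: report "degraded" from this plugin until Fleet setup
// completes, instead of blocking the HTTP server. Not the approach that was
// ultimately implemented for this issue.
export class FleetSetupStatusPlugin implements Plugin {
  private readonly status$ = new BehaviorSubject<ServiceStatus>({
    level: ServiceStatusLevels.degraded,
    summary: 'Fleet setup has not completed yet',
  });

  public setup(core: CoreSetup) {
    // core.status.set() lets a plugin publish a custom status observable.
    core.status.set(this.status$);
  }

  public start() {
    // Simplified for the sketch: mark the plugin available once Fleet setup
    // has finished elsewhere in the plugin.
    this.status$.next({
      level: ServiceStatusLevels.available,
      summary: 'Fleet setup complete',
    });
  }

  public stop() {}
}
```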
Silvia is out sick, so I'll take a stab at answering. The check that Silvia mentioned in #120616 (comment) does mean that ingestion will be blocked and start dropping data if installing/upgrading the APM package doesn't happen in a timely manner. I think we would still like this eventually, but not as urgently.
In order to provide a smoother upgrade experience for users of Fleet, we would like to block Kibana startup to install the necessary ingest assets into Elasticsearch. The purpose of blocking is to give admins a clear point in time at which it's safe to start upgrading other components that depend on Fleet ingest assets being upgraded, such as Fleet Server and Elastic Agent.
In order to make this change as non-disruptive as possible, we want to ensure:
To that end, before we start blocking Kibana startup we need to complete the following tasks:
The scope of this issue is to make the following changes in a single PR:
- Add a new `xpack.fleet.setup.max_retries` config that defaults to 5
- Retry Fleet setup up to `max_retries` attempts, then throw an exception and crash Kibana (see the sketch below)
- Remove the call to the `/api/fleet/setup` API from UI code

Optionally, we'd like to improve package install and upgrade performance to minimize the impact of blocking Kibana startup. While this is likely not to be considered a blocker, slow installs may make upgrades more painful or confusing to users. The primary bottleneck in this process is in Elasticsearch, which elastic/elasticsearch#77505 may solve.
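A minimal sketch of the retry-then-crash behavior described in the list above. `runFleetSetup` is a hypothetical stand-in for Fleet's real setup routine, and the linear delay between attempts is an assumption; the issue does not specify a backoff strategy.

```ts
// Minimal sketch, not the shipped implementation. `runFleetSetup` is a
// hypothetical stand-in for Fleet's real setup routine; `maxRetries`
// corresponds to the proposed xpack.fleet.setup.max_retries setting
// (default 5).
async function setupWithRetries(
  runFleetSetup: () => Promise<void>,
  maxRetries = 5
): Promise<void> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      await runFleetSetup();
      return; // Setup succeeded; allow Kibana startup to proceed.
    } catch (error) {
      lastError = error;
      // Simple linear delay between attempts (assumed for the sketch).
      await new Promise((resolve) => setTimeout(resolve, 1_000 * attempt));
    }
  }
  // Rethrowing here is what would crash Kibana on startup.
  throw lastError;
}
```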