
Perform successful Elasticsearch version check before migrations #51311

Merged: 21 commits merged into elastic:master from healthcheck-before-migrations on Jan 31, 2020

Conversation

rudolf
Contributor

@rudolf rudolf commented Nov 21, 2019

Summary

Fixes #49785, #14480

Testing notes:

  1. When starting Kibana against a cluster with an unsupported ES node, Kibana should log:

    Waiting until all Elasticsearch nodes are compatible with Kibana before starting saved objects migrations...

    but it should not start saved object migrations, or log the following, until all the nodes are compatible:

    Starting saved objects migrations

  2. If Kibana has successfully started with a compatible ES cluster but an incompatible ES node then joins the cluster, Kibana's "elasticsearch" plugin should go into a red state, and reloading Kibana in a browser should render the status page.

Release note:

This fix addresses a regression where Kibana would not check that all Elasticsearch nodes are compatible before starting Saved Object migrations.


@rudolf rudolf added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Feature:New Platform v8.0.0 v7.5.0 v7.6.0 labels Nov 21, 2019
@elasticmachine
Contributor

Pinging @elastic/kibana-platform (Team:Platform)

@elasticmachine
Contributor

💔 Build Failed

Contributor

@pgayvallet pgayvallet left a comment

I saw too late that this was only a draft! I kept only my nits on the current progress.

@rudolf rudolf removed the v7.5.0 label Nov 25, 2019
@joshdover
Contributor

@rudolf What's the status here?

@rudolf rudolf added v7.7.0 and removed v7.6.0 labels Jan 15, 2020
@rudolf rudolf force-pushed the healthcheck-before-migrations branch from 5c91f29 to 07829f8 on January 22, 2020 15:19
@rudolf rudolf force-pushed the healthcheck-before-migrations branch from 07829f8 to bd49618 on January 24, 2020 14:21
@rudolf rudolf force-pushed the healthcheck-before-migrations branch from bd49618 to e2d6157 on January 24, 2020 14:53
this.logger.debug(
  'Waiting until all Elasticsearch nodes are compatible with Kibana before starting saved objects migrations...'
);
await this.setupDeps!.elasticsearch.esNodesCompatibility$.pipe(
Contributor Author

This behaviour isn't 100% the same as what it was in legacy. In legacy we would start the status service, so even though migrations wouldn't run, there would be a running server showing that the Elasticsearch plugin was red, with the reason, which helps surface the underlying problem. Once we have a status service in NP we should aim to create similar behaviour.

@rudolf rudolf marked this pull request as ready for review January 26, 2020 21:18
@rudolf rudolf requested a review from a team as a code owner January 26, 2020 21:18
@rudolf rudolf added the v7.6.1 label Jan 26, 2020
Contributor

@pgayvallet pgayvallet left a comment

#49785 states:

    Although the reasons for abandoning the health check still stand, we will have to keep polling to do the version check since new Elasticsearch nodes can join an existing cluster after Kibana has started up.

In the current PR we only wait once for ES to be ready before triggering some actions; further state changes do nothing. Do we know what we plan to do in a scenario like:

  • red -> green (triggers start of SO + legacy's waitUntilReady) -> red (atm does nothing) -> green (atm does nothing)

src/core/server/elasticsearch/elasticsearch_config.ts (outdated review comment, resolved)
Comment on lines 24 to 36
esNodesCompatibility$.subscribe(({ isCompatible, message, kibanaVersion, warningNodes }) => {
  if (!isCompatible) {
    esPlugin.status.red(message);
  } else {
    if (message && message.length > 0) {
      logWithMetadata(['warning'], message, {
        kibanaVersion,
        nodes: warningNodes,
      });
    }
    esPlugin.status.green('Ready');
    resolve();
  }
Contributor

Should we unsubscribe after the first resolve call to avoid wrongly re-calling resolve in the case of green -> red -> green?

Contributor Author

We want to keep updating the status, so we need the subscription. Although resolve() should only be called once, calling it multiple times is a no-op, so it won't cause any problems.
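
To illustrate the no-op point, here is a minimal standalone sketch (a hypothetical example, not the actual Kibana source): the promise settles on the first compatible emission, later resolve() calls are silently ignored, and the subscription keeps driving status updates.

    import { Subject } from 'rxjs';

    // Hypothetical stand-in for the real esNodesCompatibility$ observable.
    const compatibility$ = new Subject<{ isCompatible: boolean }>();

    const ready = new Promise<void>(resolve => {
      compatibility$.subscribe(({ isCompatible }) => {
        // The status is updated on every emission...
        console.log(isCompatible ? 'status: green' : 'status: red');
        // ...but the promise can only settle once; extra resolve()
        // calls are no-ops per the Promise specification.
        if (isCompatible) resolve();
      });
    });

    compatibility$.next({ isCompatible: true });  // green -> promise resolves
    compatibility$.next({ isCompatible: false }); // red   -> status update only
    compatibility$.next({ isCompatible: true });  // green -> resolve() is a no-op

    ready.then(() => console.log('first compatible state reached'));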

Comment on lines +294 to +297
await this.setupDeps!.elasticsearch.esNodesCompatibility$.pipe(
  filter(nodes => nodes.isCompatible),
  take(1)
).toPromise();
Contributor

  1. Except in the legacy code you adapted, we don't display any info message telling the user that we are waiting (maybe indefinitely) for ES to be ready?
  2. Maybe we should add a timeout and throw a fatal error after some time? Or do we expect Kibana to hang indefinitely waiting for this condition?
  3. Should this check be done at a higher level (thinking of the Server)? It seems to me that waiting for ES to be ready is a higher responsibility than the SOService should handle.

Contributor Author

  1. Except in the legacy code you adapted, we don't display any info message telling the user that we are waiting (maybe indefinitely) for ES to be ready?

I've changed the log message to the info level to indicate that we're waiting for ES and when we're starting migrations.

  2. Maybe we should add a timeout and throw a fatal error after some time? Or do we expect Kibana to hang indefinitely waiting for this condition?

The existing behaviour is to wait indefinitely. It could take a day before a faulty cluster is fixed; in such a case, I think it's nice if Kibana just starts working again automatically.

  3. Should this check be done at a higher level (thinking of the Server)? It seems to me that waiting for ES to be ready is a higher responsibility than the SOService should handle.

I don't have a strong opinion, but I think if the SO Service has some dependency on an external condition then the logic to wait for that condition belongs in the SO Service. This is a minor point, but when it comes to logging tags it might make it easier to see that these logs are related if they all have the same tags, rather than some being tagged server and others savedobjects-service.

Contributor

I would say it might be good to repeat this message on an interval, but I wouldn't consider that a blocker for this PR.
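
A rough sketch of what repeating the message on an interval could look like (illustrative only; logger, esNodesCompatibility$, and the 30s interval are assumptions, not code from this PR):

    import { interval, Subject } from 'rxjs';
    import { filter, take, takeUntil } from 'rxjs/operators';

    // Hypothetical stand-ins for the real dependencies.
    declare const esNodesCompatibility$: Subject<{ isCompatible: boolean }>;
    declare const logger: { info(msg: string): void };

    // Completes on the first compatible emission.
    const compatible$ = esNodesCompatibility$.pipe(
      filter(nodes => nodes.isCompatible),
      take(1)
    );

    // Re-log the waiting message every 30s until compatibility is reached.
    interval(30_000)
      .pipe(takeUntil(compatible$))
      .subscribe(() =>
        logger.info(
          'Waiting until all Elasticsearch nodes are compatible with Kibana before starting saved objects migrations...'
        )
      );

    compatible$
      .toPromise()
      .then(() => logger.info('Starting saved objects migrations'));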

Contributor

    I don't have a strong opinion, but I think if the SO Service has some dependency on an external condition then the logic to wait for that condition belongs in the SO Service. This is a minor point, but when it comes to logging tags it might make it easier to see that these logs are related if they all have the same tags, rather than some being tagged server and others savedobjects-service.

I think it's fine that we put this in the SO service until/if there are other Core services that require this as well.

@rudolf
Contributor Author

rudolf commented Jan 31, 2020

After further testing I realised there were two incorrect behaviours:

  1. pollEsNodesVersion would not start polling immediately; the first poll was scheduled to happen only after esVersionCheckInterval. For a long interval like 30s, this delays Kibana's startup by an additional 30s.
  2. Because pollEsNodesVersion used a switchMap, there's a risk that we'll bombard ES with queries if the nodes.info response time exceeds esVersionCheckInterval. Since this is an expensive API call, there's a risk that we'll suddenly make a slow cluster much slower. I've changed the implementation to use an exhaustMap so we won't schedule new requests until the previous request has resolved (see the sketch below).
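
A sketch of the resulting polling shape (simplified; timer and exhaustMap are the actual RxJS operators described above, while checkEsNodesVersions and the config value are illustrative stand-ins, not the real Kibana internals):

    import { timer } from 'rxjs';
    import { exhaustMap } from 'rxjs/operators';

    // Hypothetical stand-ins for the real config value and version check.
    declare const esVersionCheckInterval: number;
    declare function checkEsNodesVersions(): Promise<{ isCompatible: boolean }>;

    // timer(0, n) emits immediately, so the first version check runs at
    // startup rather than only after the first interval elapses (fix 1).
    const esNodesCompatibility$ = timer(0, esVersionCheckInterval).pipe(
      // exhaustMap ignores interval ticks while the previous nodes.info
      // request is still in flight, so a slow cluster never receives
      // overlapping requests (fix 2). switchMap, by contrast, would cancel
      // and re-issue the request on every tick.
      exhaustMap(() => checkEsNodesVersions())
    );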

@kibanamachine
Contributor

💚 Build Succeeded


Contributor

@pgayvallet pgayvallet left a comment

LGTM

@@ -29,7 +29,7 @@ let mockDefaultRouteSetting: any = '';
describe('default route provider', () => {
let root: Root;
beforeAll(async () => {
-    root = kbnTestServer.createRoot();
+    root = kbnTestServer.createRoot({ migrations: { skip: true } });
Contributor

NIT: It seems you adapted every call to createRoot to add this. Should we set migrations: { skip: true } as a default in kbnTestServer.createRoot?

Contributor Author

Some tests run with esArchiver, and those need to apply migrations. Ideally we shouldn't disable migrations but instead disable the ES version check itself. There is elasticsearch.ignoreVersionMismatch, but it's only available in development and our integration tests run against production. We could just make this option available in production, but I think it warrants a bigger discussion, so I created #56505.

@tylersmalley tylersmalley merged commit f1068cd into elastic:master Jan 31, 2020
tylersmalley pushed a commit to tylersmalley/kibana that referenced this pull request Jan 31, 2020
Perform successful Elasticsearch version check before migrations (elastic#51311)

* Convert parts of Elasticsearch version check to ts
* Move ES version check to NP
* Improve types
* Wait for compatible ES nodes before SO migrations
* Don't wait for ES compatibility if skipMigrations=true
* Legacy Elasticsearch plugin integration test
* Make ES compatibility check and migrations logging more visible
* Test for isCompatible=false when ES version check throws
* Start pollEsNodesVersion immediately
* Refactor pollEsNodesVersion
@tylersmalley
Contributor

@rudolf I have backported to 7.x, but 7.6 is a bit more challenging as the tests rely on saved_objects_service.test.mocks, which was added in #55012, which in turn relies on #55156; both are not present in 7.6. I think it would be best for you to handle the conflict resolution here to ensure there are no errors.

rudolf added a commit that referenced this pull request Feb 3, 2020
Perform successful Elasticsearch version check before migrations (#51311) (#56600)

rudolf added a commit that referenced this pull request Feb 3, 2020
Perform successful Elasticsearch version check before migrations (#51311) (#56629)

@rudolf rudolf deleted the healthcheck-before-migrations branch February 4, 2020 10:24