-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use correct cluster state version for node fault detection #30810
Conversation
Pinging @elastic/es-distributed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, good catch. I am assuming that calling rejoin()
is basically fine at any point in time. It looks like it doesn't have any terribly complicated preconditions, but this stuff is subtle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. It's clear case of "doh". good catch.
Since its introduction in ES 1.4, node fault detection has been using the wrong cluster state version to send as part of the ping request, by using always the constant -1 (ClusterState.UNKNOWN_VERSION). This can, in an unfortunate series of events, lead to a situation where a previous stale master can regain its authority and revert the cluster to an older state. This commit makes NodesFaultDetection use the correct current cluster state for sending ping requests, avoiding the situation where a stale master possibly forces a newer master to step down and rejoin the stale one.
Since its introduction in ES 1.4, node fault detection has been using the wrong cluster state version to send as part of the ping request, by using always the constant -1 (ClusterState.UNKNOWN_VERSION). This can, in an unfortunate series of events, lead to a situation where a previous stale master can regain its authority and revert the cluster to an older state. This commit makes NodesFaultDetection use the correct current cluster state for sending ping requests, avoiding the situation where a stale master possibly forces a newer master to step down and rejoin the stale one.
Since its introduction in ES 1.4, node fault detection has been using the wrong cluster state version to send as part of the ping request, by using always the constant -1 (ClusterState.UNKNOWN_VERSION). This can, in an unfortunate series of events, lead to a situation where a previous stale master can regain its authority and revert the cluster to an older state. This commit makes NodesFaultDetection use the correct current cluster state for sending ping requests, avoiding the situation where a stale master possibly forces a newer master to step down and rejoin the stale one.
Thanks @DaveCTurner @bleskes |
* master: [DOCS] Splits auditing.asciidoc into smaller files Reintroduce mandatory http pipelining support (elastic#30820) Painless: Types Section Clean Up (elastic#30283) Add support for indexed shape routing in geo_shape query (elastic#30760) [test] java tests for archive packaging (elastic#30734) Revert "Make http pipelining support mandatory (elastic#30695)" (elastic#30813) [DOCS] Fix more edit URLs in Stack Overview (elastic#30704) Use correct cluster state version for node fault detection (elastic#30810) Change serialization version of doc-value fields. [DOCS] Fixes broken link for native realm [DOCS] Clarified audit.index.client.hosts (elastic#30797) [TEST] Don't expect acks when isolating nodes
* master: (25 commits) [DOCS] Splits auditing.asciidoc into smaller files Reintroduce mandatory http pipelining support (elastic#30820) Painless: Types Section Clean Up (elastic#30283) Add support for indexed shape routing in geo_shape query (elastic#30760) [test] java tests for archive packaging (elastic#30734) Revert "Make http pipelining support mandatory (elastic#30695)" (elastic#30813) [DOCS] Fix more edit URLs in Stack Overview (elastic#30704) Use correct cluster state version for node fault detection (elastic#30810) Change serialization version of doc-value fields. [DOCS] Fixes broken link for native realm [DOCS] Clarified audit.index.client.hosts (elastic#30797) [TEST] Don't expect acks when isolating nodes Add a `format` option to `docvalue_fields`. (elastic#29639) Fixes UpdateSettingsRequestStreamableTests mutate bug Mustes {p0=snapshot.get_repository/10_basic/*} YAML test Revert "Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled" Only allow x-pack metadata if all nodes are ready (elastic#30743) Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled Use original settings on full-cluster restart (elastic#30780) Only ack cluster state updates successfully applied on all nodes (elastic#30672) ...
* 6.x: [DOCS] Fixes typos in security settings Add support for indexed shape routing in geo_shape query (#30760) [DOCS] Splits auditing.asciidoc into smaller files Painless: Types Section Clean Up (#30283) [test] java tests for archive packaging (#30734) Deprecate http.pipelining setting (#30786) [DOCS] Fix more edit URLs in Stack Overview (#30704) Use correct cluster state version for node fault detection (#30810) [DOCS] Fixes broken link for native realm [DOCS] Clarified audit.index.client.hosts (#30797) Change serialization version of doc-value fields. Add a `format` option to `docvalue_fields`. (#29639) [TEST] Don't expect acks when isolating nodes Fixes UpdateSettingsRequestStreamableTests mutate bug Revert "Add more yaml tests for get alias API (#29513)" Revert "Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled" Only allow x-pack metadata if all nodes are ready (#30743) Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled Use original settings on full-cluster restart (#30780) Only ack cluster state updates successfully applied on all nodes (#30672) Replace Request#setHeaders with addHeader (#30588) [TEST] remove endless wait in RestClientTests (#30776) QA: Add xpack tests to rolling upgrade (#30795) Add support for search templates to the high-level REST client. (#30473) Reduce CLI scripts to one-liners on Windows (#30772) Fold RestGetAllSettingsAction in RestGetSettingsAction (#30561) Add more yaml tests for get alias API (#29513) [Docs] Fix script-fields snippet execution (#30693) Convert FieldCapabilitiesResponse to a ToXContentObject. (#30182) Remove assert statements from field caps documentation. (#30601) Fix a bug in FieldCapabilitiesRequest#equals and hashCode. (#30181) Add support for field capabilities to the high-level REST client. (#29664) [DOCS] Add SAML configuration information (#30548) [DOCS] Remove X-Pack references from SQL CLI (#30694) [Docs] Fix typo in circuit breaker docs (#29659) [Feature] Adding a char_group tokenizer (#24186) Increase the maximum number of filters that may be in the cache. (#30655) [Docs] Fix broken cross link in documentation Test: wait for netty threads in a JUnit ClassRule (#30763) [Security] Include an empty json object in an json array when FLS filters out all fields (#30709) [DOCS] fixed incorrect default [TEST] Wait for CS to be fully applied in testDeleteCreateInOneBulk Enable installing plugins from snapshots.elastic.co (#30765) Ignore empty completion input (#30713) Fix docs failure on language analyzers (#30722) [Docs] Fix inconsistencies in snapshot/restore doc (#30480) Add Delete Repository High Level REST API (#30666) Reduce CLI scripts to one-liners (#30759)
* master: [DOCS] Fixes typos in security settings Fix GeoShapeQueryBuilder serialization after backport [DOCS] Splits auditing.asciidoc into smaller files Reintroduce mandatory http pipelining support (#30820) Painless: Types Section Clean Up (#30283) Add support for indexed shape routing in geo_shape query (#30760) [test] java tests for archive packaging (#30734) Revert "Make http pipelining support mandatory (#30695)" (#30813) [DOCS] Fix more edit URLs in Stack Overview (#30704) Use correct cluster state version for node fault detection (#30810) Change serialization version of doc-value fields. [DOCS] Fixes broken link for native realm [DOCS] Clarified audit.index.client.hosts (#30797) [TEST] Don't expect acks when isolating nodes Add a `format` option to `docvalue_fields`. (#29639) Fixes UpdateSettingsRequestStreamableTests mutate bug Mustes {p0=snapshot.get_repository/10_basic/*} YAML test Revert "Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled" Only allow x-pack metadata if all nodes are ready (#30743) Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled Use original settings on full-cluster restart (#30780) Only ack cluster state updates successfully applied on all nodes (#30672) Expose Lucene's FeatureField. (#30618) Fix a grammatical error in the 'search types' documentation. Remove http pipelining from integration test case (#30788)
* es/ccr: (55 commits) [DOCS] Fixes typos in security settings Fix GeoShapeQueryBuilder serialization after backport [DOCS] Splits auditing.asciidoc into smaller files Reintroduce mandatory http pipelining support (elastic#30820) Painless: Types Section Clean Up (elastic#30283) Add support for indexed shape routing in geo_shape query (elastic#30760) [test] java tests for archive packaging (elastic#30734) Revert "Make http pipelining support mandatory (elastic#30695)" (elastic#30813) [DOCS] Fix more edit URLs in Stack Overview (elastic#30704) Use correct cluster state version for node fault detection (elastic#30810) Change serialization version of doc-value fields. [DOCS] Fixes broken link for native realm [DOCS] Clarified audit.index.client.hosts (elastic#30797) [TEST] Don't expect acks when isolating nodes Mute CorruptedFileIT in CCR Add a `format` option to `docvalue_fields`. (elastic#29639) Fixes UpdateSettingsRequestStreamableTests mutate bug Mustes {p0=snapshot.get_repository/10_basic/*} YAML test Revert "Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled" Only allow x-pack metadata if all nodes are ready (elastic#30743) ...
Since its introduction in ES 1.4, node fault detection has been using the wrong cluster state version to send as part of the ping request, by using always the constant
-1
(ClusterState.UNKNOWN_VERSION
). This can, in an unfortunate series of events, lead to a situation where a previous stale master can regain its authority and revert the cluster to an older state.This PR makes
NodesFaultDetection
use the correct current cluster state for sending ping requests, avoiding the situation where a stale master possibly forces a newer master to step down and rejoin the stale one.