Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use correct cluster state version for node fault detection #30810

Merged
merged 2 commits into from
May 23, 2018

Conversation

ywelsch
Copy link
Contributor

@ywelsch ywelsch commented May 23, 2018

Since its introduction in ES 1.4, node fault detection has been using the wrong cluster state version to send as part of the ping request, by using always the constant -1 (ClusterState.UNKNOWN_VERSION). This can, in an unfortunate series of events, lead to a situation where a previous stale master can regain its authority and revert the cluster to an older state.

This PR makes NodesFaultDetection use the correct current cluster state for sending ping requests, avoiding the situation where a stale master possibly forces a newer master to step down and rejoin the stale one.

@ywelsch ywelsch added >bug v7.0.0 v6.3.0 :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v5.6.10 v6.4.0 labels May 23, 2018
@ywelsch ywelsch requested review from bleskes and DaveCTurner May 23, 2018 12:36
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, good catch. I am assuming that calling rejoin() is basically fine at any point in time. It looks like it doesn't have any terribly complicated preconditions, but this stuff is subtle.

Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. It's clear case of "doh". good catch.

@ywelsch ywelsch merged commit cfa2b06 into elastic:master May 23, 2018
ywelsch added a commit that referenced this pull request May 23, 2018
Since its introduction in ES 1.4, node fault detection has been using the wrong cluster state version to send
as part of the ping request, by using always the constant -1 (ClusterState.UNKNOWN_VERSION). This can, in an
unfortunate series of events, lead to a situation where a previous stale master can regain its authority and
revert the cluster to an older state.

This commit makes NodesFaultDetection use the correct current cluster state for sending ping requests, avoiding
the situation where a stale master possibly forces a newer master to step down and rejoin the stale one.
ywelsch added a commit that referenced this pull request May 23, 2018
Since its introduction in ES 1.4, node fault detection has been using the wrong cluster state version to send
as part of the ping request, by using always the constant -1 (ClusterState.UNKNOWN_VERSION). This can, in an
unfortunate series of events, lead to a situation where a previous stale master can regain its authority and
revert the cluster to an older state.

This commit makes NodesFaultDetection use the correct current cluster state for sending ping requests, avoiding
the situation where a stale master possibly forces a newer master to step down and rejoin the stale one.
ywelsch added a commit that referenced this pull request May 23, 2018
Since its introduction in ES 1.4, node fault detection has been using the wrong cluster state version to send
as part of the ping request, by using always the constant -1 (ClusterState.UNKNOWN_VERSION). This can, in an
unfortunate series of events, lead to a situation where a previous stale master can regain its authority and
revert the cluster to an older state.

This commit makes NodesFaultDetection use the correct current cluster state for sending ping requests, avoiding
the situation where a stale master possibly forces a newer master to step down and rejoin the stale one.
@ywelsch
Copy link
Contributor Author

ywelsch commented May 23, 2018

Thanks @DaveCTurner @bleskes

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request May 23, 2018
* master:
  [DOCS] Splits auditing.asciidoc into smaller files
  Reintroduce mandatory http pipelining support (elastic#30820)
  Painless: Types Section Clean Up (elastic#30283)
  Add support for indexed shape routing in geo_shape query (elastic#30760)
  [test] java tests for archive packaging (elastic#30734)
  Revert "Make http pipelining support mandatory (elastic#30695)" (elastic#30813)
  [DOCS] Fix more edit URLs in Stack Overview (elastic#30704)
  Use correct cluster state version for node fault detection (elastic#30810)
  Change serialization version of doc-value fields.
  [DOCS] Fixes broken link for native realm
  [DOCS] Clarified audit.index.client.hosts (elastic#30797)
  [TEST] Don't expect acks when isolating nodes
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request May 23, 2018
* master: (25 commits)
  [DOCS] Splits auditing.asciidoc into smaller files
  Reintroduce mandatory http pipelining support (elastic#30820)
  Painless: Types Section Clean Up (elastic#30283)
  Add support for indexed shape routing in geo_shape query (elastic#30760)
  [test] java tests for archive packaging (elastic#30734)
  Revert "Make http pipelining support mandatory (elastic#30695)" (elastic#30813)
  [DOCS] Fix more edit URLs in Stack Overview (elastic#30704)
  Use correct cluster state version for node fault detection (elastic#30810)
  Change serialization version of doc-value fields.
  [DOCS] Fixes broken link for native realm
  [DOCS] Clarified audit.index.client.hosts (elastic#30797)
  [TEST] Don't expect acks when isolating nodes
  Add a `format` option to `docvalue_fields`. (elastic#29639)
  Fixes UpdateSettingsRequestStreamableTests mutate bug
  Mustes {p0=snapshot.get_repository/10_basic/*} YAML test
  Revert "Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled"
  Only allow x-pack metadata if all nodes are ready (elastic#30743)
  Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled
  Use original settings on full-cluster restart (elastic#30780)
  Only ack cluster state updates successfully applied on all nodes (elastic#30672)
  ...
dnhatn added a commit that referenced this pull request May 24, 2018
* 6.x:
  [DOCS] Fixes typos in security settings
  Add support for indexed shape routing in geo_shape query (#30760)
  [DOCS] Splits auditing.asciidoc into smaller files
  Painless: Types Section Clean Up (#30283)
  [test] java tests for archive packaging (#30734)
  Deprecate http.pipelining setting (#30786)
  [DOCS] Fix more edit URLs in Stack Overview (#30704)
  Use correct cluster state version for node fault detection (#30810)
  [DOCS] Fixes broken link for native realm
  [DOCS] Clarified audit.index.client.hosts (#30797)
  Change serialization version of doc-value fields.
  Add a `format` option to `docvalue_fields`. (#29639)
  [TEST] Don't expect acks when isolating nodes
  Fixes UpdateSettingsRequestStreamableTests mutate bug
  Revert "Add more yaml tests for get alias API (#29513)"
  Revert "Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled"
  Only allow x-pack metadata if all nodes are ready (#30743)
  Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled
  Use original settings on full-cluster restart (#30780)
  Only ack cluster state updates successfully applied on all nodes (#30672)
  Replace Request#setHeaders with addHeader (#30588)
  [TEST] remove endless wait in RestClientTests (#30776)
  QA: Add xpack tests to rolling upgrade (#30795)
  Add support for search templates to the high-level REST client. (#30473)
  Reduce CLI scripts to one-liners on Windows (#30772)
  Fold RestGetAllSettingsAction in RestGetSettingsAction (#30561)
  Add more yaml tests for get alias API (#29513)
  [Docs] Fix script-fields snippet execution (#30693)
  Convert FieldCapabilitiesResponse to a ToXContentObject. (#30182)
  Remove assert statements from field caps documentation. (#30601)
  Fix a bug in FieldCapabilitiesRequest#equals and hashCode. (#30181)
  Add support for field capabilities to the high-level REST client. (#29664)
  [DOCS] Add SAML configuration information (#30548)
  [DOCS] Remove X-Pack references from SQL CLI (#30694)
  [Docs] Fix typo in circuit breaker docs (#29659)
  [Feature] Adding a char_group tokenizer (#24186)
  Increase the maximum number of filters that may be in the cache. (#30655)
  [Docs] Fix broken cross link in documentation
  Test: wait for netty threads in a JUnit ClassRule (#30763)
  [Security] Include an empty json object in an json array when FLS filters out all fields (#30709)
  [DOCS] fixed incorrect default
  [TEST] Wait for CS to be fully applied in testDeleteCreateInOneBulk
  Enable installing plugins from snapshots.elastic.co (#30765)
  Ignore empty completion input (#30713)
  Fix docs failure on language analyzers (#30722)
  [Docs] Fix inconsistencies in snapshot/restore doc (#30480)
  Add Delete Repository High Level REST API (#30666)
  Reduce CLI scripts to one-liners (#30759)
dnhatn added a commit that referenced this pull request May 24, 2018
* master:
  [DOCS] Fixes typos in security settings
  Fix GeoShapeQueryBuilder serialization after backport
  [DOCS] Splits auditing.asciidoc into smaller files
  Reintroduce mandatory http pipelining support (#30820)
  Painless: Types Section Clean Up (#30283)
  Add support for indexed shape routing in geo_shape query (#30760)
  [test] java tests for archive packaging (#30734)
  Revert "Make http pipelining support mandatory (#30695)" (#30813)
  [DOCS] Fix more edit URLs in Stack Overview (#30704)
  Use correct cluster state version for node fault detection (#30810)
  Change serialization version of doc-value fields.
  [DOCS] Fixes broken link for native realm
  [DOCS] Clarified audit.index.client.hosts (#30797)
  [TEST] Don't expect acks when isolating nodes
  Add a `format` option to `docvalue_fields`. (#29639)
  Fixes UpdateSettingsRequestStreamableTests mutate bug
  Mustes {p0=snapshot.get_repository/10_basic/*} YAML test
  Revert "Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled"
  Only allow x-pack metadata if all nodes are ready (#30743)
  Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled
  Use original settings on full-cluster restart (#30780)
  Only ack cluster state updates successfully applied on all nodes (#30672)
  Expose Lucene's FeatureField. (#30618)
  Fix a grammatical error in the 'search types' documentation.
  Remove http pipelining from integration test case (#30788)
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request May 25, 2018
* es/ccr: (55 commits)
  [DOCS] Fixes typos in security settings
  Fix GeoShapeQueryBuilder serialization after backport
  [DOCS] Splits auditing.asciidoc into smaller files
  Reintroduce mandatory http pipelining support (elastic#30820)
  Painless: Types Section Clean Up (elastic#30283)
  Add support for indexed shape routing in geo_shape query (elastic#30760)
  [test] java tests for archive packaging (elastic#30734)
  Revert "Make http pipelining support mandatory (elastic#30695)" (elastic#30813)
  [DOCS] Fix more edit URLs in Stack Overview (elastic#30704)
  Use correct cluster state version for node fault detection (elastic#30810)
  Change serialization version of doc-value fields.
  [DOCS] Fixes broken link for native realm
  [DOCS] Clarified audit.index.client.hosts (elastic#30797)
  [TEST] Don't expect acks when isolating nodes
  Mute CorruptedFileIT in CCR
  Add a `format` option to `docvalue_fields`. (elastic#29639)
  Fixes UpdateSettingsRequestStreamableTests mutate bug
  Mustes {p0=snapshot.get_repository/10_basic/*} YAML test
  Revert "Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled"
  Only allow x-pack metadata if all nodes are ready (elastic#30743)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v5.6.10 v6.3.0 v6.4.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants