Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bootstrap a Zen2 cluster once quorum is discovered #37463

Conversation

DaveCTurner
Copy link
Contributor

@DaveCTurner DaveCTurner commented Jan 15, 2019

Today when bootstrapping a Zen2 cluster we wait for every node in the
initial_master_nodes setting to be discovered, so that we can map the
node names or addresses in the initial_master_nodes list to their IDs for
inclusion in the initial voting configuration. This means that if any of
the expected master-eligible nodes fails to start then bootstrapping will
not occur and the cluster will not form. This is not ideal, and we would
prefer the cluster to bootstrap even if some of the master-eligible nodes
do not start.

Safe bootstrapping requires that all pairs of quorums of all initial
configurations overlap, and this is particularly troublesome to ensure
given that nodes may be concurrently and independently attempting to
bootstrap the cluster. The solution is to bootstrap using an initial
configuration whose size matches the size of the expected set of
master-eligible nodes, but with the unknown IDs replaced by "placeholder"
IDs that can never belong to any node. Any quorum of received votes in any
of these placeholder-laden initial configurations is also a quorum of the
"true" initial set of master-eligible nodes, giving the guarantee that it
intersects all other quorums as required.

Note that this change means that the initial configuration is not
necessarily robust to any node failures. Normally the cluster will form and
then auto-reconfigure to a more robust configuration in which the
placeholder IDs are replaced by the IDs of genuine nodes as they join the
cluster; however if a node fails between bootstrapping and this
auto-reconfiguration then the cluster may become unavailable. This we feel
to be less likely than a node failing to start at all.

This commit also enormously simplifies the cluster bootstrapping process.
Today, the cluster bootstrapping process involves two (local) transport actions
in order to support a flexible bootstrapping API and to make it easily
accessible to plugins. However this flexibility is not required for the current
design so it is adding a good deal of unnecessary complexity. Here we remove
this complexity in favour of a much simpler ClusterBootstrapService
implementation that does all the work itself.

Today, the cluster bootstrapping process involves two (local) transport actions
in order to support a flexible bootstrapping API and to make it easily
accessible to plugins.

However this flexibility is not required for the current design so it is adding
a good deal of unnecessary complexity. This commit removes this complexity in
favour of a much simpler ClusterBootstrapService implementation that does all
the work itself.
@DaveCTurner DaveCTurner added >non-issue v7.0.0 :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Jan 15, 2019
@DaveCTurner DaveCTurner requested a review from ywelsch January 15, 2019 10:03
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Copy link
Contributor Author

The merge conflict is trivial - the file can be deleted - but I am not using the current HEAD of master since it is not currently passing CI. I'll merge a more recent master once things go green there again.

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the simplification here. I've left some comments on ClusterBootstrapService.


private void startBootstrap(Set<DiscoveryNode> discoveryNodes) {
if (bootstrappingPermitted.compareAndSet(true, false)) {
doBootstrap(new VotingConfiguration(discoveryNodes.stream().map(DiscoveryNode::getId).collect(Collectors.toSet())));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused. Why does it put all discovered nodes into the bootstrapping configuration, and not only those that are actually matching the initialMasterNodes requirements? For the best-effort bootstrapping I understand, but for the settings-based bootstrapping, this is a bit unexpected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We know these nodes are live (the PeerFinder heard from them within the last few seconds) and will reconfigure to an optimal configuration as soon as the master is elected, so I don't think it makes much difference. More master nodes is generally more robust, right?

Copy link
Contributor

@ywelsch ywelsch Jan 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might also be unsafe. Assume I start up 5 nodes (A,B,C,D,E) with setting initial_master_nodes to A. Then it's possible that B will bootstrap with A,B,C and that D will bootstrap with A,D,E, both of which can have a non-overlapping quorum (B,C) and (D,E). I don't think we should do this in the case of using initialMasterNodes and stick exactly to the formal model.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, you're quite right. In fact it's unsafe with 3 nodes and initial_master_nodes: A, because A might bootstrap using only itself and B might bootstrap with A,B,C so that {B,C} is a quorum disjoint from A's. Fixed in 526acba.

return new GetDiscoveredNodesResponse(in);
}
});
doBootstrap(votingConfiguration);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it so important to use the same voting configuration for retrying? As long as it matches the requirements it should be good, no? In case where it's an automated best-effort bootstrap, we have no strong guarantees and keeping the same voting config does not matter either?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not, but this is how we've modelled it. Retrying here is a fairly weird thing to be doing anyway, requiring one of the following:

  • the node is not bootstrapped, and also not a CANDIDATE
  • the node is no longer in touch with a quorum of peers
  • an I/O error occurred

I'm in two minds about whether to retry automatically at all, as opposed to simply logging the issue and letting the operator restart the node if they really need to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this some more and concluded that retrying was not a particularly helpful thing to do on an exception, so as of 2ff94a0 we no longer do so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have retries, for the reason of making bootstrapping fault-tolerant. It would be bad if in a 3 node cluster, one node would go down during bootstrapping, and none of the 2 other nodes would complete the bootstrapping. Yet this is a possibilty, with node 1 going down and node 2 and 3 both use node 1 and themselves as initial voting config (because node 1 was shortly available during discovery).
Similarly I feel that a node turning to follower (because of follower-check, so does not have accepted a cluster state yet) is not a good reason to abort setting initial config. Given that we're not changing the cluster state anymore when setting initial config, I think we should change the check from (not be CANDIDATE) to (not have a voting configuration).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I reverted that commit in 9d2bc20.

We already only bootstrap if we do (not have a voting configuration) - this check is in addition to the check on the mode. And yet we do change the cluster state when setting the initial config, so I don't follow the reasoning. Do you mean to say that we don't change the cluster state version on bootstrapping?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean to say that we don't change the cluster state version on bootstrapping?

yes.

The mode check might be superfluous.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this comment is still unaddressed:

I think we should have retries, for the reason of making bootstrapping fault-tolerant. It would be bad if in a 3 node cluster, one node would go down during bootstrapping, and none of the 2 other nodes would complete the bootstrapping. Yet this is a possibilty, with node 1 going down and node 2 and 3 both use node 1 and themselves as initial voting config (because node 1 was shortly available during discovery).

I feel that this is an important situation to cover and we should also have CoordinatorTests for it, i.e., start 3-node cluster with initial_master_nodes set to all nodes, do a little runRandomly, then completely isolate one of the nodes, and see if the other 2 can form a cluster (stabilization).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the mode check in f30ab6e and we agreed to follow up on the potential availability issue later since it requires some careful thought.

@DaveCTurner DaveCTurner requested a review from ywelsch January 15, 2019 17:29
@DaveCTurner
Copy link
Contributor Author

Build failure looks like #37275; @elasticmachine please run the Gradle build tests 2

@DaveCTurner
Copy link
Contributor Author

@elasticmachine please run the Gradle build tests 2

@DaveCTurner
Copy link
Contributor Author

@elasticmachine please:

  • run the Gradle build tests 1
  • run the Gradle build tests 2

@DaveCTurner
Copy link
Contributor Author

@elasticmachine please:

  • run the Gradle build tests 1
  • run the Gradle build tests 2
  • run the default distro tests

@DaveCTurner
Copy link
Contributor Author

@elasticmachine please:

  • run the Gradle build tests 1
  • run the Gradle build tests 2

1 similar comment
@DaveCTurner
Copy link
Contributor Author

@elasticmachine please:

  • run the Gradle build tests 1
  • run the Gradle build tests 2

Today when bootstrapping a Zen2 cluster we wait for every node in the
`initial_master_nodes` setting to be discovered, so that we can map the node
names or addresses in the `initial_master_nodes` list to their IDs for
inclusion in the initial voting configuration. This means that if any of the
expected master-eligible nodes fails to start then bootstrapping will not occur
and the cluster will not form. This is not ideal, and we would prefer the
cluster to bootstrap even if some of the master-eligible nodes do not start.

Safe bootstrapping requires that all pairs of quorums of all initial
configurations overlap, and this is particularly troublesome to ensure given
that nodes may be concurrently and independently attempting to bootstrap the
cluster. The solution is to bootstrap using an initial configuration whose size
matches the size of the expected set of master-eligible nodes, but with the
unknown IDs replaced by "placeholder" IDs that can never belong to any node.
Any quorum of received votes in any of these placeholder-laden initial
configurations is also a quorum of the "true" initial set of master-eligible
nodes, giving the guarantee that it intersects all other quorums that we
require.

Note that this change means that the initial configuration is not necessarily
robust to any node failures. Normally the cluster will form and then
auto-reconfigure to a more robust configuration in which the placeholder IDs
are replaced by the IDs of genuine nodes as they join the cluster; however if a
node fails between bootstrapping and this auto-reconfiguration then the cluster
may become unavailable. This we feel to be less likely than a node failing to
start at all.
@DaveCTurner DaveCTurner changed the title Simplify ClusterBootstrapService Bootstrap a Zen2 cluster once quorum is discovered Jan 18, 2019
SNAPSHOT_IN_PROGRESS_EXCEPTION(org.elasticsearch.snapshots.SnapshotInProgressException.class,
org.elasticsearch.snapshots.SnapshotInProgressException::new, 152, Version.V_7_0_0);
org.elasticsearch.snapshots.SnapshotInProgressException::new, 151, Version.V_7_0_0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

assert placeholderCount < discoveryNodes.size() : discoveryNodes.size() + " <= " + placeholderCount;
if (bootstrappingPermitted.compareAndSet(true, false)) {
doBootstrap(new VotingConfiguration(Stream.concat(discoveryNodes.stream().map(DiscoveryNode::getId),
Stream.generate(() -> BOOTSTRAP_PLACEHOLDER_PREFIX + UUIDs.randomBase64UUID(random)).limit(placeholderCount))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of randomly-generated UUIDs here, I think we should combine the BOOTSTRAP_PLACEHOLDER_PREFIX with the initial_master_nodes value here. This will make it clearer which entry we have a place-holder for.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very well, I did so in c716734 and also ensured that there are no duplicate requirements in eadc797.

return new GetDiscoveredNodesResponse(in);
}
});
doBootstrap(votingConfiguration);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean to say that we don't change the cluster state version on bootstrapping?

yes.

The mode check might be superfluous.

return new GetDiscoveredNodesResponse(in);
}
});
doBootstrap(votingConfiguration);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this comment is still unaddressed:

I think we should have retries, for the reason of making bootstrapping fault-tolerant. It would be bad if in a 3 node cluster, one node would go down during bootstrapping, and none of the 2 other nodes would complete the bootstrapping. Yet this is a possibilty, with node 1 going down and node 2 and 3 both use node 1 and themselves as initial voting config (because node 1 was shortly available during discovery).

I feel that this is an important situation to cover and we should also have CoordinatorTests for it, i.e., start 3-node cluster with initial_master_nodes set to all nodes, do a little runRandomly, then completely isolate one of the nodes, and see if the other 2 can form a cluster (stabilization).

@DaveCTurner DaveCTurner requested a review from ywelsch January 21, 2019 16:55
}

final Set<DiscoveryNode> nodesMatchingRequirements = requirementMatchingResult.v1();
final List<String> unsatisfiedRequirements = requirementMatchingResult.v2();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you output matching and unsatisfiedRequirements at trace level here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, see 2878bd6

assert unsatisfiedRequirements.size() < discoveryNodes.size() : discoveryNodes + " smaller than " + unsatisfiedRequirements;
if (bootstrappingPermitted.compareAndSet(true, false)) {
doBootstrap(new VotingConfiguration(Stream.concat(discoveryNodes.stream().map(DiscoveryNode::getId),
unsatisfiedRequirements.stream().map(s -> BOOTSTRAP_PLACEHOLDER_PREFIX + s))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a - or something between BOOTSTRAP_PLACEHOLDER_PREFIX and requirement? Alternatively, add it directly to the PREFIX.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, see b40144f. It already ended in punctuation (i.e. }) but a hyphen can't hurt too.

@DaveCTurner DaveCTurner merged commit 5db7ed2 into elastic:master Jan 22, 2019
@DaveCTurner DaveCTurner deleted the 2018-01-09-simple-cluster-bootstrap-service branch January 22, 2019 11:03
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 22, 2019
* elastic/master: (43 commits)
  Remove remaining occurances of "include_type_name=true" in docs (elastic#37646)
  SQL: Return Intervals in SQL format for CLI (elastic#37602)
  Publish to masters first (elastic#37673)
  Un-assign persistent tasks as nodes exit the cluster (elastic#37656)
  Fail start of non-data node if node has data (elastic#37347)
  Use cancel instead of timeout for aborting publications (elastic#37670)
  Follow stats api should return a 404 when requesting stats for a non existing index (elastic#37220)
  Remove deprecated FieldNamesFieldMapper.Builder#index (elastic#37305)
  Document that date math is locale independent
  Bootstrap a Zen2 cluster once quorum is discovered (elastic#37463)
  Upgrade to lucene-8.0.0-snapshot-83f9835. (elastic#37668)
  Mute failing test
  Fix java time formatters that round up (elastic#37604)
  Removes awaits fix as the fix is in. (elastic#37676)
  Mute failing test
  Ensure that max seq # is equal to the global checkpoint when creating ReadOnlyEngines (elastic#37426)
  Mute failing discovery disruption tests
  Add note about how the body is referenced (elastic#33935)
  Define constants for REST requests endpoints in tests (elastic#37610)
  Make prepare engine step of recovery source non-blocking (elastic#37573)
  ...
@ywelsch ywelsch mentioned this pull request Jan 22, 2019
61 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants