Bootstrap a Zen2 cluster once quorum is discovered #37463
Conversation
Today, the cluster bootstrapping process involves two (local) transport actions in order to support a flexible bootstrapping API and to make it easily accessible to plugins. However this flexibility is not required for the current design so it is adding a good deal of unnecessary complexity. This commit removes this complexity in favour of a much simpler ClusterBootstrapService implementation that does all the work itself.
Pinging @elastic/es-distributed
The merge conflict is trivial - the file can be deleted - but I am not using the current
I like the simplification here. I've left some comments on ClusterBootstrapService.
private void startBootstrap(Set<DiscoveryNode> discoveryNodes) {
    if (bootstrappingPermitted.compareAndSet(true, false)) {
        doBootstrap(new VotingConfiguration(discoveryNodes.stream().map(DiscoveryNode::getId).collect(Collectors.toSet())));
I'm confused. Why does it put all discovered nodes into the bootstrapping configuration, and not only those that actually match the initialMasterNodes requirements? For the best-effort bootstrapping I understand, but for the settings-based bootstrapping this is a bit unexpected.
We know these nodes are live (the PeerFinder heard from them within the last few seconds) and will reconfigure to an optimal configuration as soon as the master is elected, so I don't think it makes much difference. More master nodes is generally more robust, right?
It might also be unsafe. Assume I start up 5 nodes (A, B, C, D, E) with initial_master_nodes set to A. Then it's possible that B will bootstrap with A,B,C and that D will bootstrap with A,D,E, which have the non-overlapping quorums (B,C) and (D,E). I don't think we should do this in the case of using initialMasterNodes; we should stick exactly to the formal model.
Oh yes, you're quite right. In fact it's unsafe with 3 nodes and initial_master_nodes: A, because A might bootstrap using only itself and B might bootstrap with A,B,C, so that {B,C} is a quorum disjoint from A's. Fixed in 526acba.
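To make the hazard concrete, here is a small self-contained sketch in plain Java (illustration only, not Elasticsearch code): A bootstraps with {A}, B bootstraps with {A,B,C}, and the resulting quorums {A} and {B,C} do not intersect, so two masters could be elected independently.

```java
import java.util.Collections;
import java.util.Set;

// Plain-Java illustration of the unsafe scenario above; not Elasticsearch code.
public class DisjointQuorumsExample {
    // A quorum is any set of votes containing a strict majority of the configuration.
    static boolean isQuorum(Set<String> votes, Set<String> config) {
        long agreeing = votes.stream().filter(config::contains).count();
        return agreeing * 2 > config.size();
    }

    public static void main(String[] args) {
        Set<String> configOnA = Set.of("A");           // A bootstraps using only itself
        Set<String> configOnB = Set.of("A", "B", "C"); // B bootstraps with everything it discovered

        Set<String> quorum1 = Set.of("A");      // a quorum of configOnA
        Set<String> quorum2 = Set.of("B", "C"); // a quorum of configOnB

        System.out.println(isQuorum(quorum1, configOnA));          // true
        System.out.println(isQuorum(quorum2, configOnB));          // true
        // The two quorums are disjoint, so the two configurations are not safe together.
        System.out.println(Collections.disjoint(quorum1, quorum2)); // true
    }
}
```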
        return new GetDiscoveredNodesResponse(in);
    }
});
doBootstrap(votingConfiguration);
Is it so important to use the same voting configuration for retrying? As long as it matches the requirements it should be good, no? In the case where it's an automated best-effort bootstrap we have no strong guarantees anyway, so keeping the same voting config does not matter either?
Probably not, but this is how we've modelled it. Retrying here is a fairly weird thing to be doing anyway, requiring one of the following:
- the node is not bootstrapped, and also not a CANDIDATE
- the node is no longer in touch with a quorum of peers
- an I/O error occurred
I'm in two minds about whether to retry automatically at all, as opposed to simply logging the issue and letting the operator restart the node if they really need to.
I thought about this some more and concluded that retrying was not a particularly helpful thing to do on an exception, so as of 2ff94a0 we no longer do so.
I think we should have retries, for the reason of making bootstrapping fault-tolerant. It would be bad if, in a 3-node cluster, one node went down during bootstrapping and neither of the other two nodes completed the bootstrapping. Yet this is a possibility, with node 1 going down and nodes 2 and 3 both using node 1 and themselves as the initial voting config (because node 1 was briefly available during discovery).
Similarly, I feel that a node turning into a follower (because of a follower check, so it has not yet accepted a cluster state) is not a good reason to abort setting the initial config. Given that we're not changing the cluster state anymore when setting the initial config, I think we should change the check from (not be CANDIDATE) to (not have a voting configuration).
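A minimal sketch of these two suggestions, using hypothetical helper names (hasAcceptedVotingConfiguration, configurationFor) rather than the actual ClusterBootstrapService API: gate bootstrapping on "no voting configuration accepted yet" instead of on the node's mode, and re-enable the permit on failure so a later discovery round can retry.

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch only: the abstract method names are placeholders, not the real API.
abstract class BootstrapRetrySketch {
    private final AtomicBoolean bootstrappingPermitted = new AtomicBoolean(true);

    // True once an initial voting configuration has been accepted (locally or via a join).
    abstract boolean hasAcceptedVotingConfiguration();

    // Builds the initial voting configuration from the discovered node IDs.
    abstract Set<String> configurationFor(Set<String> discoveredNodeIds);

    // Asks the coordinator to set the given initial configuration.
    abstract void doBootstrap(Set<String> votingConfiguration);

    void maybeBootstrap(Set<String> discoveredNodeIds) {
        if (hasAcceptedVotingConfiguration()) {
            return; // already bootstrapped, or joined a cluster that was
        }
        if (bootstrappingPermitted.compareAndSet(true, false)) {
            try {
                doBootstrap(configurationFor(discoveredNodeIds));
            } catch (RuntimeException e) {
                // Re-enable the permit so the next discovery round can retry, which is
                // what makes bootstrapping tolerant of a node dropping out part-way.
                bootstrappingPermitted.set(true);
                throw e;
            }
        }
    }
}
```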
Ok, I reverted that commit in 9d2bc20.
We already only bootstrap if we do (not have a voting configuration) - this check is in addition to the check on the mode. And yet we do change the cluster state when setting the initial config, so I don't follow the reasoning. Do you mean to say that we don't change the cluster state version on bootstrapping?
> Do you mean to say that we don't change the cluster state version on bootstrapping?

Yes. The mode check might be superfluous.
Note that this comment is still unaddressed:

> I think we should have retries, for the reason of making bootstrapping fault-tolerant. It would be bad if, in a 3-node cluster, one node went down during bootstrapping and neither of the other two nodes completed the bootstrapping. Yet this is a possibility, with node 1 going down and nodes 2 and 3 both using node 1 and themselves as the initial voting config (because node 1 was briefly available during discovery).

I feel that this is an important situation to cover and we should also have CoordinatorTests for it, i.e. start a 3-node cluster with initial_master_nodes set to all nodes, do a little runRandomly, then completely isolate one of the nodes, and see if the other two can form a cluster (stabilization).
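A rough outline of the test case described above, in the spirit of the deterministic Zen2 test harness; the harness names used here (Cluster, getAnyNode, blackhole) are assumptions for illustration, not verified API.

```java
// Sketch only: harness names are assumptions about the deterministic Zen2 test framework.
public void testTwoNodesFormClusterWhenThirdInitialMasterNodeIsIsolated() {
    // Three master-eligible nodes, all listed in cluster.initial_master_nodes.
    final Cluster cluster = new Cluster(3);

    cluster.runRandomly();                    // a little chaos before the isolation

    final Cluster.ClusterNode isolatedNode = cluster.getAnyNode();
    isolatedNode.blackhole();                 // hypothetical isolation call; the real harness may differ

    // The two remaining nodes should still discover each other, bootstrap, elect a
    // master, and stabilise without the isolated node.
    cluster.stabilise();
}
```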
I removed the mode check in f30ab6e and we agreed to follow up on the potential availability issue later since it requires some careful thought.
Build failure looks like #37275; @elasticmachine please run the Gradle build tests 2
@elasticmachine please run the Gradle build tests 2
@elasticmachine please:
@elasticmachine please:
@elasticmachine please:
1 similar comment
@elasticmachine please:
Today when bootstrapping a Zen2 cluster we wait for every node in the `initial_master_nodes` setting to be discovered, so that we can map the node names or addresses in the `initial_master_nodes` list to their IDs for inclusion in the initial voting configuration. This means that if any of the expected master-eligible nodes fails to start then bootstrapping will not occur and the cluster will not form. This is not ideal, and we would prefer the cluster to bootstrap even if some of the master-eligible nodes do not start.

Safe bootstrapping requires that all pairs of quorums of all initial configurations overlap, and this is particularly troublesome to ensure given that nodes may be concurrently and independently attempting to bootstrap the cluster. The solution is to bootstrap using an initial configuration whose size matches the size of the expected set of master-eligible nodes, but with the unknown IDs replaced by "placeholder" IDs that can never belong to any node. Any quorum of received votes in any of these placeholder-laden initial configurations is also a quorum of the "true" initial set of master-eligible nodes, giving the guarantee that it intersects all other quorums that we require.

Note that this change means that the initial configuration is not necessarily robust to any node failures. Normally the cluster will form and then auto-reconfigure to a more robust configuration in which the placeholder IDs are replaced by the IDs of genuine nodes as they join the cluster; however if a node fails between bootstrapping and this auto-reconfiguration then the cluster may become unavailable. This we feel to be less likely than a node failing to start at all.
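The overlap argument can be illustrated with a small self-contained sketch in plain Java (illustration only, not the Elasticsearch implementation; the placeholder prefix value is an assumption): a node that has only discovered A pads its configuration with placeholders up to the expected size of three, and since placeholders can never vote it cannot form a quorum on its own, unlike in the earlier unsafe example.

```java
import java.util.HashSet;
import java.util.Set;

// Plain-Java illustration of the placeholder argument; not the actual implementation.
public class PlaceholderQuorumExample {
    static final String PLACEHOLDER_PREFIX = "{bootstrap-placeholder}-"; // assumed form

    // Pad the discovered IDs with placeholders up to the expected number of master nodes.
    static Set<String> placeholderConfig(Set<String> discoveredIds, int expectedMasterNodes) {
        Set<String> config = new HashSet<>(discoveredIds);
        for (int i = config.size(); i < expectedMasterNodes; i++) {
            config.add(PLACEHOLDER_PREFIX + i);
        }
        return config;
    }

    // Votes form a quorum if they are a strict majority of the configuration.
    static boolean isQuorum(Set<String> votes, Set<String> config) {
        return votes.stream().filter(config::contains).count() * 2 > config.size();
    }

    public static void main(String[] args) {
        // Three expected master nodes A, B, C; this node has only discovered A.
        Set<String> config = placeholderConfig(Set.of("A"), 3);

        // Placeholders never vote, so a single real node cannot reach a majority
        // of a three-entry configuration and cannot bootstrap on its own.
        System.out.println(isQuorum(Set.of("A"), config));                    // false
        // A node that discovered all of A, B, C needs two real votes, e.g. {B, C},
        // which is also a majority of the "true" set {A, B, C}.
        System.out.println(isQuorum(Set.of("B", "C"), Set.of("A", "B", "C"))); // true
    }
}
```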
  SNAPSHOT_IN_PROGRESS_EXCEPTION(org.elasticsearch.snapshots.SnapshotInProgressException.class,
-     org.elasticsearch.snapshots.SnapshotInProgressException::new, 152, Version.V_7_0_0);
+     org.elasticsearch.snapshots.SnapshotInProgressException::new, 151, Version.V_7_0_0);
👍
assert placeholderCount < discoveryNodes.size() : discoveryNodes.size() + " <= " + placeholderCount;
if (bootstrappingPermitted.compareAndSet(true, false)) {
    doBootstrap(new VotingConfiguration(Stream.concat(discoveryNodes.stream().map(DiscoveryNode::getId),
        Stream.generate(() -> BOOTSTRAP_PLACEHOLDER_PREFIX + UUIDs.randomBase64UUID(random)).limit(placeholderCount))
Instead of randomly-generated UUIDs here, I think we should combine the BOOTSTRAP_PLACEHOLDER_PREFIX with the initial_master_nodes value. This will make it clearer which entry we have a placeholder for.
final Set<DiscoveryNode> nodesMatchingRequirements = requirementMatchingResult.v1();
final List<String> unsatisfiedRequirements = requirementMatchingResult.v2();
Can you output nodesMatchingRequirements and unsatisfiedRequirements at trace level here?
Yep, see 2878bd6
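For reference, such a trace message could look roughly like the following (a sketch assuming a standard Log4j logger; the exact wording added in 2878bd6 may differ):

```java
// Sketch only; the actual message in 2878bd6 may differ.
logger.trace("nodesMatchingRequirements={}, unsatisfiedRequirements={}",
    nodesMatchingRequirements, unsatisfiedRequirements);
```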
assert unsatisfiedRequirements.size() < discoveryNodes.size() : discoveryNodes + " smaller than " + unsatisfiedRequirements;
if (bootstrappingPermitted.compareAndSet(true, false)) {
    doBootstrap(new VotingConfiguration(Stream.concat(discoveryNodes.stream().map(DiscoveryNode::getId),
        unsatisfiedRequirements.stream().map(s -> BOOTSTRAP_PLACEHOLDER_PREFIX + s))
Add a `-` or something between BOOTSTRAP_PLACEHOLDER_PREFIX and the requirement? Alternatively, add it directly to the PREFIX.
Yep, see b40144f. It already ended in punctuation (i.e. `}`) but a hyphen can't hurt either.
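To illustrate the outcome of this thread: with the prefix and a hyphen separator, each unsatisfied requirement maps to a clearly-labelled placeholder entry. The prefix value and the example requirement name below are assumptions for illustration only.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative only; the prefix value and resulting IDs are assumptions, not exact output.
public class PlaceholderNaming {
    static final String BOOTSTRAP_PLACEHOLDER_PREFIX = "{bootstrap-placeholder}-";

    public static void main(String[] args) {
        List<String> unsatisfiedRequirements = List.of("master-c"); // hypothetical missing entry
        Set<String> placeholders = unsatisfiedRequirements.stream()
            .map(s -> BOOTSTRAP_PLACEHOLDER_PREFIX + s)
            .collect(Collectors.toSet());
        // e.g. [{bootstrap-placeholder}-master-c]: clearly tied to the missing
        // initial_master_nodes entry, and unable to collide with a real node ID.
        System.out.println(placeholders);
    }
}
```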
* elastic/master: (43 commits)
  Remove remaining occurances of "include_type_name=true" in docs (elastic#37646)
  SQL: Return Intervals in SQL format for CLI (elastic#37602)
  Publish to masters first (elastic#37673)
  Un-assign persistent tasks as nodes exit the cluster (elastic#37656)
  Fail start of non-data node if node has data (elastic#37347)
  Use cancel instead of timeout for aborting publications (elastic#37670)
  Follow stats api should return a 404 when requesting stats for a non existing index (elastic#37220)
  Remove deprecated FieldNamesFieldMapper.Builder#index (elastic#37305)
  Document that date math is locale independent
  Bootstrap a Zen2 cluster once quorum is discovered (elastic#37463)
  Upgrade to lucene-8.0.0-snapshot-83f9835. (elastic#37668)
  Mute failing test
  Fix java time formatters that round up (elastic#37604)
  Removes awaits fix as the fix is in. (elastic#37676)
  Mute failing test
  Ensure that max seq # is equal to the global checkpoint when creating ReadOnlyEngines (elastic#37426)
  Mute failing discovery disruption tests
  Add note about how the body is referenced (elastic#33935)
  Define constants for REST requests endpoints in tests (elastic#37610)
  Make prepare engine step of recovery source non-blocking (elastic#37573)
  ...
Today when bootstrapping a Zen2 cluster we wait for every node in the
`initial_master_nodes` setting to be discovered, so that we can map the
node names or addresses in the `initial_master_nodes` list to their IDs for
inclusion in the initial voting configuration. This means that if any of
the expected master-eligible nodes fails to start then bootstrapping will
not occur and the cluster will not form. This is not ideal, and we would
prefer the cluster to bootstrap even if some of the master-eligible nodes
do not start.
Safe bootstrapping requires that all pairs of quorums of all initial
configurations overlap, and this is particularly troublesome to ensure
given that nodes may be concurrently and independently attempting to
bootstrap the cluster. The solution is to bootstrap using an initial
configuration whose size matches the size of the expected set of
master-eligible nodes, but with the unknown IDs replaced by "placeholder"
IDs that can never belong to any node. Any quorum of received votes in any
of these placeholder-laden initial configurations is also a quorum of the
"true" initial set of master-eligible nodes, giving the guarantee that it
intersects all other quorums as required.
Note that this change means that the initial configuration is not
necessarily robust to any node failures. Normally the cluster will form and
then auto-reconfigure to a more robust configuration in which the
placeholder IDs are replaced by the IDs of genuine nodes as they join the
cluster; however if a node fails between bootstrapping and this
auto-reconfiguration then the cluster may become unavailable. This we feel
to be less likely than a node failing to start at all.
This commit also enormously simplifies the cluster bootstrapping process.
Today, the cluster bootstrapping process involves two (local) transport actions
in order to support a flexible bootstrapping API and to make it easily
accessible to plugins. However this flexibility is not required for the current
design so it is adding a good deal of unnecessary complexity. Here we remove
this complexity in favour of a much simpler ClusterBootstrapService
implementation that does all the work itself.