-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add elasticsearch-node detach-cluster tool #37979
Conversation
Pinging @elastic/es-distributed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @andrershov. I've left some initial comments. Can we also have a test where we have a data node rejoin a freshly bootstrapped cluster (i.e. when all master-eligible nodes were lost).
Also, can you adapt the existing tests (testDanglingIndices + ???) to show that this allows us to port them to Zen2?
"--------------------------------------------------------------------------\n" + | ||
"\n" + | ||
"You should run this tool only if you have permanently lost all\n" + | ||
"your master-eligible nodes, and you cannot restore the cluster\n" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we also recommend to use it when having lost a majority of master-eligible nodes, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See below "or you have already run elasticsearch-node unsafe-bootstrap
...". Probably @DaveCTurner will have to come up with better wording.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks ok to me.
"your master-eligible nodes, and you cannot restore the cluster\n" + | ||
"from a snapshot, or you have already run `elasticsearch-node unsafe-bootstrap`\n" + | ||
"on master-eligible node that formed cluster with this node.\n" + | ||
"This tool can result in arbitrary data loss and should be\n" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps: Usage of this tool can result in data loss and should be a means of last resort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sentence is just copied from unsafe-bootstrap command
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, that still doesn't make it great :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the original is ok, but suggest this as an alternative:
This tool can cause arbitrary data loss and its use should be your last resort.
final MetaData newMetaData = MetaData.builder(metaData) | ||
.version(0) | ||
.coordinationMetaData(coordinationMetaData) | ||
.clusterUUID(MetaData.UNKNOWN_CLUSTER_UUID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can keep the cluster uuid, and just set clusterUUIDCommitted
to false
MetaData.FORMAT.cleanupOldFiles(manifest.getGlobalGeneration(), dataPaths); | ||
throw new ElasticsearchException(WRITE_METADATA_EXCEPTION_MSG, e); | ||
} | ||
long newCurrentTerm = manifest.getCurrentTerm() + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is incrementing the term necessary given that we changed the cluster uuid? The node will anyway bump its term when electing itself. Let's only do the minimum changes required to the state.
.lastCommittedConfiguration(CoordinationMetaData.VotingConfiguration.MUST_JOIN_ELECTED_MASTER) | ||
.build(); | ||
final MetaData newMetaData = MetaData.builder(metaData) | ||
.version(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why set this to 0? Is this necessary?
.clusterUUIDCommitted(false) | ||
.build(); | ||
|
||
writeNewMetaData(terminal, manifest, 0, 0, metaData, newMetaData, dataPaths); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can keep the cluster state version, and just set the term to 0.
UnsafeBootstrap - do not increment current term DetachCluster - do not reset MetaData and cluster state versions, do not change clusterUUID
Replaces testIndexImportedFromDataOnlyNodesIfMasterLostDataFolder and testDanglingIndices
@ywelsch Thank you for the review! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggested some small changes to wording
@@ -325,6 +325,8 @@ public String toString() { | |||
public static class VotingConfiguration implements Writeable, ToXContentFragment { | |||
|
|||
public static final VotingConfiguration EMPTY_CONFIG = new VotingConfiguration(Collections.emptySet()); | |||
public static final VotingConfiguration MUST_JOIN_ELECTED_MASTER = new VotingConfiguration(Collections.singleton( | |||
"_must_join_elected_master_")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this deserves a special case in ClusterFormationFailureHelper
(and its tests) as it will yield a somewhat strange message as it is:
master not discovered or elected yet, an election requires a node with id [_must_join_elected_master_], have discovered [] which is not a quorum; discovery will continue using [] from hosts providers and [...] from last-known cluster state; node term 0, last-accepted version 0 in term 0
I suggest:
master not discovered yet and this node was detached from its previous cluster, have discovered []; discovery will continue using [] from hosts providers and [...] from last-known cluster state; node term 0, last-accepted version 0 in term 0
"your master-eligible nodes, and you cannot restore the cluster\n" + | ||
"from a snapshot, or you have already run `elasticsearch-node unsafe-bootstrap`\n" + | ||
"on master-eligible node that formed cluster with this node.\n" + | ||
"This tool can result in arbitrary data loss and should be\n" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the original is ok, but suggest this as an alternative:
This tool can cause arbitrary data loss and its use should be your last resort.
"--------------------------------------------------------------------------\n" + | ||
"\n" + | ||
"You should run this tool only if you have permanently lost all\n" + | ||
"your master-eligible nodes, and you cannot restore the cluster\n" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks ok to me.
"You should run this tool only if you have permanently lost all\n" + | ||
"your master-eligible nodes, and you cannot restore the cluster\n" + | ||
"from a snapshot, or you have already run `elasticsearch-node unsafe-bootstrap`\n" + | ||
"on master-eligible node that formed cluster with this node.\n" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"on master-eligible node that formed cluster with this node.\n" + | |
"on a master-eligible node that formed a cluster with this node.\n" + |
"Do you want to proceed?\n"; | ||
|
||
public DetachClusterCommand() { | ||
super("Detaches this node from the cluster with old UUID, allowing it to join new cluster"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super("Detaches this node from the cluster with old UUID, allowing it to join new cluster"); | |
super("Detaches this node from its cluster, allowing it to unsafely join a new cluster"); |
@DaveCTurner thanks for your review! I've addressed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Can you also address the TODO in testCannotJoinClusterWithDifferentUUID
with this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM2
# Conflicts: # server/src/test/java/org/elasticsearch/cluster/coordination/UnsafeBootstrapMasterIT.java
run elasticsearch-ci/2 |
2 similar comments
run elasticsearch-ci/2 |
run elasticsearch-ci/2 |
# Conflicts: # server/src/test/java/org/elasticsearch/cluster/coordination/UnsafeBootstrapMasterIT.java
* master: (36 commits) Ensure joda compatibility in custom date formats (elastic#38171) Do not compute cardinality if the `terms` execution mode does not use `global_ordinals` (elastic#38169) Do not set timeout for IndexRequests in GatewayIndexStateIT (elastic#38147) Zen2ify testMasterFailoverDuringIndexingWithMappingChanges (elastic#38178) SQL: [Docs] Add limitation for aggregate functions on scalars (elastic#38186) Add elasticsearch-node detach-cluster command (elastic#37979) Add tests for fractional epoch parsing (elastic#38162) Enable bw tests for elastic#37871 and elastic#38032. (elastic#38167) Clear send behavior rule in CloseWhileRelocatingShardsIT (elastic#38159) Fix testCorruptedIndex (elastic#38161) Add finalReduce flag to SearchRequest (elastic#38104) Forbid negative field boosts in analyzed queries (elastic#37930) Remove AtomiFieldData#getLegacyFieldValues (elastic#38087) Universal cluster bootstrap method for tests with autoMinMasterNodes=false (elastic#38038) Fix FullClusterRestartIT.testHistoryUUIDIsAdded (elastic#38098) Replace joda time in ingest-common module (elastic#38088) Fix eclipse config for ssl-config (elastic#38096) Don't load global ordinals with the `map` execution_hint (elastic#37833) Relax fault detector in some disruption tests (elastic#38101) Fix java time epoch date formatters (elastic#37829) ...
Please remember to add all relevant labels (area label, version label(s) and change type label) on all PRs and please look for this as part of reviews. The release note generation process is made much harder when PRs are not labelled correctly. |
This commit adds the second part of
elasticsearch-node
tool -detach-cluster
command in addition tounsafe-bootstrap
command.Also, this commit changes the semantics of
unsafe-bootstrap
, nowunsafe-bootstrap
changes clusterUUID.So the algorithm of running
elasticsearch-node
tool is the following:run the
unsafe-bootstrap
command on it. If there are no survivedmaster-eligible nodes - skip this step.
detach-cluster
command on the remaining survived nodes.Detach cluster makes the following changes to the node metadata:
constant MUST_JOIN_ELECTED_MASTER, that prevents initial cluster
bootstrap.
ElasticsearchNodeCommand
base abstract class is introduced, becauseUnsafeBootstrapMasterCommand
andDetachClusterCommand
have a lot incommon.
Also, this commit adds "ordinal" parameter to both commands, because it's
impossible to write IT otherwise.
For MUST_JOIN_ELECTED_MASTER case special handling is introduced in
ClusterFormationFailureHelper
.Tests for both commands reside in
ElasticsearchNodeCommandIT
(renamedfrom
UnsafeBootstrapMasterIT
).