Enforce cluster UUIDs #37775
Conversation
Pinging @elastic/es-distributed
@DaveCTurner @andrershov this is ready for a first review
I have added minor comments.
It would also be nice to add an integration test, similar to testIndexImportedFromDataOnlyNodesIfMasterLostDataFolder or RecoveryFromGatewayIT.testTwoNodeFirstNodeCleared, that clears the data folder of a master-eligible node; after that, the data node won't be able to join the cluster because join validation will fail due to a clusterUUID mismatch.
server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationState.java
final ClusterState localState = currentStateSupplier.get();
if (localState.metaData().clusterUUIDCommitted() &&
    localState.metaData().clusterUUID().equals(request.getState().metaData().clusterUUID()) == false) {
    throw new CoordinationStateRejectedException("join validation on cluster state" +
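The check in this hunk can be modeled as a small self-contained sketch. The class and method names below are illustrative stand-ins, not the actual Elasticsearch API: a node whose cluster UUID is committed must see the same UUID in the joining master's state, otherwise the join is rejected.

```java
// Hypothetical, simplified model of the join-validation check above.
public class JoinValidationSketch {

    // Rejects a join when the local node is already locked into a
    // different cluster (committed UUID that does not match the remote one).
    static void validateJoin(boolean localUuidCommitted, String localUuid, String remoteUuid) {
        if (localUuidCommitted && localUuid.equals(remoteUuid) == false) {
            throw new IllegalStateException("join validation failed: local cluster UUID ["
                + localUuid + "] does not match remote cluster UUID [" + remoteUuid + "]");
        }
    }

    public static void main(String[] args) {
        validateJoin(false, "_na_", "abc"); // uncommitted UUID: join allowed
        validateJoin(true, "abc", "abc");   // matching UUIDs: join allowed
        boolean rejected = false;
        try {
            validateJoin(true, "abc", "def"); // committed mismatch: rejected
        } catch (IllegalStateException e) {
            rejected = true;
        }
        System.out.println("mismatch rejected: " + rejected); // prints "mismatch rejected: true"
    }
}
```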
What is the reason that we log a warning message in Coordinator, but do not log it here?
server/src/main/java/org/elasticsearch/cluster/metadata/MetaData.java
shiftedNode.getLocalNode(), n -> shiftedNode.persistedState);
cluster1.clusterNodes.add(newNode);

MockLogAppender mockAppender = new MockLogAppender();
It's a pity that we don't have a better way of understanding that join validation has failed, other than analyzing log output
+1 to that. Why can't we e.g. call cluster1.runFor(DEFAULT_STABILISATION_TIME) and then assert that the node didn't manage to join the cluster?
I've extended it to DEFAULT_STABILISATION_TIME, but I also want to make sure it fails for the right reason, hence keeping the logging check.
I'm all ears if you have better suggestions.
How about asserting that if we change nothing except the cluster UUID on disk then the node does join the cluster?
But there is no guarantee that it will if it has the same term but a higher version than the master. I think these kinds of tests will be more useful after we have the detach-cluster tool.
.coordinationMetaData(CoordinationMetaData.builder(metaData.coordinationMetaData())
    .term(0L).build())
.build(),
term -> 0L);
It seems that resetting term and currentTerm to 0 was required? Shall we do the same in the elasticsearch-node tool? What about version?
Yes, this was required. I'll leave the definitive way to the tool (added a TODO in c3e0aa2).
Looks good. I left a few small comments.
@@ -268,7 +268,7 @@ public void testMixCurrentAndRecoveredState() {
final ClusterState updatedState = mixCurrentStateAndRecoveredState(currentState, recoveredState);

assertThat(updatedState.metaData().clusterUUID(), not(equalTo("_na_")));
Should really use UNKNOWN_CLUSTER_UUID here, and elsewhere in the tests too.
see 3dc2c62
metaDataBuilder = MetaData.builder(lastAcceptedState.metaData());
metaDataBuilder.coordinationMetaData(coordinationMetaData);
}
if (lastAcceptedState.metaData().clusterUUID().equals(MetaData.UNKNOWN_CLUSTER_UUID) == false &&
Can we commit a state without a cluster UUID? Feels like this should be an assertion to me.
If the master node is a Zen1 node that has not recovered its state yet, that can unfortunately be the case (testMixedClusterFormation found this). I've added an assertion to that effect.
Can we write the assertion in a way that means we will have to remove it when Zen1 is no more? E.g. mention ZEN1_BWC_TERM?
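The suggested invariant can be sketched as a small standalone model. The constant and method names below are illustrative (ZEN1_BWC_TERM here is a stand-in, not the real Coordinator constant): the cluster UUID may only be uncommitted while we are still interoperating with a Zen1 master, so tying the assertion to the BWC term constant forces it to be revisited when Zen1 support is removed.

```java
// Hypothetical sketch of the Zen1-scoped invariant discussed above.
public class ClusterUuidAssertionSketch {
    // Stand-in for the real BWC term constant; once Zen1 support is
    // deleted, code referencing this constant must be revisited too.
    static final long ZEN1_BWC_TERM = 0L;

    // The invariant: a state may only have an uncommitted cluster UUID
    // in the Zen1 backwards-compatibility case (term 0).
    static boolean invariantHolds(boolean clusterUUIDCommitted, long term) {
        return clusterUUIDCommitted || term == ZEN1_BWC_TERM;
    }

    public static void main(String[] args) {
        System.out.println(invariantHolds(true, 5L));  // true: committed UUID is always fine
        System.out.println(invariantHolds(false, 0L)); // true: Zen1 BWC case
        System.out.println(invariantHolds(false, 5L)); // false: would trip the assertion
    }
}
```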
done in 2697b44
final ClusterNode shiftedNode = randomFrom(cluster2.clusterNodes).restartedNode();
final ClusterNode newNode = cluster1.new ClusterNode(nextNodeIndex.getAndIncrement(),
    shiftedNode.getLocalNode(), n -> shiftedNode.persistedState);
cluster1.clusterNodes.add(newNode);
🙈
@elasticmachine run elasticsearch-ci/1
LGTM, but there are still some open questions from @andrershov.
LGTM2, the only unanswered question is about logging, but I guess it's done this way to simplify testing.
@elasticmachine what have you done?
Today we fail to join a Zen2 cluster if the cluster UUID does not match our own, but we do not perform the same validation when joining a Zen1 cluster. This means that a Zen2 node will pass join validation and be added to a Zen1 cluster but will reject all cluster states from the master. Relates elastic#37775
This is a forward-port of parts of elastic#41063 to `master`, adding a test to show that join validation does indeed verify that the cluster UUIDs match. Relates elastic#37775
This PR adds join validation around cluster UUIDs, preventing a node from joining a cluster if it was previously part of another cluster.
The PR introduces a new flag to the cluster state, clusterUUIDCommitted, which denotes whether the node has locked into a cluster with the given UUID. When a cluster UUID is committed, this flag turns to true, and subsequent cluster state updates keep the information about committal. Note that coordinating-only nodes are still free to switch clusters at will (after restart), as they don't carry any persistent state.
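The committal semantics described above can be sketched with a minimal standalone model (illustrative names, not the real MetaData API): the UUID starts out unknown and uncommitted, becomes locked in on the first commit, and any later attempt to commit a different UUID is rejected.

```java
// Hypothetical model of the clusterUUIDCommitted lifecycle described above.
public class ClusterUuidModel {
    static final String UNKNOWN_CLUSTER_UUID = "_na_";

    private String clusterUUID = UNKNOWN_CLUSTER_UUID;
    private boolean clusterUUIDCommitted = false;

    // Locks the node into the given cluster UUID; later commits must match.
    void commitClusterUUID(String uuid) {
        if (clusterUUIDCommitted && clusterUUID.equals(uuid) == false) {
            throw new IllegalStateException(
                "node already belongs to cluster [" + clusterUUID + "]");
        }
        clusterUUID = uuid;
        clusterUUIDCommitted = true;
    }

    boolean isCommitted() {
        return clusterUUIDCommitted;
    }

    public static void main(String[] args) {
        ClusterUuidModel node = new ClusterUuidModel();
        node.commitClusterUUID("cluster-1"); // first commit locks the UUID in
        node.commitClusterUUID("cluster-1"); // re-committing the same UUID is fine
        try {
            node.commitClusterUUID("cluster-2"); // a different cluster is rejected
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In this model a coordinating-only node corresponds to simply discarding the object on restart: with no persisted state, the flag is false again and the node can join any cluster.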