
Reduce resource needs of join validation #85380

Merged

Conversation

DaveCTurner
Contributor

Fixes a few scalability issues around join validation:

  • compresses the cluster state sent over the wire (see the sketch below)
  • shares the serialized cluster state across multiple nodes
  • forks the decompression/deserialization work off the transport thread

Relates #77466
Closes #83204
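As a rough illustration of the first two bullets (not the exact code added in this PR, which uses the transport layer's recycled, ref-counted buffers), the state is serialized and compressed once and the resulting bytes are reused for every joining node:

import java.io.IOException;

import org.elasticsearch.Version;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.common.compress.CompressorFactory;
import org.elasticsearch.common.io.stream.BytesStreamOutput;
import org.elasticsearch.common.io.stream.OutputStreamStreamOutput;
import org.elasticsearch.common.io.stream.StreamOutput;

class JoinValidationSketch {
    // Serialize and compress the cluster state once; the returned bytes can then be
    // shared by every join-validation request sent during the cache window.
    static BytesReference serializeCompressed(ClusterState clusterState, Version version) throws IOException {
        final BytesStreamOutput bytesStream = new BytesStreamOutput();
        try (
            StreamOutput stream = new OutputStreamStreamOutput(
                CompressorFactory.COMPRESSOR.threadLocalOutputStream(bytesStream)
            )
        ) {
            stream.setVersion(version);
            clusterState.writeTo(stream);
        }
        return bytesStream.bytes();
    }
}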

@DaveCTurner DaveCTurner added >enhancement :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.2.0 labels Mar 28, 2022
@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

@DaveCTurner DaveCTurner force-pushed the 2022-03-28-join-validation-service branch from 40ba7c4 to 387ded0 on March 28, 2022 09:32
@DaveCTurner
Contributor Author

Still needs work on the tests (see AwaitsFix annotations) but I'm opening this to get some CI attention. I doubt I'll get this in for 8.2.

@DaveCTurner DaveCTurner marked this pull request as ready for review April 4, 2022 11:06
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Apr 4, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner
Contributor Author

Think this is good to go now.

Note that it changes the semantics of join validation slightly: we obviously don't send the latest state each time any more; it might be up to 60s stale. Also, we used to be certain that the sender was marked as the master in the state we sent, but that's no longer the case since we get the state twice. I don't think these are meaningful changes, but they're worth pointing out anyway.

return ClusterState.readFrom(input, null);
}
} finally {
IOUtils.close(in);
Contributor

I believe `in` might already be closed by the `try (StreamInput input = in)` line in the case where there is no exception. I wonder if it is possible to simplify the code to a single try-with-resources instead of a nested one.

Contributor Author

Yes, I think you're right; I tidied this up in f9df3b9. This comes from the code in PublicationTransportHandler, which dates back a long way (b80324d and beyond) and has accumulated a bunch of cruft over the years.
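For reference, a minimal sketch of the single-try-with-resources shape discussed here; the method name, the parameters and the NamedWriteableAwareStreamInput wrapper are assumptions for the sketch rather than the exact code in f9df3b9:

import java.io.IOException;

import org.elasticsearch.Version;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.node.DiscoveryNode;
import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.common.compress.CompressorFactory;
import org.elasticsearch.common.io.stream.InputStreamStreamInput;
import org.elasticsearch.common.io.stream.NamedWriteableAwareStreamInput;
import org.elasticsearch.common.io.stream.NamedWriteableRegistry;
import org.elasticsearch.common.io.stream.StreamInput;

class ReadStateSketch {
    // One try-with-resources owns the whole decompress/deserialize chain, so there is
    // no separate finally block that might close the underlying stream a second time.
    static ClusterState readCompressedState(
        BytesReference bytes,
        NamedWriteableRegistry registry,
        Version version,
        DiscoveryNode localNode
    ) throws IOException {
        try (
            StreamInput input = new NamedWriteableAwareStreamInput(
                new InputStreamStreamInput(CompressorFactory.COMPRESSOR.threadLocalInputStream(bytes.streamInput())),
                registry
            )
        ) {
            input.setVersion(version);
            return ClusterState.readFrom(input, localNode);
        }
    }
}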

}

public ClusterState getState() throws IOException {
return stateSupplier.get();
Contributor

Could this be called multiple times?
If so, then the input would be read more than once. Should we consider LazyInitializable here?

Contributor Author

It is only called once, so I think this is fine. I renamed it in fd908df to getOrReadState just to make it a bit clearer that this might be more than a simple getter.
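For completeness, a generic sketch of the memoization that LazyInitializable would provide (the class and interface names below are invented for the example); the PR keeps the plain supplier because the method is only invoked once:

import java.io.IOException;

final class CachedStateSupplier<T> {

    // hypothetical functional interface so the reader may throw IOException
    interface IOSupplier<V> {
        V get() throws IOException;
    }

    private final IOSupplier<T> reader;
    private T cached;

    CachedStateSupplier(IOSupplier<T> reader) {
        this.reader = reader;
    }

    // reads (decompresses and deserializes) at most once and caches the result
    synchronized T getOrRead() throws IOException {
        if (cached == null) {
            cached = reader.get();
        }
        return cached;
    }
}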

Comment on lines +326 to +331
} finally {
if (success == false) {
assert false;
bytesStream.close();
}
}
Contributor

Could this be converted to an unconditional catch or am I missing something?

Contributor Author

I think it's better to close resources in a finally like this. For instance, this will close the stream in tests even if an assertion trips, which we need to do to avoid tripping the leak detector later on. To get the same effect with a catch block we'd need `catch (Throwable t)`, which is not permitted.

Contributor Author
@DaveCTurner DaveCTurner Apr 12, 2022

(on closer inspection I see that on an assertion error we would leak the rest of the queue anyway, see 3af680b)
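Spelling out the idiom under discussion as a generic sketch (the method and its parameters are invented for illustration):

import java.io.Closeable;
import java.io.IOException;

class ReleaseOnFailureSketch {
    // The resource is released in the finally block whenever we fail to reach the
    // hand-off point, even if the failure is an AssertionError in tests; matching
    // that with a catch would require the disallowed catch (Throwable t).
    static void enqueueOrClose(Closeable bytesStream, Runnable enqueue) throws IOException {
        boolean success = false;
        try {
            enqueue.run(); // ownership of bytesStream passes to the queued task here
            success = true;
        } finally {
            if (success == false) {
                assert false : "enqueueing is not expected to fail";
                bytesStream.close();
            }
        }
    }
}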

Member
@original-brownbear original-brownbear left a comment

Nice one! Thanks David, sorry this took me so long to get to.
Just a couple of random comments. The doc one is the only one that is somewhat important I think :)

stream.setVersion(version);
clusterState.writeTo(stream);
} catch (IOException e) {
throw new ElasticsearchException("failed to serialize cluster state for publishing to node {}", e, discoveryNode);
Member

Shouldn't this be an assert false;? We should never have a state that we fail to serialize to some version?

Contributor Author

Mmm possibly, especially with an IOException, but in general exceptions during serialization are the last line of defence if there's some new feature that simply can't be represented in an older version. It's definitely better to avoid using new features in mixed clusters where possible but it's not a hard constraint.
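As a made-up example of that last line of defence, a writeTo implementation for a field added in a newer version has to either skip the field or fail when the target stream is too old; the class, field and version constant below are purely illustrative:

import java.io.IOException;

import org.elasticsearch.Version;
import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.common.io.stream.Writeable;

class ExampleMetadata implements Writeable {

    // hypothetical version constant, for illustration only
    private static final Version FIELD_ADDED_IN = Version.V_8_3_0;

    private final String name;
    private final String newOptionalField; // introduced in FIELD_ADDED_IN, may be null

    ExampleMetadata(String name, String newOptionalField) {
        this.name = name;
        this.newOptionalField = newOptionalField;
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        out.writeString(name);
        if (out.getVersion().onOrAfter(FIELD_ADDED_IN)) {
            out.writeOptionalString(newOptionalField);
        } else if (newOptionalField != null) {
            // the last line of defence: this state cannot be represented on the older wire format
            throw new IllegalArgumentException("cannot send [" + newOptionalField + "] to a node older than " + FIELD_ADDED_IN);
        }
    }
}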

super.writeTo(out);
this.state.writeTo(out);
stateSupplier.get().writeTo(out);
Member

Just a comment; this change is still a winner regardless :)

This is such an unfortunate pattern that we have in a couple of spots. Instead of copying the bytes we should really be able to just use them outright by incrementing the ref count, like we do for BytesTransportRequest during publication.
This effectively doubles the memory use relative to what we actually need for the full state compressed on heap, and it still scales as O(N) in the number of nodes we work with.
I'll see if we can clean this up in some neat way now that we only have Netty to worry about :)

Contributor Author

NB `assert out.getVersion().before(Version.V_8_3_0)`: we only call this if the target node doesn't support the new wire format, i.e. the cluster is in the middle of a rolling upgrade. I don't think it's worth optimising this case.
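For readers unfamiliar with the ref-counting idea mentioned a couple of comments up, a generic sketch (the class below is invented; the real transport layer has its own ref-counted bytes abstractions):

import java.util.concurrent.atomic.AtomicInteger;

final class SharedSerializedState {
    private final byte[] bytes;
    private final AtomicInteger refCount = new AtomicInteger(1);

    SharedSerializedState(byte[] bytes) {
        this.bytes = bytes;
    }

    // each outbound request retains the same bytes instead of taking a copy
    SharedSerializedState retain() {
        refCount.incrementAndGet();
        return this;
    }

    // the buffer is recycled or freed only when the last request releases it
    void release() {
        if (refCount.decrementAndGet() == 0) {
            // return the buffer to its recycler here
        }
    }

    byte[] bytes() {
        return bytes;
    }
}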

public String toString() {
return cacheClearer + " after timeout";
}
}, cacheTimeout, ThreadPool.Names.CLUSTER_COORDINATION);
Member

This seems cheap enough to just run on the scheduler thread?

Contributor Author
@DaveCTurner DaveCTurner Apr 26, 2022

It's not guaranteed, I think: a concurrent execute() call could enqueue an expensive JoinValidation ahead of our cacheClearer, and we would end up running it:

sequenceDiagram
    actor Us
    participant Queue
    actor Other thread
    Other thread ->> Queue: Add JoinValidation
    Us ->> Queue: Add cacheCleaner
    Queue ->>+ Us: queueSize.cas(0,1), run tasks
    Queue ->> Other thread: queueSize.cas(1, 2), stop
    Us ->> Queue: JoinValidation completed, queueSize.cas(2, 1), keep going
    Us ->>- Queue: cacheCleaner completed, queueSize.cas(1, 0), stop
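A generic sketch of the queue pattern shown in the diagram (the class below is invented; the real logic presumably lives in the new join validation service): whichever thread bumps the counter from zero becomes the worker and drains the queue, so the thread that enqueues the cheap cacheClearer may first have to run an expensive JoinValidation added concurrently.

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

class SingleConsumerQueueSketch {
    private final Queue<Runnable> queue = new ConcurrentLinkedQueue<>();
    private final AtomicInteger queueSize = new AtomicInteger();

    void execute(Runnable task) {
        queue.add(task); // the add always happens before the matching increment
        if (queueSize.getAndIncrement() == 0) {
            // we moved the count off zero, so we are the worker: drain the queue,
            // including tasks that other threads enqueued concurrently
            do {
                queue.poll().run();
            } while (queueSize.decrementAndGet() > 0);
        }
        // otherwise the current worker will pick our task up
    }
}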

@@ -199,6 +199,15 @@ Sets how long the master node waits for each cluster state update to be
completely published to all nodes, unless `discovery.type` is set to
`single-node`. The default value is `30s`. See <<cluster-state-publishing>>.

`cluster.join_validation.cache_timeout`::
(<<static-cluster-setting,Static>>)
Member

Do we actually want to document this?

This seems like a really tricky implementation detail, for one. Also, maybe we will evolve this functionality further so that it becomes meaningless in the mid term, and then we have to deal with the whole deprecation noise? :)

Contributor Author

We make it a setting just in case there's some unforeseen problem that can be fixed by changing it, and IMO it's a bit rubbish when a problem can be fixed by adjusting a setting that isn't documented. We quite rightly get asked to add docs for such settings later. Note also that this whole section of the docs does have the following disclaimer:

WARNING: If you adjust these settings then your cluster may not form correctly or may become unstable or intolerant of certain failures.

We're OK with making settings become no-ops if the implementation moves on, as we did with cluster.join.timeout in 7.x for example. That's not a breaking change.
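For reference, a static time setting like this is typically declared along these lines; the constant name is an assumption and the 60s default is inferred from the "up to 60s stale" comment above:

import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.core.TimeValue;

class JoinValidationSettingsSketch {
    // Node-scope (static) setting: set in elasticsearch.yml, not dynamically updatable,
    // matching the <<static-cluster-setting,Static>> marker in the docs snippet above.
    public static final Setting<TimeValue> JOIN_VALIDATION_CACHE_TIMEOUT_SETTING = Setting.timeSetting(
        "cluster.join_validation.cache_timeout",
        TimeValue.timeValueSeconds(60),
        Setting.Property.NodeScope
    );
}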

Member
@original-brownbear original-brownbear left a comment

LGTM! I misread a couple of spots before -> stupid questions ... but all clear now :)

@DaveCTurner DaveCTurner merged commit 79f181d into elastic:master Apr 26, 2022
@DaveCTurner DaveCTurner deleted the 2022-03-28-join-validation-service branch April 26, 2022 11:15
@DaveCTurner
Contributor Author

Thanks both!

arteam added a commit that referenced this pull request Oct 24, 2024
We introduced a new join validation protocol in #85380 (8.3); the legacy protocol can be removed in 9.0.

Remove assertion that we run a version after 8.3.0
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Oct 25, 2024
We introduced a new join validation protocol in elastic#85380 (8.3); the legacy protocol can be removed in 9.0.

Remove assertion that we run a version after 8.3.0
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Nov 4, 2024
We introduced a new join validation protocol in elastic#85380 (8.3); the legacy protocol can be removed in 9.0.

Remove assertion that we run a version after 8.3.0
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Nov 20, 2024
Since 8.3.0 (elastic#85380) we have sent join-validation requests as a
`BytesTransportRequest` to facilitate sharing these large messages (and
the work needed to create them) amongst all nodes that join the cluster
at around the same time. For BwC with versions earlier than 8.3.0 we use
a `ValidateJoinRequest` class to represent the received data, whichever
scheme it uses. We no longer need to maintain this compatibility, so we
can use a bare `BytesTransportRequest` on both sender and receiver, and
therefore drop the `ValidateJoinRequest` adapter and the special-cased
assertion in `MockTransportService`.

Relates elastic#114808 which was reverted in elastic#117200.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Nov 21, 2024
Since 8.3.0 (elastic#85380) we have sent join-validation requests as a
`BytesTransportRequest` to facilitate sharing these large messages (and
the work needed to create them) amongst all nodes that join the cluster
at around the same time. For BwC with versions earlier than 8.3.0 we use
a `ValidateJoinRequest` class to represent the received data, whichever
scheme it uses. We no longer need to maintain this compatibility, so we
can use a bare `BytesTransportRequest` on both sender and receiver, and
therefore drop the `ValidateJoinRequest` adapter and the special-cased
assertion in `MockTransportService`.

Relates elastic#114808 which was reverted in elastic#117200.
elasticsearchmachine pushed a commit that referenced this pull request Nov 21, 2024
Since 8.3.0 (#85380) we have sent join-validation requests as a
`BytesTransportRequest` to facilitate sharing these large messages (and
the work needed to create them) amongst all nodes that join the cluster
at around the same time. For BwC with versions earlier than 8.3.0 we use
a `ValidateJoinRequest` class to represent the received data, whichever
scheme it uses. We no longer need to maintain this compatibility, so we
can use a bare `BytesTransportRequest` on both sender and receiver, and
therefore drop the `ValidateJoinRequest` adapter and the special-cased
assertion in `MockTransportService`.

Relates #114808 which was reverted in #117200.
alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this pull request Nov 28, 2024
Since 8.3.0 (elastic#85380) we have sent join-validation requests as a
`BytesTransportRequest` to facilitate sharing these large messages (and
the work needed to create them) amongst all nodes that join the cluster
at around the same time. For BwC with versions earlier than 8.3.0 we use
a `ValidateJoinRequest` class to represent the received data, whichever
scheme it uses. We no longer need to maintain this compatibility, so we
can use a bare `BytesTransportRequest` on both sender and receiver, and
therefore drop the `ValidateJoinRequest` adapter and the special-cased
assertion in `MockTransportService`.

Relates elastic#114808 which was reverted in elastic#117200.
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.3.0
Development

Successfully merging this pull request may close these issues.

A Node Joining a Cluster with a Large State Receives the Full Uncompressed State in a ValidateJoinRequest
6 participants