
Reduce resource needs of join validation #85380

Merged

Conversation

DaveCTurner
Contributor

Fixes a few scalability issues around join validation:

  • compresses the cluster state sent over the wire (see the sketch below)
  • shares the serialized cluster state across multiple nodes
  • forks the decompression/deserialization work off the transport thread

Relates #77466
Closes #83204
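As a rough illustration of the first two bullets (not the exact code added in this PR, which uses the transport layer's recycled, ref-counted buffers), the state is serialized and compressed once and the resulting bytes are reused for every joining node:

import java.io.IOException;

import org.elasticsearch.Version;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.common.compress.CompressorFactory;
import org.elasticsearch.common.io.stream.BytesStreamOutput;
import org.elasticsearch.common.io.stream.OutputStreamStreamOutput;
import org.elasticsearch.common.io.stream.StreamOutput;

class JoinValidationSketch {
    // Serialize and compress the cluster state once; the returned bytes can then be
    // shared by every join-validation request sent during the cache window.
    static BytesReference serializeCompressed(ClusterState clusterState, Version version) throws IOException {
        final BytesStreamOutput bytesStream = new BytesStreamOutput();
        try (
            StreamOutput stream = new OutputStreamStreamOutput(
                CompressorFactory.COMPRESSOR.threadLocalOutputStream(bytesStream)
            )
        ) {
            stream.setVersion(version);
            clusterState.writeTo(stream);
        }
        return bytesStream.bytes();
    }
}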

@DaveCTurner DaveCTurner added >enhancement :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.2.0 labels Mar 28, 2022
@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

@DaveCTurner DaveCTurner force-pushed the 2022-03-28-join-validation-service branch from 40ba7c4 to 387ded0 on March 28, 2022 09:32
@DaveCTurner
Contributor Author

Still needs work on the tests (see AwaitsFix annotations) but I'm opening this to get some CI attention. I doubt I'll get this in for 8.2.

@DaveCTurner DaveCTurner marked this pull request as ready for review April 4, 2022 11:06
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Apr 4, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner
Contributor Author

Think this is good to go now.

Note that it changes the semantics of join validation slightly: we obviously don't send the latest state each time any more; it might be up to 60s stale. Also, we used to be certain that the sender was marked as the master in the state we sent, but that's no longer the case since we get the state twice. I don't think these are meaningful changes, but they're worth pointing out anyway.

return ClusterState.readFrom(input, null);
}
} finally {
IOUtils.close(in);
Contributor

I believe `in` might already be closed by the `try (StreamInput input = in)` line in the case where there is no exception. I wonder if it is possible to simplify the code to a single try-with-resources instead of a nested one.

Contributor Author

Yes, I think you're right; I tidied this up in f9df3b9. This comes from the code in PublicationTransportHandler, which dates back a long way (b80324d and beyond) and has accumulated a bunch of cruft over the years.
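For reference, a minimal sketch of the single-try-with-resources shape discussed here; the method name, the parameters and the NamedWriteableAwareStreamInput wrapper are assumptions for the sketch rather than the exact code in f9df3b9:

import java.io.IOException;

import org.elasticsearch.Version;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.node.DiscoveryNode;
import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.common.compress.CompressorFactory;
import org.elasticsearch.common.io.stream.InputStreamStreamInput;
import org.elasticsearch.common.io.stream.NamedWriteableAwareStreamInput;
import org.elasticsearch.common.io.stream.NamedWriteableRegistry;
import org.elasticsearch.common.io.stream.StreamInput;

class ReadStateSketch {
    // One try-with-resources owns the whole decompress/deserialize chain, so there is
    // no separate finally block that might close the underlying stream a second time.
    static ClusterState readCompressedState(
        BytesReference bytes,
        NamedWriteableRegistry registry,
        Version version,
        DiscoveryNode localNode
    ) throws IOException {
        try (
            StreamInput input = new NamedWriteableAwareStreamInput(
                new InputStreamStreamInput(CompressorFactory.COMPRESSOR.threadLocalInputStream(bytes.streamInput())),
                registry
            )
        ) {
            input.setVersion(version);
            return ClusterState.readFrom(input, localNode);
        }
    }
}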

}

public ClusterState getState() throws IOException {
return stateSupplier.get();
Contributor

Could this be called multiple times?
If so, then the input would be read more than once. Should we consider LazyInitializable here?

Contributor Author

It is only called once, so I think this is fine. I renamed it in fd908df to getOrReadState just to make it a bit clearer that this might be more than a simple getter.
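For completeness, a generic sketch of the memoization that LazyInitializable would provide (the class and interface names below are invented for the example); the PR keeps the plain supplier because the method is only invoked once:

import java.io.IOException;

final class CachedStateSupplier<T> {

    // hypothetical functional interface so the reader may throw IOException
    interface IOSupplier<V> {
        V get() throws IOException;
    }

    private final IOSupplier<T> reader;
    private T cached;

    CachedStateSupplier(IOSupplier<T> reader) {
        this.reader = reader;
    }

    // reads (decompresses and deserializes) at most once and caches the result
    synchronized T getOrRead() throws IOException {
        if (cached == null) {
            cached = reader.get();
        }
        return cached;
    }
}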

Comment on lines +326 to +331
} finally {
if (success == false) {
assert false;
bytesStream.close();
}
}
Contributor

Could this be converted to an unconditional catch or am I missing something?

Contributor Author

I think it's better to close resources in a finally like this. For instance, this will close the stream in tests even if an assertion trips, which we need to do to avoid tripping the leak detector later on. To get the same effect with a catch block we'd need `catch (Throwable t)`, which is not permitted.

Contributor Author
@DaveCTurner DaveCTurner Apr 12, 2022

(on closer inspection I see that on an assertion error we would leak the rest of the queue anyway, see 3af680b)
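Spelling out the idiom under discussion as a generic sketch (the method and its parameters are invented for illustration):

import java.io.Closeable;
import java.io.IOException;

class ReleaseOnFailureSketch {
    // The resource is released in the finally block whenever we fail to reach the
    // hand-off point, even if the failure is an AssertionError in tests; matching
    // that with a catch would require the disallowed catch (Throwable t).
    static void enqueueOrClose(Closeable bytesStream, Runnable enqueue) throws IOException {
        boolean success = false;
        try {
            enqueue.run(); // ownership of bytesStream passes to the queued task here
            success = true;
        } finally {
            if (success == false) {
                assert false : "enqueueing is not expected to fail";
                bytesStream.close();
            }
        }
    }
}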

Member
@original-brownbear original-brownbear left a comment

Nice one! Thanks David, sorry this took me so long to get to.
Just a couple of random comments. The doc one is the only one that is somewhat important I think :)

stream.setVersion(version);
clusterState.writeTo(stream);
} catch (IOException e) {
throw new ElasticsearchException("failed to serialize cluster state for publishing to node {}", e, discoveryNode);
Member

Shouldn't this be an assert false;? We should never have a state that we fail to serialize to some version?

Contributor Author

Mmm possibly, especially with an IOException, but in general exceptions during serialization are the last line of defence if there's some new feature that simply can't be represented in an older version. It's definitely better to avoid using new features in mixed clusters where possible but it's not a hard constraint.
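As a made-up example of that last line of defence, a writeTo implementation for a field added in a newer version has to either skip the field or fail when the target stream is too old; the class, field and version constant below are purely illustrative:

import java.io.IOException;

import org.elasticsearch.Version;
import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.common.io.stream.Writeable;

class ExampleMetadata implements Writeable {

    // hypothetical version constant, for illustration only
    private static final Version FIELD_ADDED_IN = Version.V_8_3_0;

    private final String name;
    private final String newOptionalField; // introduced in FIELD_ADDED_IN, may be null

    ExampleMetadata(String name, String newOptionalField) {
        this.name = name;
        this.newOptionalField = newOptionalField;
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        out.writeString(name);
        if (out.getVersion().onOrAfter(FIELD_ADDED_IN)) {
            out.writeOptionalString(newOptionalField);
        } else if (newOptionalField != null) {
            // the last line of defence: this state cannot be represented on the older wire format
            throw new IllegalArgumentException("cannot send [" + newOptionalField + "] to a node older than " + FIELD_ADDED_IN);
        }
    }
}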

super.writeTo(out);
this.state.writeTo(out);
stateSupplier.get().writeTo(out);
Member

Just a comment; this change is still a winner regardless :)

This is such an unfortunate pattern that we have in a couple of spots. Instead of copying the bytes we should really be able to just use them outright by incrementing the ref count, like we do for BytesTransportRequest during publication.
This effectively doubles the memory use relative to what we actually need for the full state compressed on heap, and it still scales as O(N) in the number of nodes we work with.
I'll see if we can clean this up in some neat way now that we only have Netty to worry about :)

Contributor Author

NB `assert out.getVersion().before(Version.V_8_3_0)`: we only call this if the target node doesn't support the new wire format, i.e. the cluster is in the middle of a rolling upgrade. I don't think it's worth optimising this case.
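For readers unfamiliar with the ref-counting idea mentioned a couple of comments up, a generic sketch (the class below is invented; the real transport layer has its own ref-counted bytes abstractions):

import java.util.concurrent.atomic.AtomicInteger;

final class SharedSerializedState {
    private final byte[] bytes;
    private final AtomicInteger refCount = new AtomicInteger(1);

    SharedSerializedState(byte[] bytes) {
        this.bytes = bytes;
    }

    // each outbound request retains the same bytes instead of taking a copy
    SharedSerializedState retain() {
        refCount.incrementAndGet();
        return this;
    }

    // the buffer is recycled or freed only when the last request releases it
    void release() {
        if (refCount.decrementAndGet() == 0) {
            // return the buffer to its recycler here
        }
    }

    byte[] bytes() {
        return bytes;
    }
}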

public String toString() {
return cacheClearer + " after timeout";
}
}, cacheTimeout, ThreadPool.Names.CLUSTER_COORDINATION);
Member

This seems cheap enough to just run on the scheduler thread?

Contributor Author
@DaveCTurner DaveCTurner Apr 26, 2022

It's not guaranteed, I think: a concurrent execute() call could enqueue an expensive JoinValidation ahead of our cacheClearer, and we would end up running it:

sequenceDiagram
    actor Us
    participant Queue
    actor Other thread
    Other thread ->> Queue: Add JoinValidation
    Us ->> Queue: Add cacheCleaner
    Queue ->>+ Us: queueSize.cas(0,1), run tasks
    Queue ->> Other thread: queueSize.cas(1, 2), stop
    Us ->> Queue: JoinValidation completed, queueSize.cas(2, 1), keep going
    Us ->>- Queue: cacheCleaner completed, queueSize.cas(1, 0), stop
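A generic sketch of the queue pattern shown in the diagram (the class below is invented; the real logic presumably lives in the new join validation service): whichever thread bumps the counter from zero becomes the worker and drains the queue, so the thread that enqueues the cheap cacheClearer may first have to run an expensive JoinValidation added concurrently.

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

class SingleConsumerQueueSketch {
    private final Queue<Runnable> queue = new ConcurrentLinkedQueue<>();
    private final AtomicInteger queueSize = new AtomicInteger();

    void execute(Runnable task) {
        queue.add(task); // the add always happens before the matching increment
        if (queueSize.getAndIncrement() == 0) {
            // we moved the count off zero, so we are the worker: drain the queue,
            // including tasks that other threads enqueued concurrently
            do {
                queue.poll().run();
            } while (queueSize.decrementAndGet() > 0);
        }
        // otherwise the current worker will pick our task up
    }
}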

@@ -199,6 +199,15 @@ Sets how long the master node waits for each cluster state update to be
completely published to all nodes, unless `discovery.type` is set to
`single-node`. The default value is `30s`. See <<cluster-state-publishing>>.

`cluster.join_validation.cache_timeout`::
(<<static-cluster-setting,Static>>)
Member

Do we actually want to document this?

This seems like a really tricky implementation detail, for one. Also, maybe we will evolve this functionality further so that it becomes meaningless in the mid term, and then we have to deal with the whole deprecation noise? :)

Contributor Author

We make it a setting just in case there's some unforeseen problem that can be fixed by changing it, and IMO it's a bit rubbish when a problem can be fixed by adjusting a setting that isn't documented. We quite rightly get asked to add docs for such settings later. Note also that this whole section of the docs does have the following disclaimer:

WARNING: If you adjust these settings then your cluster may not form correctly or may become unstable or intolerant of certain failures.

We're OK with making settings become no-ops if the implementation moves on, as we did with cluster.join.timeout in 7.x for example. That's not a breaking change.
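For reference, a static time setting like this is typically declared along these lines; the constant name is an assumption and the 60s default is inferred from the "up to 60s stale" comment above:

import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.core.TimeValue;

class JoinValidationSettingsSketch {
    // Node-scope (static) setting: set in elasticsearch.yml, not dynamically updatable,
    // matching the <<static-cluster-setting,Static>> marker in the docs snippet above.
    public static final Setting<TimeValue> JOIN_VALIDATION_CACHE_TIMEOUT_SETTING = Setting.timeSetting(
        "cluster.join_validation.cache_timeout",
        TimeValue.timeValueSeconds(60),
        Setting.Property.NodeScope
    );
}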

Member
@original-brownbear original-brownbear left a comment

LGTM! I misread a couple of spots before -> stupid questions ... but all clear now :)

@DaveCTurner DaveCTurner merged commit 79f181d into elastic:master Apr 26, 2022
@DaveCTurner DaveCTurner deleted the 2022-03-28-join-validation-service branch April 26, 2022 11:15
@DaveCTurner
Contributor Author

Thanks both!

arteam added a commit that referenced this pull request Oct 24, 2024
We introduced a new join validation protocol in #85380 (8.3); the legacy protocol can be removed in 9.0.

Remove assertion that we run a version after 8.3.0
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Oct 25, 2024
We introduced a new join validation protocol in elastic#85380 (8.3); the legacy protocol can be removed in 9.0.

Remove assertion that we run a version after 8.3.0
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Nov 4, 2024
We introduced a new join validation protocol in elastic#85380 (8.3); the legacy protocol can be removed in 9.0.

Remove assertion that we run a version after 8.3.0
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Nov 20, 2024
Since 8.3.0 (elastic#85380) we have sent join-validation requests as a
`BytesTransportRequest` to facilitate sharing these large messages (and
the work needed to create them) amongst all nodes that join the cluster
at around the same time. For BwC with versions earlier than 8.3.0 we use
a `ValidateJoinRequest` class to represent the received data, whichever
scheme it uses. We no longer need to maintain this compatibility, so we
can use a bare `BytesTransportRequest` on both sender and receiver, and
therefore drop the `ValidateJoinRequest` adapter and the special-cased
assertion in `MockTransportService`.

Relates elastic#114808 which was reverted in elastic#117200.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Nov 21, 2024
Since 8.3.0 (elastic#85380) we have sent join-validation requests as a
`BytesTransportRequest` to facilitate sharing these large messages (and
the work needed to create them) amongst all nodes that join the cluster
at around the same time. For BwC with versions earlier than 8.3.0 we use
a `ValidateJoinRequest` class to represent the received data, whichever
scheme it uses. We no longer need to maintain this compatibility, so we
can use a bare `BytesTransportRequest` on both sender and receiver, and
therefore drop the `ValidateJoinRequest` adapter and the special-cased
assertion in `MockTransportService`.

Relates elastic#114808 which was reverted in elastic#117200.
elasticsearchmachine pushed a commit that referenced this pull request Nov 21, 2024
Since 8.3.0 (#85380) we have sent join-validation requests as a
`BytesTransportRequest` to facilitate sharing these large messages (and
the work needed to create them) amongst all nodes that join the cluster
at around the same time. For BwC with versions earlier than 8.3.0 we use
a `ValidateJoinRequest` class to represent the received data, whichever
scheme it uses. We no longer need to maintain this compatibility, so we
can use a bare `BytesTransportRequest` on both sender and receiver, and
therefore drop the `ValidateJoinRequest` adapter and the special-cased
assertion in `MockTransportService`.

Relates #114808 which was reverted in #117200.
alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this pull request Nov 28, 2024
Since 8.3.0 (elastic#85380) we have sent join-validation requests as a
`BytesTransportRequest` to facilitate sharing these large messages (and
the work needed to create them) amongst all nodes that join the cluster
at around the same time. For BwC with versions earlier than 8.3.0 we use
a `ValidateJoinRequest` class to represent the received data, whichever
scheme it uses. We no longer need to maintain this compatibility, so we
can use a bare `BytesTransportRequest` on both sender and receiver, and
therefore drop the `ValidateJoinRequest` adapter and the special-cased
assertion in `MockTransportService`.

Relates elastic#114808 which was reverted in elastic#117200.
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.3.0
Development

Successfully merging this pull request may close these issues.

A Node Joining a Cluster with a Large State Receives the Full Uncompressed State in a ValidateJoinRequest
6 participants