-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A Node Joining a Cluster with a Large State Receives the Full Uncompressed State in a ValidateJoinRequest #83204
Comments
Pinging @elastic/es-distributed (Team:Distributed) |
It gets worse if a load of nodes all try and join at the same time, for instance after a network partition heals, because then the master ends up needing to allocate O(100s of MBs) buffers for each node. The duplicated serialization work is not totally critical although it'd be nice to also solve that. Sharing the serialised state across concurrent validation requests like we do with publications would be good I think. A few months back I also took a look at ways to bound the memory usage to something more sensible in both validation and publication. I think we could buffer the serialised representation of the cluster state on disk, send it in bounded-size chunks, accumulate the chunks on disk on the receiver, and then stream the state straight from disk. Not sure how valuable this would be - if we can't even hold O(1) copy of the serialised cluster state in memory then we're probably close to breaking point in other ways too. I see other advantages to chunking without the disk buffering too, e.g. on some other failure it would let us bail out part-way through rather than continuing to push 100s of MBs of unnecessary data over the wire. |
Currently the serialized uncompressed state is actually significantly larger than the on-heap state due to (index-) setting deduplication. The on-heap state in the example cluster I used here is about 1/10 (50M) of those 500M. The problem here really is with the serialisation. The state barely grows by adding more indices due to the deduplications we do at this point. But the wire-format only takes advantage of them for mappings, but not for settings which hurt more than mappings to begin with because we store mappings serialized+compressed. I think ideally we'd do all of the above to fix this. Compress the message (I didn't try it out but I think that gets us about 80% heap savings in this example) and make for some chunked sending. This relates to the discussion I'd like to have on #82245 I think. If we were able to serialize and flush in chunks even without adjusting the wire format and also read in chunks, this would be a much smaller issue. |
Fixes a few scalability issues around join validation: - compresses the cluster state sent over the wire - shares the serialized cluster state across multiple nodes - forks the decompression/deserialization work off the transport thread Relates elastic#77466 Closes elastic#83204
The
ValidateJoinRequest
contains the cluster state uncompressed.This causes problems once the cluster state reaches a certain size. For one it requires a massive amount of memory even after #82608 but also, reading the full state on the transport thread outright (unlike with the publication handler that deserializes on
GENERIC
) is too slow.For a 40k indices cluster with beats mappings and an admittedly large number of data streams this is what happens:
We receive and deserialise a 500M+ message on the transport thread.
This becomes troublesome due the heap required just to buffer the message on a fresh master node that might otherwise be capable of handling this kind of cluster state (it's smaller on heap due to setting+mapping deduplication).
The slowness on the transport thread can mostly be blamed on the time it takes to read index settings.
This relates #80493 and setting deduplication in general. Ideally we should find a way of deduplicating the settings better to make the message smaller. Until that time a reasonable solution might be to simply compress the state in the message and read it as plain bytes, then deserialise on
GENERIC
like we do for the publication handler.An additional issue with this is that the master/sending node has to serialize this message in full which puts a problematic amount of strain on it potentially.
The text was updated successfully, but these errors were encountered: