Move 'lost cluster state updates' issue to DONE #36959

Merged
43 changes: 27 additions & 16 deletions docs/resiliency/index.asciidoc
@@ -63,22 +63,6 @@ to create new scenarios. We have currently ported all published Jepsen scenarios
framework. As the Jepsen tests evolve, we will continue porting new scenarios that are not covered yet. We are committed to investigating
all new scenarios and will report issues that we find on this page and in our GitHub repository.

[float]
=== Repeated network partitions can cause cluster state updates to be lost (STATUS: ONGOING)

During a network partition, cluster state updates (like mapping changes or shard assignments)
are committed if a majority of the master-eligible nodes received the update correctly. This means that the current master has access
to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch
up with the current state and receive the previously missed changes. However, if a second partition happens while the cluster
is still recovering from the previous one *and* the old master falls on the minority side, it may be that a new master is elected
which has not yet caught up. If that happens, cluster state updates can be lost.

This problem is mostly fixed by {GIT}20384[#20384] (v5.0.0), which takes committed cluster state updates into account during master
election. This considerably reduces the chance of this rare problem occurring but does not fully mitigate it. If the second partition
happens concurrently with a cluster state update and blocks the cluster state commit message from reaching a majority of nodes, it may be
that the in-flight update will be lost. If the now-isolated master can still acknowledge the cluster state update to the client, this
will amount to the loss of an acknowledged change. Fixing that last scenario needs considerable work. We are currently working on it but have no ETA yet.

[float]
=== Better request retry mechanism when nodes are disconnected (STATUS: ONGOING)

@@ -170,6 +154,33 @@ shard.

== Completed

[float]
=== Repeated network partitions can cause cluster state updates to be lost (STATUS: DONE, v7.0.0)

During a network partition, cluster state updates (like mapping changes or
shard assignments) are committed if a majority of the master-eligible nodes
received the update correctly. This means that the current master has access to
enough nodes in the cluster to continue to operate correctly. When the network
partition heals, the isolated nodes catch up with the current state and receive
the previously missed changes. However, if a second partition happens while the
cluster is still recovering from the previous one *and* the old master falls on
the minority side, it may be that a new master is elected which has not yet
caught up. If that happens, cluster state updates can be lost.
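
To make the quorum rule above concrete, here is a minimal sketch of how a
master might decide whether a cluster state update can be committed. The class
and method names are invented for this illustration; it is not Elasticsearch's
actual publishing code.

[source,java]
----
// A minimal sketch (not Elasticsearch's actual publishing code) of the
// majority rule described above: a cluster state update may only be
// committed once more than half of the master-eligible nodes have
// acknowledged receipt of the new state.
class ClusterStateCommitSketch {

    /** True if the acknowledgements form a strict majority of master-eligible nodes. */
    static boolean canCommit(int acksReceived, int masterEligibleNodes) {
        return acksReceived > masterEligibleNodes / 2;
    }

    public static void main(String[] args) {
        // With 5 master-eligible nodes, 3 acks form a quorum and 2 do not.
        System.out.println(canCommit(3, 5)); // true  -> the update can be committed
        System.out.println(canCommit(2, 5)); // false -> the update must not be committed
    }
}
----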

This problem is mostly fixed by {GIT}20384[#20384] (v5.0.0), which takes
committed cluster state updates into account during master election. This
considerably reduces the chance of this rare problem occurring but does not
fully mitigate it. If the second partition happens concurrently with a cluster
state update and blocks the cluster state commit message from reaching a
majority of nodes, it may be that the in-flight update will be lost. If the
now-isolated master can still acknowledge the cluster state update to the
client, this will amount to the loss of an acknowledged change.
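
The sketch below illustrates the idea behind {GIT}20384[#20384]: during master
election, prefer the candidate holding the most recently committed cluster
state, so a node that missed committed updates while isolated is not elected.
The types and election logic are a hypothetical simplification, not the real
election code, and as described above this does not protect an update whose
commit message only reached a minority of nodes.

[source,java]
----
import java.util.Comparator;
import java.util.List;

// Hypothetical simplification of taking committed cluster state updates into
// account during master election; not Elasticsearch's real election code.
class MasterElectionSketch {

    record Candidate(String nodeId, long committedStateVersion) {}

    /** Elects the candidate that has applied the highest committed cluster state version. */
    static Candidate electMaster(List<Candidate> candidates) {
        return candidates.stream()
                .max(Comparator.comparingLong(Candidate::committedStateVersion))
                .orElseThrow();
    }

    public static void main(String[] args) {
        Candidate upToDate = new Candidate("node-a", 42); // received the latest committed update
        Candidate stale = new Candidate("node-b", 40);    // was isolated during the partition
        // node-a wins the election, so the committed updates it holds are not lost.
        System.out.println(electMaster(List.of(upToDate, stale)).nodeId()); // prints node-a
    }
}
----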

Fixing this last scenario was one of the goals of {GIT}32006[#32006] and its
sub-issues. See particularly {GIT}32171[#32171] and
https://github.com/elastic/elasticsearch-formal-models/blob/master/ZenWithTerms/tla/ZenWithTerms.tla[the
TLA+ formal model] used to verify these changes.

[float]
=== Divergence between primary and replica shard copies when documents deleted (STATUS: DONE, v6.3.0)
