From 8eee28e798fdd8b4e6c18d7a908872f4829f9c00 Mon Sep 17 00:00:00 2001
From: Boaz Leskes
Date: Thu, 7 Apr 2016 12:17:13 +0200
Subject: [PATCH] Update resiliency page (#17586)

#14252 , #7572 , #15900, #12573, #14671, #15281 and #9126 have all been closed/merged and will be part of 5.0.0.
---
 docs/resiliency/index.asciidoc | 44 ++++++++++++++++++----------------
 1 file changed, 23 insertions(+), 21 deletions(-)

diff --git a/docs/resiliency/index.asciidoc b/docs/resiliency/index.asciidoc
index 938556f8428e5..24035e0772d8c 100644
--- a/docs/resiliency/index.asciidoc
+++ b/docs/resiliency/index.asciidoc
@@ -94,8 +94,27 @@ space. The following issues have been identified:
 
 Other safeguards are tracked in the meta-issue {GIT}11511[#11511].
 
+
+[float]
+=== Relocating shards omitted by reporting infrastructure (STATUS: ONGOING)
+
+Indices stats and indices segments requests reach out to all nodes that have shards of that index. Shards that have relocated from a node
+while the stats request arrives will make that part of the request fail and are just ignored in the overall stats result. {GIT}13719[#13719]
+
+[float]
+=== Jepsen Test Failures (STATUS: ONGOING)
+
+We have increased our test coverage to include scenarios tested by Jepsen. We make heavy use of randomization to expand on the scenarios that can be tested and to introduce new error conditions. You can follow the work on the master branch of the https://github.com/elastic/elasticsearch/blob/master/core/src/test/java/org/elasticsearch/discovery/DiscoveryWithServiceDisruptionsIT.java[`DiscoveryWithServiceDisruptionsIT` class], where we will add more tests as time progresses.
+
+[float]
+=== Document guarantees and handling of failure (STATUS: ONGOING)
+
+This status page is a start, but we can do a better job of explicitly documenting the processes at work in Elasticsearch, and what happens in the case of each type of failure. The plan is to have a test case that validates each behavior under simulated conditions. Every test will document the expected results, the associated test code and an explicit PASS or FAIL status for each simulated case.
+
+== Unreleased
+
 [float]
-=== Loss of documents during network partition (STATUS: ONGOING)
+=== Loss of documents during network partition (STATUS: UNRELEASED, v5.0.0)
 
 If a network partition separates a node from the master, there is some window of time before the node detects it. The length of the window is dependent on the type of the partition. This window is extremely small if a socket is broken. More adversarial partitions, for example, silently dropping requests without breaking the socket can take longer (up to 3x30s using current defaults).
 
@@ -103,7 +122,7 @@ If the node hosts a primary shard at the moment of partition, and ends up being
 To prevent this situation, the primary needs to wait for the master to acknowledge replica shard failures before acknowledging the write to the client. {GIT}14252[#14252]
 
 [float]
-=== Safe primary relocations (STATUS: ONGOING)
+=== Safe primary relocations (STATUS: UNRELEASED, v5.0.0)
 
 When primary relocation completes, a cluster state is propagated that deactivates the old primary and marks the new primary as active. As
 cluster state changes are not applied synchronously on all nodes, there can be a time interval where the relocation target has processed the
@@ -117,23 +136,7 @@ on the relocation target, each of the nodes believes the other to be the active
 chasing the primary being quickly sent back and forth between the nodes, potentially making them both go OOM. {GIT}12573[#12573]
 
 [float]
-=== Relocating shards omitted by reporting infrastructure (STATUS: ONGOING)
-
-Indices stats and indices segments requests reach out to all nodes that have shards of that index. Shards that have relocated from a node
-while the stats request arrives will make that part of the request fail and are just ignored in the overall stats result. {GIT}13719[#13719]
-
-[float]
-=== Jepsen Test Failures (STATUS: ONGOING)
-
-We have increased our test coverage to include scenarios tested by Jepsen. We make heavy use of randomization to expand on the scenarios that can be tested and to introduce new error conditions. You can follow the work on the master branch of the https://github.com/elastic/elasticsearch/blob/master/core/src/test/java/org/elasticsearch/discovery/DiscoveryWithServiceDisruptionsIT.java[`DiscoveryWithServiceDisruptionsIT` class], where we will add more tests as time progresses.
-
-[float]
-=== Document guarantees and handling of failure (STATUS: ONGOING)
-
-This status page is a start, but we can do a better job of explicitly documenting the processes at work in Elasticsearch, and what happens in the case of each type of failure. The plan is to have a test case that validates each behavior under simulated conditions. Every test will document the expected results, the associated test code and an explicit PASS or FAIL status for each simulated case.
-
-[float]
-=== Do not allow stale shards to automatically be promoted to primary (STATUS: ONGOING, v5.0.0)
+=== Do not allow stale shards to automatically be promoted to primary (STATUS: UNRELEASED, v5.0.0)
 
 In some scenarios, after the loss of all valid copies, a stale replica shard can be automatically assigned as a primary, preferring old data
 to no data at all ({GIT}14671[#14671]). This can lead to a loss of acknowledged writes if the valid copies are not lost but are rather
@@ -143,7 +146,7 @@ for one of the good shard copies to reappear. In case where all good copies are
 stale shard copy.
 
 [float]
-=== Make index creation resilient to index closing and full cluster crashes (STATUS: ONGOING, v5.0.0)
+=== Make index creation resilient to index closing and full cluster crashes (STATUS: UNRELEASED, v5.0.0)
 
 Recovering an index requires a quorum (with an exception for 2) of shard copies to be available to allocate a primary. This means that
 a primary cannot be assigned if the cluster dies before enough shards have been allocated ({GIT}9126[#9126]). The same happens if an index
@@ -153,7 +156,6 @@ recover an index in the presence of a single shard copy. Allocation IDs can also
 but none of the shards have been started. If such an index was inadvertently closed before at least one shard could be started, a fresh
 shard will be allocated upon reopening the index.
 
-== Unreleased
 
 [float]
 === Use two phase commit for Cluster State publishing (STATUS: UNRELEASED, v5.0.0)