Primary relocation handoff #15900
Conversation
Thanks @ywelsch. Can we open a PR with just the operation counter, extracted to its own Java file? We can work on the primary relocation afterwards. Also, I would love it if we can keep the counter naming out of IndexShard (for now). Let's try to keep the change small.
Force-pushed from d5a98ce to 9fa9742
@@ -995,18 +995,17 @@ protected boolean shouldExecuteReplication(Settings settings) {

     static class IndexShardReference implements Releasable {

-        final private IndexShard counter;
+        private final Releasable operationLock;
Can we keep the shard reference? I know it's not needed here, but it serves as a proxy to the shard, allowing us to do more things - see https://github.com/elastic/elasticsearch/pull/15485/files#diff-a8aefbf42f29dc0fcc7c0a144863948eR1104
Force-pushed from 3bd3e8f to 22d57eb
@bleskes I've updated the PR according to our discussion and considered the following 4 scenarios:
    Tuple<Response, ReplicaRequest> primaryResponse = shardOperationOnPrimary(state.metaData(), request);
    if (logger.isTraceEnabled()) {
        logger.trace("action [{}] completed on shard [{}] for request [{}] with cluster state version [{}]", transportPrimaryAction, shardId, request, state.version());
    indexShardReference = getIndexShardOperationsCounter(shardId, true);
Can we call the method getIndexShardReference? Also, I like explicit naming here better than a primary boolean. See the seq_no branch for an example.
@@ -995,22 +1028,43 @@ protected boolean shouldExecuteReplication(Settings settings) {
         return IndexMetaData.isIndexUsingShadowReplicas(settings) == false;
     }

-    static class IndexShardReference implements Releasable {
+    interface IndexShardReference extends Releasable {
Since we only have one impl of this, can we make it a concrete class and drop the interface?
The interface helps us to mock the implementation in the unit tests.
wait how can you not override a concrete class?
Technically it is possible, but not nice to do in this particular case. The implementation class acquires the lock in its constructor (yes, one can argue that this can be done in a separate method, but then the lock cannot be a final variable in the class anymore). Also the constructor takes an IndexShard, which is irrelevant for the class that we mock. I still prefer having a small interface as the current implementation class does not share any code / behavior with the mocked version in the tests.
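For illustration, the pattern under discussion might look roughly like the following self-contained sketch; every type and method name here is an invented stand-in, not the actual Elasticsearch code. The concrete implementation acquires the operation lock in its constructor and keeps it in a final field, which is exactly why a small interface is what lets the tests substitute a lock-free mock.

    // Hypothetical sketch; names are illustrative stand-ins, not the real Elasticsearch types.
    interface Releasable extends AutoCloseable {
        @Override
        void close();
    }

    // Stand-in for the shard, exposing only what this sketch needs.
    interface IndexShardStandIn {
        Releasable acquireOperationLock();
    }

    // The small surface shared by production code and tests.
    interface IndexShardReference extends Releasable {
    }

    final class IndexShardReferenceImpl implements IndexShardReference {
        // The field can be final only because the lock is acquired in the constructor,
        // which is what makes overriding the concrete class in tests awkward.
        private final Releasable operationLock;

        IndexShardReferenceImpl(IndexShardStandIn indexShard) {
            this.operationLock = indexShard.acquireOperationLock();
        }

        @Override
        public void close() {
            operationLock.close();
        }
    }

    // A unit test can implement the interface without a shard or a real lock.
    final class MockIndexShardReference implements IndexShardReference {
        @Override
        public void close() {
            // nothing to release
        }
    }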
@ywelsch this looks awesome. I left some comments
Force-pushed from badcd06 to 8e98d32
-    IndexShardReference(IndexShard counter) {
-        counter.incrementOperationCounter();
-        this.counter = counter;
+    IndexShardReferenceImpl(IndexShard indexShard, boolean primaryAction) {
I am curious: isn't the boolean implicit from the IndexShard? I mean, its shard routing should say primary if the boolean is true, no?
Yes, but the converse does not hold. If we replicate an index request to a primary (e.g. a relocation target that is still recovering), then we take the replica operation lock. Currently, taking the primary or replica operation lock is pretty much the same thing. In the seq_no branch, this is not the case anymore. @bleskes suggested separating the methods in this PR already as this will make it easier to merge subsequent changes into the seq_no branch.
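A rough sketch of why the boolean cannot be derived from the shard itself (again with invented, self-contained names rather than the real classes): the caller has to state explicitly which lock it wants, because a recovering relocation target already carries a primary routing yet must take the replica lock for incoming replicated writes.

    // Hypothetical, self-contained sketch; not the actual Elasticsearch code.
    interface OperationLock extends AutoCloseable {
        @Override
        void close();
    }

    // Stand-in shard offering the two (currently similar, later diverging) lock types.
    interface ShardLocks {
        OperationLock acquirePrimaryOperationLock();
        OperationLock acquireReplicaOperationLock();
    }

    final class ShardOperationReference implements AutoCloseable {
        private final OperationLock operationLock;

        // The transport action supplies the flag: a replicated write that arrives at a
        // recovering relocation target must take the replica lock, even though that
        // shard's routing already says "primary".
        ShardOperationReference(ShardLocks shard, boolean primaryAction) {
            this.operationLock = primaryAction
                    ? shard.acquirePrimaryOperationLock()
                    : shard.acquireReplicaOperationLock();
        }

        @Override
        public void close() {
            operationLock.close();
        }
    }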
I did another round with minor comments. LGTM otherwise.
Force-pushed from 8e98d32 to 6960158
Adds a container that represents a resource with reference counting capabilities. Provides operations to suspend acquisition of new references. Useful for resource management when resources are intermittently unavailable. Closes elastic#15956
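As a rough illustration of the kind of container this commit message describes, here is a simplified, hypothetical sketch; the class and method names are invented for this example, and the real implementation is more careful about edge cases such as double release.

    import java.io.Closeable;

    // Hypothetical sketch of a reference-counted resource whose acquisition can be suspended.
    final class SuspendableRefsSketch {
        private int refs = 0;
        private boolean suspended = false;

        // Returns a handle that must be closed, or null if acquisition is currently suspended.
        synchronized Closeable tryAcquire() {
            if (suspended) {
                return null;
            }
            refs++;
            return this::release;
        }

        private synchronized void release() {
            refs--;
            if (refs == 0) {
                notifyAll();
            }
        }

        // Suspends new acquisitions and blocks until all outstanding references are released.
        synchronized void suspendAndWaitForActiveRefs() throws InterruptedException {
            suspended = true;
            while (refs > 0) {
                wait();
            }
        }

        // Allows acquisition of new references again.
        synchronized void resume() {
            suspended = false;
        }
    }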
When primary relocation completes, a cluster state is propagated that deactivates the old primary and marks the new primary as active. As cluster state changes are not applied synchronously on all nodes, there can be a time interval where the relocation target has processed the cluster state and believes itself to be the active primary while the relocation source has not yet processed the cluster state update and still believes itself to be the active primary. This commit ensures that, before completing the relocation, the relocation source deactivates writing to its store and delegates requests to the relocation target. Closes elastic#15900
Force-pushed from 6960158 to 10b5ffc
Reverts 7d3da91 after the fix in elastic#15900. Closes elastic#8706
There is a bug (document loss) with this which should be fixed by #15900 but it will not be backported so we should not test this.
When primary relocation completes, a cluster state is propagated that deactivates the old primary and marks the new primary as active. As cluster state changes are not applied synchronously on all nodes, there can be a time interval where the relocation target has processed the cluster state and believes itself to be the active primary while the relocation source has not yet processed the cluster state update and still believes itself to be the active primary. This PR ensures that, before completing the relocation, the relocation source deactivates writing to its store and delegates requests to the relocation target.
The change is motivated as follows:
We need to ensure that we only start writing data into the new primary once all the writes into the old primary have been completely replicated (among others, to the new primary). This ensures that the new primary operates on the proper document version numbers. Document versions are increased when writing to the primary and then used on the replica to make sure that newer documents are not overridden by older documents (in the presence of concurrent replication). A scenario for this would be: write a document with id "K" and value "X" to the old primary (it gets version 1) and replicate it to the new primary as well as the replica. Assume that another document with id "K" but value "Y" is written on the new primary before the new primary gets the replicated write of "K" with value "X". Unaware of the other write, it will then assign the same version number (namely 1) to the document with value "Y" and replicate it to the replica. Depending on the order in which the replicated writes from the old and new primary arrive at the replica, it will then either store "X" or "Y", which means that the new primary and the replica can become out of sync.
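To make this concrete, here is a toy, hedged simulation (plain Java, not Elasticsearch code) of a replica that keeps only the highest version seen per document id; with two shards both acting as primary and both handing out version 1, the result depends purely on which replicated write arrives first.

    import java.util.HashMap;
    import java.util.Map;

    // Toy simulation of the version check sketched above; not Elasticsearch code.
    final class ReplicaVersionDemo {
        record Doc(long version, String value) {}

        static final Map<String, Doc> replica = new HashMap<>();

        // The replica only applies a replicated write that is strictly newer than what it holds.
        static void applyReplicatedWrite(String id, long version, String value) {
            Doc current = replica.get(id);
            if (current == null || version > current.version()) {
                replica.put(id, new Doc(version, value));
            }
            // otherwise the write is treated as stale and dropped
        }

        public static void main(String[] args) {
            // The old primary indexes {K=X} as version 1; the new primary, unaware of that write,
            // also assigns version 1 to {K=Y}. Whichever replicated write reaches the replica
            // first wins, so the replica can end up disagreeing with the new primary.
            applyReplicatedWrite("K", 1, "X");
            applyReplicatedWrite("K", 1, "Y"); // dropped: same version, so not considered newer
            System.out.println(replica.get("K").value()); // prints X
        }
    }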
We have to ensure that no new writes are done on the old primary once we start writing into the new primary. This helps with the following scenario. Assume primary relocation completes and the master broadcasts a cluster state which now only contains the new primary. Due to the distributed nature of Elasticsearch, cluster states are not applied in full synchrony on all nodes. For a brief moment, nodes in the cluster have a different view of which node is the primary. In particular, it's possible that the node holding the old primary (node A) still believes itself to be the primary whereas the node holding the new primary (node B) believes itself to be the primary as well. If we send a document to node B, it will get indexed into the new primary and acknowledged (but won't exist on the old primary). If we then issue a delete request for the same document to node A (which can happen if we send requests round-robin to nodes), then that node will not find the document in its old primary and will fail the request.
This PR (in combination with #19013) implements the following solution:
Before completing the relocation, node A (holding the primary relocation source) deactivates writing to its shard copy (and temporarily puts all new incoming requests for that shard into a queue), then waits for all ongoing operations to be fully replicated. Once that is done, it delegates all new incoming requests to node B (holding the new primary) and also sends all the elements in the queue there. It uses a special action to delegate requests to node B, which bypasses the standard reroute phase when accepting requests, as standard rerouting is based on the current cluster state on the node; at that moment, indexing requests that go directly to node B will still be rerouted back to node A with the old primary. This means that node A is still in charge of indexing, but will use the physical shard copy on node B to do so. Node B finally asks the master to activate the new primary (finish the relocation). The master then broadcasts a new cluster state where the old primary on node A is removed and the new primary on node B is active. It doesn't matter now in which order the cluster state is applied on nodes A and B.
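A heavily simplified, hypothetical outline of the relocation-source side of this handoff is sketched below; every type and method name is invented for illustration (the real implementation is spread across several Elasticsearch classes and transport actions), and operations are executed under a single lock only to keep the sketch short.

    import java.util.ArrayDeque;
    import java.util.Queue;

    // Hypothetical outline of the old primary's side of the handoff; not the actual code.
    final class PrimaryHandoffSketch {

        interface Request {
            void executeLocally(); // normal primary indexing path (details elided)
        }

        interface RelocationTarget {
            void forward(Request request);        // delegation action that bypasses rerouting
            void askMasterToActivateNewPrimary();
        }

        private final Queue<Request> queued = new ArrayDeque<>();
        private int activeOperations = 0;
        private boolean handoffStarted = false;
        private RelocationTarget delegate = null;

        // Incoming write while this node still believes it is the primary.
        synchronized void handleRequest(Request request) {
            if (delegate != null) {
                delegate.forward(request);        // after handoff: the old primary only forwards
            } else if (handoffStarted) {
                queued.add(request);              // handoff in progress: park the request
            } else {
                activeOperations++;
                try {
                    request.executeLocally();     // normal path; replication to copies elided
                } finally {
                    activeOperations--;
                    notifyAll();
                }
            }
        }

        // Called when relocation of the primary to the target is about to complete.
        synchronized void handOff(RelocationTarget target) throws InterruptedException {
            handoffStarted = true;                  // 1. stop starting new local operations
            while (activeOperations > 0) {
                wait();                             //    and wait for in-flight ones to drain
            }
            delegate = target;                      // 2. delegate all further requests...
            for (Request r : queued) {
                target.forward(r);                  //    ...and flush everything that was queued
            }
            queued.clear();
            target.askMasterToActivateNewPrimary(); // 3. the master then broadcasts the final state
        }
    }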
Supersedes #15532