Cluster state delay can cause endless index request loop #12573
Labels: >bug, :Distributed Indexing/Recovery (anything around constructing a new shard, either from a local or a remote source)

Comments
here is a test that reproduces this: #12574
clintongormley added the :Distributed Indexing/Recovery and >bug labels on Jan 26, 2016
I think this will be closed by #15900
I've opened #16274 to address this issue.
ywelsch pushed a commit to ywelsch/elasticsearch that referenced this issue on Jul 7, 2016 (merged): "…cal routing table" (Closes elastic#16274, Closes elastic#12573, Closes elastic#12574)
When a primary is relocating from `node_1` to `node_2`, there can be a short time where the old primary has already been removed from `node_1` (closed, not deleted) but the new primary on `node_2` is still in `POST_RECOVERY`. In this state, indexing requests might be sent back and forth between `node_1` and `node_2` endlessly.

Course of events:
1. The primary (`[index][0]`) relocates from `node_1` to `node_2`.
2. `node_2` is done recovering, moves its shard to `IndexShardState.POST_RECOVERY`, and sends a message to the master that the shard is `ShardRoutingState.STARTED`.
3. The master receives the shard-started message and updates the cluster state so that `[index][0]` is `STARTED` on `node_2` only (see the sketch after this list).
4. The master sends this new cluster state to `node_1` and `node_2`.
5. `node_1` receives the new cluster state and removes its shard because it is not allocated on `node_1` anymore.
6. A document is indexed.
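To make step 3 concrete, here is a minimal sketch of the two routing-table views the nodes end up disagreeing about. This is illustrative Java only: `RoutingView` and the version numbers are invented for this example and are not Elasticsearch classes.

```java
// Illustration only: these types are hypothetical, not Elasticsearch code.
public class RoutingViews {

    // A node's view of who holds the primary of [index][0].
    record RoutingView(long clusterStateVersion, String primaryNode) {}

    // Cluster state 1 (before the update): primary still on node_1,
    // relocating to node_2; node_2's local shard copy is in POST_RECOVERY.
    static final RoutingView STATE_1 = new RoutingView(1, "node_1");

    // Cluster state 2 (after the update): primary STARTED on node_2;
    // node_1 no longer holds a copy of the shard.
    static final RoutingView STATE_2 = new RoutingView(2, "node_2");
}
```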
At this point `node_1` is already on cluster state 2 and does not have the shard anymore, so it forwards the request to `node_2`. But `node_2` is behind with cluster state processing: it is still on cluster state 1, therefore has the shard in `IndexShardState.POST_RECOVERY` and thinks `node_1` has the primary, so it sends the request back to `node_1`. This goes on until either `node_2` finally catches up and processes cluster state 2, or both nodes OOM.

I will make a pull request with a test shortly.
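The ping-pong itself can be sketched as a tiny simulation. This is hypothetical code, not the actual Elasticsearch transport layer: the point is that each node resolves the primary from its *own* cluster-state version, so the request keeps bouncing as long as the versions disagree.

```java
import java.util.Map;

// Hypothetical simulation of the forwarding loop; none of these names
// come from the actual Elasticsearch code base.
public class RerouteLoopDemo {

    // Which node each cluster-state version says holds the primary of [index][0].
    static final Map<Long, String> PRIMARY_BY_VERSION = Map.of(
            1L, "node_1",  // state 1: relocation still in progress
            2L, "node_2"); // state 2: relocation completed

    public static void main(String[] args) {
        long node1Version = 2; // node_1 has already applied cluster state 2
        long node2Version = 1; // node_2 still lags behind on cluster state 1

        String currentNode = "node_1"; // the indexing request arrives here
        int hops = 0;

        while (true) {
            // Each node consults only its own (possibly stale) cluster state.
            long version = currentNode.equals("node_1") ? node1Version : node2Version;
            String primary = PRIMARY_BY_VERSION.get(version);

            if (primary.equals(currentNode)) {
                System.out.println("indexed on " + currentNode + " after " + hops + " hops");
                return; // the bouncing only stops once the lagging node catches up
            }

            System.out.println(currentNode + " (state " + version + ") forwards to " + primary);
            currentNode = primary;
            hops++;

            if (hops == 6) {
                node2Version = 2; // node_2 finally processes cluster state 2
            }
        }
    }
}
```

Removing the catch-up at `hops == 6` reproduces the unbounded bouncing described above; judging by the commit message fragment quoted earlier ("…cal routing table"), the eventual fix in #16274 appears to involve how requests are routed against the local routing table.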