
Sequence-number-based replica allocation #46318

Closed
DaveCTurner opened this issue Sep 4, 2019 · 1 comment · Fixed by #46959
Labels
:Distributed Coordination/Allocation (all issues relating to the decision making around placing a shard, both master logic & on the nodes), >enhancement

Comments

@DaveCTurner
Contributor

When allocating a replica we prefer to allocate it on a node that already has a copy of the shard that is as close as possible to the primary, so that it is as cheap as possible to bring the new replica in sync with the primary. Indeed, if we find a copy that is identical to the primary then we cancel any ongoing recovery, on the grounds that such a copy needs no work to recover as a replica.

We determine "as close as possible" by comparing the files on disk between the primary and the replica, and "identical" by comparing the sync_id markers added by a synced flush. These mechanisms date back to before the introduction of sequence numbers, and do not always result in the best replica allocations in the presence of sequence-number-based recoveries. For instance, if two shard copies were allocated when the index was created then we do not expect them to have any segments in common; if, additionally, the copies have not been synced-flushed, then the ReplicaShardAllocator will consider them completely different even though they might differ by only a small number of operations and be very cheap to recover.
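
To make the limitation concrete, here is a minimal sketch of a purely file-based comparison. The classes and names are hypothetical stand-ins, not the actual ReplicaShardAllocator code: it scores a candidate copy by the bytes of segment files it shares with the primary, so two copies that share no segment files score zero however few operations actually separate them.

```java
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the store file metadata exchanged
// during replica allocation; not the real Elasticsearch classes.
class FileMeta {
    final String name;
    final String checksum;
    final long length;

    FileMeta(String name, String checksum, long length) {
        this.name = name;
        this.checksum = checksum;
        this.length = length;
    }
}

class FileBasedMatcher {
    /**
     * Score a candidate copy by the number of bytes in segment files that are
     * identical (same name and checksum) on the primary. Copies that were
     * created independently share no segments, so this score is 0 regardless
     * of how few operations actually separate them from the primary.
     */
    static long matchingBytes(Map<String, FileMeta> primaryFiles, List<FileMeta> candidateFiles) {
        long matching = 0;
        for (FileMeta candidate : candidateFiles) {
            FileMeta onPrimary = primaryFiles.get(candidate.name);
            if (onPrimary != null && onPrimary.checksum.equals(candidate.checksum)) {
                matching += candidate.length;
            }
        }
        return matching;
    }
}
```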

We can improve this by making the ReplicaShardAllocator sensitive to sequence numbers. In many cases we maintain a peer-recovery retention lease (#41536) for copies of a shard that could reasonably be recovered by copying missing operations, so we can use the existence of such a lease to decide that a shard copy will be cheap to recover. In other cases where a shard is read-only (frozen, closed, ...) we can use sequence number information to determine that two copies are identical.
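
As a rough illustration of the proposal (hypothetical names, not the real implementation), the allocator could treat a candidate copy as cheap to recover whenever the primary still holds a peer-recovery retention lease for it, and could use matching sequence-number information as one plausible way to decide that two copies of a read-only shard are identical.

```java
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical sketch of the proposed decision, not the actual
// ReplicaShardAllocator implementation; lease and checkpoint lookups are
// stand-ins for the information the primary would report.
class SeqNoBasedMatcher {

    /**
     * A copy is cheap to recover if the primary still holds a peer-recovery
     * retention lease for it: every operation above the lease's retained
     * sequence number is still available for an operations-based recovery.
     */
    static boolean cheapToRecover(Map<String, Long> retainedSeqNoByAllocationId, String candidateAllocationId) {
        return retainedSeqNoByAllocationId.containsKey(candidateAllocationId);
    }

    /**
     * For a read-only shard (frozen, closed, ...) no further operations will
     * arrive, so one plausible check is that a candidate copy reporting the
     * same maximum sequence number as the primary contains the same
     * operations and needs no recovery work.
     */
    static boolean identicalReadOnlyCopy(OptionalLong primaryMaxSeqNo, OptionalLong candidateMaxSeqNo) {
        return primaryMaxSeqNo.isPresent()
                && candidateMaxSeqNo.isPresent()
                && primaryMaxSeqNo.getAsLong() == candidateMaxSeqNo.getAsLong();
    }
}
```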

DaveCTurner added the >enhancement, :Distributed Coordination/Allocation, and 7x labels on Sep 4, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

dnhatn added a commit that referenced this issue Oct 13, 2019
With this change, shard allocation prefers allocating replicas on a node 
that already has a copy of the shard that is as close as possible to the
primary, so that it is as cheap as possible to bring the new replica in
sync with the primary. Furthermore, if we find a copy that is identical
to the primary then we cancel an ongoing recovery because the new copy
which is identical to the primary needs no work to recover as a replica.

With this improvement, we no longer need to perform a synced flush before a rolling
upgrade or full cluster restart.

Closes #46318
dnhatn added a commit that referenced this issue Oct 14, 2019

howardhuanghua pushed a commit to TencentCloudES/elasticsearch that referenced this issue Oct 14, 2019