Migrate peer recovery from translog to retention lease #49448
Pinging @elastic/es-distributed (:Distributed/Recovery)
Hmm, a new test is failing. I am looking at it.
qa/full-cluster-restart/src/test/java/org/elasticsearch/upgrades/FullClusterRestartIT.java
I have an implementation that falls back to the translog if an index was created before 7.4 and the recovering replica does not have a PRRL. I think we should disable translog retention once every copy has established its PRRL. However, this would require coordination. Another option is to make this decision locally. We also need to persist this decision so that we won't re-enable translog retention in a full cluster restart. WDYT?
ReplicationTracker already has this field
Please hold off on the review, as the test failure relates to this change. I will ping after I have resolved it.
run elasticsearch-ci/packaging-sample-matrix
Great work, Nhat! Overall looking very good already. I've left some minor comments.
server/src/main/java/org/elasticsearch/index/seqno/ReplicationTracker.java
server/src/main/java/org/elasticsearch/index/shard/IndexShard.java
LGTM
@ywelsch Thanks for reviewing.
We turn off the translog retention policy asynchronously using the generic threadpool; hence, we need to assert busily here. Relates #49448
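To illustrate what "assert busily" means here, the following is a minimal, self-contained sketch of a retrying assertion. It is a stand-in written for this example, not the test framework's own helper; the class name `BusyAssertSketch`, the `demo` method, and the `retentionDisabled` flag are all hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a busy-assert helper; not the Elasticsearch test framework code.
final class BusyAssertSketch {

    // Re-runs the assertion with a growing back-off until it passes or the timeout elapses.
    static void assertBusy(Runnable assertion, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        long sleepMillis = 1;
        while (true) {
            try {
                assertion.run();
                return; // the assertion held, so we are done
            } catch (AssertionError e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // out of time: surface the last failure
                }
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting", ie);
            }
            sleepMillis = Math.min(sleepMillis * 2, 100);
        }
    }

    // Simulates a setting that is flipped on another thread, then waits for it busily.
    static boolean demo() {
        AtomicBoolean retentionDisabled = new AtomicBoolean(false);
        CompletableFuture.runAsync(() -> retentionDisabled.set(true));
        assertBusy(() -> {
            if (!retentionDisabled.get()) {
                throw new AssertionError("translog retention is still enabled");
            }
        }, 5_000);
        return retentionDisabled.get();
    }
}
```

A plain one-shot assertion could run before the background task has finished, which is why the check is retried until a deadline.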
Since 7.4, we have switched from the translog to Lucene as the source of history for peer recoveries. However, this reduces the likelihood of operation-based recoveries when performing a full cluster restart from pre-7.4, because existing copies do not have a peer recovery retention lease (PRRL). To remedy this issue, we fall back to using the translog in peer recoveries if the recovering replica does not have a PRRL and the replication group hasn't fully migrated to PRRLs. Relates #45136
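The fallback rule described above can be sketched as a small decision function. This is an illustration only, assuming the two conditions are available as booleans; the class, enum, and method names are hypothetical and are not taken from the actual change.

```java
// Hypothetical sketch of the fallback rule; names are illustrative, not the real implementation.
final class HistorySourceSketch {

    enum HistorySource { TRANSLOG, LUCENE }

    /**
     * @param replicaHasLease    whether the recovering replica already holds a peer recovery retention lease
     * @param groupFullyMigrated whether every copy in the replication group has established its lease
     *                           (assumed here to be a flag that is persisted once it becomes true)
     */
    static HistorySource chooseHistorySource(boolean replicaHasLease, boolean groupFullyMigrated) {
        // Fall back to the translog only while the migration is incomplete
        // and this particular replica has no lease to recover from.
        if (!replicaHasLease && !groupFullyMigrated) {
            return HistorySource.TRANSLOG;
        }
        return HistorySource.LUCENE;
    }
}
```

Under this sketch, a replica without a lease still uses Lucene once the group has fully migrated, so the translog path only applies during the transition from pre-7.4.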
We need to make sure that the global checkpoints and peer recovery retention leases have advanced to the max_seq_no and been synced; otherwise, we risk expiring some peer recovery retention leases because of the file-based recovery threshold. Relates #49448