Lease checkpoints fix #13491
Conversation
The current checkpointing mechanism is buggy: new checkpoints for any lease are scheduled only until the first leader change. Added a fix for that and a test that checks it.
The second change extends the lease checkpointing mechanism to the case when the whole etcd cluster is restarted. If the etcd server has to restore its state from the raft log, all LeaseCheckpoint requests are applied to the server, regardless of the index value. This sets the remaining TTLs to the values from before the restart; otherwise, remaining TTLs would be reset to the initial TTLs after each cluster restart. Added an integration test to cover this case.
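For illustration only, a minimal self-contained sketch of the replay idea described above, with hypothetical types and names (this is not the actual etcd code): already-applied entries are normally skipped, but LeaseCheckpoint entries are re-applied because their effect exists only in memory and is lost on restart.

```go
package main

import "fmt"

type checkpoint struct {
	leaseID      int64
	remainingTTL int64 // seconds
}

type entry struct {
	index           uint64
	leaseCheckpoint *checkpoint // non-nil only for checkpoint entries
}

type server struct {
	consistentIndex uint64
	remaining       map[int64]int64 // leaseID -> remaining TTL, in-memory only
}

// apply skips entries that were already applied before the restart, except
// for lease checkpoints, whose effect lives only in memory and would
// otherwise be lost, resetting the lease to its initial TTL.
func (s *server) apply(e entry) {
	alreadyApplied := e.index <= s.consistentIndex
	if alreadyApplied && e.leaseCheckpoint == nil {
		return
	}
	if e.leaseCheckpoint != nil {
		s.remaining[e.leaseCheckpoint.leaseID] = e.leaseCheckpoint.remainingTTL
	}
	if !alreadyApplied {
		s.consistentIndex = e.index
	}
}

func main() {
	s := &server{consistentIndex: 100, remaining: map[int64]int64{}}
	// Entry 42 was applied before the restart, but replaying it still
	// restores the in-memory remaining TTL instead of the initial TTL.
	s.apply(entry{index: 42, leaseCheckpoint: &checkpoint{leaseID: 1, remainingTTL: 90}})
	fmt.Println(s.remaining[1]) // 90
}
```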
cc @hexfusion
Your PR includes 3 merge commits that unnecessarily complicate the git commit history. Can you clean them up? (Happy to help if you don't know how.)
One problem I see is inconsistent behavior depending on how much of the raft log is replayed. Checkpoints only affect state stored in memory and are not persistent, so the final TTL will depend on how much of the log is replayed. The etcd raft log is replayed from the last snapshot, which is triggered every 10,000 entries (by default; this can be changed), so whether any checkpoints (done every 5 minutes) appear after the last snapshot depends on how many proposals per second the cluster handles. Forcing a V3 apply for Checkpoint doesn't really solve the problem. Can you describe the scenarios where replaying from the raft log is needed? Which failure scenarios do we want to handle, and in which of them is replaying the raft log needed?
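For a rough sense of the trade-off described above (the numbers are illustrative, based on the defaults mentioned in the comment): with snapshots every 10,000 entries and checkpoints every 5 minutes, a cluster handling more than about 33 proposals per second (10,000 / 300 s) snapshots more often than it checkpoints, so the replayed tail of the log may contain no checkpoint at all; at lower throughput, several checkpoints may be replayed. Either way, the restored remaining TTL depends on the cluster's write rate rather than on the lease itself.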
@@ -446,6 +446,7 @@ func (le *lessor) Promote(extend time.Duration) {
 	l.refresh(extend)
 	item := &LeaseWithTime{id: l.ID, time: l.expiry}
 	le.leaseExpiredNotifier.RegisterOrUpdate(item)
+	le.scheduleCheckpointIfNeeded(l)
Found that commenting out this line doesn't break the tests. Can you add a test for this?
Yes. I wonder why `le.scheduleCheckpointIfNeeded(l)` in line 487 is not handling this case.
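For context on the call being discussed, a minimal self-contained sketch with hypothetical types (not the etcd lessor): the checkpoint queue is leader-local, so a node that just became leader must re-schedule a checkpoint for every lease it takes over; a test that drops leadership more than once and then inspects the remaining TTL would catch a missing call here.

```go
package main

import (
	"fmt"
	"time"
)

type lease struct {
	id     int64
	expiry time.Time
}

type lessor struct {
	leases          map[int64]*lease
	checkpointQueue []int64       // lease IDs with a checkpoint scheduled
	checkpointEvery time.Duration // cadence of checkpoints
}

// scheduleCheckpointIfNeeded mirrors the shape of the call added in this PR:
// only a lease that outlives the checkpoint interval is worth checkpointing.
func (le *lessor) scheduleCheckpointIfNeeded(l *lease) {
	if time.Until(l.expiry) > le.checkpointEvery {
		le.checkpointQueue = append(le.checkpointQueue, l.id)
	}
}

// Promote stands in for what runs on a node that just became leader: it
// refreshes every lease and, with this fix, also re-schedules its checkpoint,
// because the previous leader's checkpoint queue is gone.
func (le *lessor) Promote(extend time.Duration) {
	for _, l := range le.leases {
		l.expiry = time.Now().Add(extend)
		le.scheduleCheckpointIfNeeded(l)
	}
}

func main() {
	le := &lessor{
		leases:          map[int64]*lease{1: {id: 1}},
		checkpointEvery: 5 * time.Minute,
	}
	le.Promote(10 * time.Minute)
	fmt.Println(le.checkpointQueue) // [1]
}
```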
@@ -1901,6 +1901,11 @@ func (s *EtcdServer) applyEntryNormal(e *raftpb.Entry) {
 		s.w.Trigger(r.ID, s.applyV2Request((*RequestV2)(rp), shouldApplyV3))
 		return
 	}
+	if !shouldApplyV3 && raftReq.LeaseCheckpoint != nil {
This makes checkpoints depend on whether there was a checkpoint since the last snapshot. I think we should consider persisting the remainingTTL from checkpoints into the backend.
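A rough sketch of that suggestion with hypothetical types (not actual etcd code), assuming the persisted lease record gains a RemainingTTL field: a checkpoint writes the remaining TTL to the backend, so recovery after a restart no longer depends on how much of the raft log is replayed.

```go
package main

import "fmt"

// leaseRecord stands in for whatever etcd persists per lease; RemainingTTL
// is the field the comment above suggests adding.
type leaseRecord struct {
	TTL          int64 // initial TTL in seconds
	RemainingTTL int64 // 0 means "never checkpointed"
}

// backend stands in for the persistent lease store.
type backend map[int64]leaseRecord

// applyCheckpoint persists the checkpointed remaining TTL instead of keeping
// it only in memory.
func applyCheckpoint(b backend, id, remaining int64) {
	rec := b[id]
	rec.RemainingTTL = remaining
	b[id] = rec
}

// recoverTTL returns the TTL a freshly restarted server should use for a lease.
func recoverTTL(b backend, id int64) int64 {
	rec := b[id]
	if rec.RemainingTTL > 0 {
		return rec.RemainingTTL
	}
	return rec.TTL
}

func main() {
	b := backend{1: {TTL: 600}}
	applyCheckpoint(b, 1, 90)
	fmt.Println(recoverTTL(b, 1)) // 90, not 600
}
```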
I find the current layout scary. A single object (the lessor) that holds parts of the state:

And there is no protection against the post-raft code depending on the pre-raft code, so we risk introducing nondeterminism. IMHO we should split the lessor into 2-3 explicit objects, where 2. and 3. can only be mutated by post-raft code and do not depend on 1. in any way. In the same way, 3. cannot depend on 2. (as 2. is currently nondeterministic). Looking at what @jpbetz wanted (#9924), the goal was to minimize the number of raft proposals and writes performed during renewal. Without persisting ExpirationTTL as part of the state, we end up with a fuzzy definition of 2. If I understand Marek's proposal, it would actually merge 2. and 3.
Closing in favor of #13508.
This PR does 2 things:
Lease checkpointing is a mechanism (currently turned off by default) used to prevent lease TTLs from being reset to their initial value after each leader change.
1. Currently, checkpoints are scheduled only until the first leader change; after any subsequent leader change, the TTL is set back to the value from before the first one. To fix this, checkpoint scheduling is now forced for all leases after every leader change.
2. Even after fixing the first bug, lease checkpointing still stops working after a cluster restart, or in any situation where the server state has to be restored from the raft log. This is fixed by forcing the etcd server to apply all LeaseCheckpoint requests it processes, which in the case of a server restart also includes LeaseCheckpoint requests that were already applied before the restart.
Another possible solution to this problem would be to store scheduled checkpoints in the KV store.
Integration tests have been improved to cover both issues.