lease: Persist remainingTTL to prevent indefinite auto-renewal of long lived leases #9924
Conversation
I will have another look next week as well. And just a quick question from the first pass: if findDueScheduledCheckpoints returns multiple leases, have we thought about batching them all in one raft request?
lease/lessor.go
@@ -57,6 +70,10 @@ type TxnDelete interface {
// RangeDeleter is a TxnDelete constructor.
type RangeDeleter func() TxnDelete

// Checkpointer permits checkpointing of lease remaining TTLs to the concensus log. Defined here to
s/concensus/consensus/
?
Fixed, thanks!
lease/lessor.go
}

// checkpointScheduledLeases finds all scheduled lease checkpoints that are due and
// submits them to the checkpointer to persist them to the concensus log.
s/concensus/consensus/
:)
@gyuho Batching only briefly crossed my mind, but it's something we should clearly do. I'll add it shortly.
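For illustration, a minimal Go sketch of what batching several due checkpoints into a single consensus-log proposal could look like; the type and field names are illustrative stand-ins, not the actual etcd proto or lessor API:

package main

import "fmt"

// leaseCheckpoint pairs a lease ID with its remaining TTL in seconds.
// These types are illustrative stand-ins, not the etcd proto definitions.
type leaseCheckpoint struct {
	id           int64
	remainingTTL int64
}

// checkpointRequest carries every due checkpoint in a single proposal, so the
// consensus log sees one entry instead of one entry per lease.
type checkpointRequest struct {
	checkpoints []leaseCheckpoint
}

func main() {
	due := []leaseCheckpoint{
		{id: 1, remainingTTL: 120},
		{id: 2, remainingTTL: 300},
	}
	req := checkpointRequest{checkpoints: due}
	// In the real server this would be proposed through raft; here we just print it.
	fmt.Printf("one proposal carrying %d checkpoints: %+v\n", len(req.checkpoints), req)
}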
lease/lessor.go
return l.remainingTTL
} else {
return l.ttl
}
no need for else? just return l.ttl, following Go idioms? :)
sounds good! This is one of the hardest idioms to unlearn from other languages that do the opposite :)
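For readers less familiar with the idiom being discussed, a small self-contained sketch, using a trimmed-down Lease stand-in rather than the real lessor type, of how the early return removes the need for an else branch:

package main

import "fmt"

// Lease is a trimmed-down stand-in for the lessor's lease type; only the
// fields needed for this example are shown.
type Lease struct {
	ttl          int64
	remainingTTL int64
}

// getRemainingTTL shows the suggested form: the early return makes an else
// branch unnecessary.
func (l *Lease) getRemainingTTL() int64 {
	if l.remainingTTL > 0 {
		return l.remainingTTL
	}
	return l.ttl
}

func main() {
	fmt.Println((&Lease{ttl: 60}).getRemainingTTL())                   // 60: no checkpoint yet
	fmt.Println((&Lease{ttl: 60, remainingTTL: 42}).getRemainingTTL()) // 42: checkpointed value
}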
Force-pushed 9392bab to e463c07.
Due to the size of this PR, I'll split it into three commits:
Force-pushed e463c07 to ec26ef2.
For review, note that the PR now has 4 separate commits:
The main change is c81870f, which is only +243 -45 lines.
lease/lessor.go
return cps
}
heap.Pop(&le.leaseCheckpointHeap)
if l, ok := le.leaseMap[lt.id]; ok {
we probably need to remove a few indentations here.
if l, ok := ...; !ok {
continue
}
...
Sounds good. I'll flatten this down.
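A rough sketch of the flattened loop shape being suggested, using simplified placeholder types rather than the lessor's real internals:

package main

import (
	"container/heap"
	"fmt"
)

type LeaseID int64

// leaseWithTime is a simplified heap item: a lease ID plus the Unix-nanos
// time at which its next checkpoint is due.
type leaseWithTime struct {
	id   LeaseID
	time int64
}

// leaseQueue is a min-heap ordered by due time.
type leaseQueue []*leaseWithTime

func (q leaseQueue) Len() int            { return len(q) }
func (q leaseQueue) Less(i, j int) bool  { return q[i].time < q[j].time }
func (q leaseQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *leaseQueue) Push(x interface{}) { *q = append(*q, x.(*leaseWithTime)) }
func (q *leaseQueue) Pop() interface{} {
	old := *q
	item := old[len(old)-1]
	*q = old[:len(old)-1]
	return item
}

// dueCheckpoints pops every item whose due time has passed, skipping leases
// that no longer exist via an early continue instead of a nested if body.
func dueCheckpoints(q *leaseQueue, leaseMap map[LeaseID]struct{}, now int64) []LeaseID {
	var due []LeaseID
	for q.Len() > 0 && (*q)[0].time <= now {
		lt := heap.Pop(q).(*leaseWithTime)
		if _, ok := leaseMap[lt.id]; !ok {
			continue // lease was revoked or expired; nothing to checkpoint
		}
		due = append(due, lt.id)
	}
	return due
}

func main() {
	q := &leaseQueue{{id: 1, time: 10}, {id: 2, time: 5}, {id: 3, time: 30}}
	heap.Init(q)
	leases := map[LeaseID]struct{}{1: {}, 3: {}} // lease 2 was revoked
	fmt.Println(dueCheckpoints(q, leases, 20))   // [1]; lease 2 skipped, lease 3 not due yet
}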
lease/lessor.go

// Limit the total number of scheduled checkpoints, checkpoint should be best effort and it is
// better to throttle checkpointing than to degrade performance.
maxScheduledCheckpoints = 10000
how do we come up with these default values? have you done any benchmarks?
would it be helpful if we make the checkpoint api accept multiple leases as a batch?
how do we come up with these default values? have you done any benchmarks?

Not yet, but I need to. I'm betting these numbers can be much higher. I'll do some benchmarking this week. For the scheduling, we just need to keep the heap to a reasonable size, so I might look at typical etcd memory footprints and use that to establish a limit based on the worst-case memory utilization we're able to accept.

would it be helpful if we make the checkpoint api accept multiple leases as a batch?

We just added batching of lease checkpointing yesterday (proto change) per @gyuho's suggestion. Since this is not clear from how the leaseCheckpointRate constant is defined, I'll clear that up with some code changes, maybe by defining a maxLeaseCheckpointBatchSize and using leaseCheckpointRate to define how many batched checkpoint operations can occur per second, which I might set quite low once we have batching.
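A hedged sketch of how a batch-size limit and a checkpoint rate limit might interact; the constant names mirror the ones discussed above, but the values shown and the surrounding function are purely illustrative, not the PR's actual code:

package main

import "fmt"

const (
	// Illustrative limits: at most this many lease checkpoints per batched
	// request, and at most this many batched requests per scheduling pass
	// (standing in for a per-second rate limit).
	maxLeaseCheckpointBatchSize = 1000
	leaseCheckpointRate         = 1000
)

// checkpointDueLeases splits the due lease IDs into batches of at most
// maxLeaseCheckpointBatchSize and stops once the rate budget is spent,
// leaving the remainder for a later pass (best effort).
func checkpointDueLeases(due []int64, propose func(batch []int64)) int {
	proposed := 0
	for len(due) > 0 && proposed < leaseCheckpointRate {
		n := len(due)
		if n > maxLeaseCheckpointBatchSize {
			n = maxLeaseCheckpointBatchSize
		}
		propose(due[:n])
		due = due[n:]
		proposed++
	}
	return proposed
}

func main() {
	due := make([]int64, 2500)
	for i := range due {
		due[i] = int64(i + 1)
	}
	n := checkpointDueLeases(due, func(batch []int64) {
		fmt.Printf("proposing batch of %d checkpoints\n", len(batch))
	})
	fmt.Println("requests proposed this pass:", n)
}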
The approach looks good to me. We need to have some benchmarks to show the overhead is acceptable in normal cases.
Force-pushed ec26ef2 to 904b906.
Thanks @xiang90. I'll post a full benchmark shortly.
Force-pushed 904b906 to c939c0a.
Codecov Report
@@ Coverage Diff @@
## master #9924 +/- ##
==========================================
+ Coverage 68.99% 69.03% +0.03%
==========================================
Files 386 386
Lines 35792 35891 +99
==========================================
+ Hits 24695 24776 +81
- Misses 9296 9300 +4
- Partials 1801 1815 +14
index int
id LeaseID
// Unix nanos timestamp.
time int64
Can we comment this time field? It can be either an expiration timestamp or a checkpoint timestamp. Took me a while to find how time is used :)
Looks like the field rename from expiration to time only got me from misleading to unclear. I'll add a comment and see if there is anything else I should do to make this more obvious.
Added a couple of comments to both lease_queue.go and the two places where the time field is used in lessor.go.
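For context, a commented sketch of roughly what such a queue item and comment might look like; the field set follows the snippet quoted above, but the details of the real lease_queue.go may differ:

package lease

// LeaseID identifies a lease; a simplified stand-in for the real type.
type LeaseID int64

// LeaseWithTime is a sketch of the item stored in the lessor's heaps.
type LeaseWithTime struct {
	id LeaseID
	// time is a Unix-nanos timestamp whose meaning depends on which queue the
	// item lives in: the lease's expiration time in the expiry heap, or the
	// time the next checkpoint is due in the checkpoint heap.
	time int64
	// index records the item's current position in the heap's backing slice;
	// the queue's Swap/Push implementations keep it up to date so items can
	// be fixed or removed efficiently.
	index int
}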
Force-pushed c939c0a to 37b7484.
Force-pushed 37b7484 to d1de41e.
Ran two benchmarks:
Checkpoint heap size benchmark: checked etcd server heap size with up to 10,000,000 live leases.
Checkpoint rate limit benchmark: set leases to checkpoint every 1s, created 15k of them, and then checked server performance with
Since 1,000,000 checkpoints per sec seems sufficient, and the limits of maxLeaseCheckpointBatchSize=1000, leaseCheckpointRate=1000 appear to have negligible impact on performance, I've gone with those settings.
Results look good to me. Thanks for the benchmarks!
lgtm /cc @xiang90
@jpbetz Also, can you add this to the CHANGELOG? A separate commit or PR should be fine. Thanks.
LGTM
We have recently found that lease checkpointing doesn't work as intended (#13491). A fix is planned for release in v3.5.2. With this release you should be able to enable lease checkpointing by providing
@serathius If I want to enable lease checkpointing, will just starting the etcd server with the flag "--experimental-enable-lease-checkpoint true" do the trick? Current etcd version: 3.4.7. Looking forward to your reply.
Fixes #9888 by introducing a "lease checkpointing" mechanism.
The basic idea is that for all leases with TTLs greater than 5 minutes, the remaining TTL is checkpointed every 5 minutes, so that if a new leader is elected, the leases are not auto-renewed to their full TTL but only to the remaining TTL from the last checkpoint. A checkpoint is an entry persisted to the RAFT consensus log that records the remainingTTL as determined by the leader when the checkpoint occurred.
If keep-alive is called on a lease that has been checkpointed, the remaining TTL is cleared by a checkpoint entry in the RAFT consensus log with remainingTTL=0, indicating that it is unset and that the original TTL should be used.
All checkpointing is scheduled and performed by the leader, and when a new leader is elected, it takes over checkpointing as part of lease.Promote. An advantage of this approach is that leases where keep-alive is called often still write at most two entries to the RAFT consensus log every 5 minutes, since only the first keep-alive after a checkpoint must be recorded; all other keep-alives can be ignored.
Additionally, to prevent this mechanism from degrading system performance, it is designed to be best effort. There is a limit on how many checkpoints can be persisted per second and on how many pending checkpoint operations can be scheduled. If these limits are reached, checkpoints may not be scheduled or written to the RAFT consensus log, to prevent the checkpointing operations from overwhelming the system, which could otherwise happen if large volumes of long-lived leases were granted.
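To make the described flow concrete, here is a heavily simplified, illustrative Go sketch of the leader-side behavior: long-lived leases get a checkpoint scheduled every 5 minutes, a checkpoint records the remaining TTL, and a keep-alive clears it back to the full TTL. The helper names and types are assumptions for illustration, not the PR's actual implementation:

package main

import (
	"fmt"
	"time"
)

const checkpointInterval = 5 * time.Minute // leases longer than this get checkpointed

// lease is a stand-in for the lessor's lease type.
type lease struct {
	id           int64
	ttl          time.Duration // original TTL granted to the client
	remainingTTL time.Duration // last checkpointed remaining TTL; 0 means "unset, use ttl"
	expiry       time.Time
}

// scheduleCheckpointIfNeeded is what the leader would do on grant/promote:
// only long-lived leases need periodic checkpoints.
func scheduleCheckpointIfNeeded(l *lease, now time.Time) (due time.Time, ok bool) {
	if l.ttl <= checkpointInterval {
		return time.Time{}, false
	}
	return now.Add(checkpointInterval), true
}

// checkpoint records the remaining TTL as the leader sees it; in etcd this
// would be proposed to the consensus log rather than set directly.
func checkpoint(l *lease, now time.Time) {
	l.remainingTTL = l.expiry.Sub(now)
}

// keepAlive refreshes the lease to its full TTL and clears the checkpointed
// value (persisted as remainingTTL=0), so the original TTL applies again.
func keepAlive(l *lease, now time.Time) {
	l.remainingTTL = 0
	l.expiry = now.Add(l.ttl)
}

func main() {
	now := time.Now()
	l := &lease{id: 1, ttl: time.Hour, expiry: now.Add(time.Hour)}
	if due, ok := scheduleCheckpointIfNeeded(l, now); ok {
		fmt.Println("next checkpoint due at", due.Format(time.RFC3339))
	}
	checkpoint(l, now.Add(10*time.Minute))
	fmt.Println("checkpointed remaining TTL:", l.remainingTTL)
	keepAlive(l, now.Add(11*time.Minute))
	fmt.Println("after keep-alive, remaining TTL reset to:", l.remainingTTL)
}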
cc @gyuho @wenjiaswe @jingyih