lease: Persist remainingTTL to prevent indefinite auto-renewal of long lived leases #9924

Merged: 4 commits, Jul 24, 2018

jpbetz
Contributor

@jpbetz jpbetz commented Jul 14, 2018

Fixes #9888 by introducing a "lease checkpointing" mechanism.

The basic idea is that for every lease with a TTL greater than 5 minutes, the remaining TTL is checkpointed every 5 minutes. If a new leader is elected, leases are then not auto-renewed to their full TTL, but only to the remaining TTL from the last checkpoint. A checkpoint is an entry persisted to the Raft consensus log that records the remainingTTL as determined by the leader when the checkpoint occurred.

If keep-alive is called on a lease that has been checkpointed, the remaining TTL is cleared by a checkpoint entry in the Raft consensus log with remainingTTL=0, indicating that it is unset and that the original TTL should be used.

All checkpointing is scheduled and performed by the leader, and when a new leader is elected, it takes over checkpointing as part of lease.Promote.

An advantage of this approach is that a lease on which keep-alive is called often still writes at most two entries to the Raft consensus log every 5 minutes: only the first keep-alive after a checkpoint must be recorded in the log, and all other keep-alives can be ignored.
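
To make the checkpoint and keep-alive interaction described above concrete, here is a minimal sketch in Go. It is not the code from this PR: the `lease` and `raftLog` types, the `effectiveTTL`, `checkpoint` and `keepAlive` names, and the entry counter are all hypothetical, and TTLs are plain seconds.

```go
package main

import "fmt"

// lease is a simplified, hypothetical stand-in for the lessor's lease type.
type lease struct {
	ttl          int64 // TTL granted by the client, in seconds
	remainingTTL int64 // last checkpointed remaining TTL; 0 means unset
}

// effectiveTTL is the TTL a newly promoted leader would apply to the lease.
func (l *lease) effectiveTTL() int64 {
	if l.remainingTTL > 0 {
		return l.remainingTTL
	}
	return l.ttl
}

// raftLog only counts the checkpoint entries that would be proposed.
type raftLog struct{ entries int }

// checkpoint records the remaining TTL computed by the leader (one log entry).
func (r *raftLog) checkpoint(l *lease, remaining int64) {
	l.remainingTTL = remaining
	r.entries++
}

// keepAlive clears a checkpointed TTL. Only the first call after a
// checkpoint writes an entry; later calls find nothing to clear.
func (r *raftLog) keepAlive(l *lease) {
	if l.remainingTTL == 0 {
		return // already unset, no log entry needed
	}
	l.remainingTTL = 0
	r.entries++
}

func main() {
	l := &lease{ttl: 3600}
	rl := &raftLog{}

	rl.checkpoint(l, 3300)        // checkpoint taken 5 minutes after the grant
	fmt.Println(l.effectiveTTL()) // 3300

	for i := 0; i < 100; i++ { // many keep-alives within one interval
		rl.keepAlive(l)
	}
	fmt.Println(l.effectiveTTL(), rl.entries) // 3600 2
}
```

One hundred keep-alives after a single checkpoint still produce only two log entries in this model: the checkpoint itself and the one entry that clears it.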

Additionally, to prevent this mechanism from degrading system performance, it is designed to be best effort. There is a limit on how many checkpoints can be persisted per second and on how many pending checkpoint operations can be scheduled. If these limits are reached, checkpoints may not be scheduled or written to the Raft consensus log. This keeps checkpointing from overwhelming the system, which could otherwise happen if large volumes of long-lived leases were granted.
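
The best-effort limits can be pictured with a small bounded queue. This is only an illustrative sketch: `checkpointQueue`, `schedule`, `drain` and the numbers used in `main` are hypothetical and do not come from this PR.

```go
package main

import "fmt"

// checkpointQueue is a hypothetical bounded queue of lease IDs awaiting a checkpoint.
type checkpointQueue struct {
	maxPending int     // cap on scheduled checkpoints
	pending    []int64 // lease IDs in scheduling order
}

// schedule enqueues a lease ID, or reports false when the queue is full.
// Dropping the checkpoint is acceptable because checkpointing is best effort.
func (q *checkpointQueue) schedule(id int64) bool {
	if len(q.pending) >= q.maxPending {
		return false
	}
	q.pending = append(q.pending, id)
	return true
}

// drain removes and returns at most budget lease IDs, modelling a
// per-second limit on how many checkpoints may be persisted.
func (q *checkpointQueue) drain(budget int) []int64 {
	if budget > len(q.pending) {
		budget = len(q.pending)
	}
	out := q.pending[:budget]
	q.pending = q.pending[budget:]
	return out
}

func main() {
	q := &checkpointQueue{maxPending: 3}
	dropped := 0
	for id := int64(1); id <= 5; id++ {
		if !q.schedule(id) {
			dropped++
		}
	}
	fmt.Println(dropped)         // 2
	fmt.Println(len(q.drain(2))) // 2
	fmt.Println(len(q.pending))  // 1
}
```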

cc @gyuho @wenjiaswe @jingyih

Contributor

@gyuho gyuho left a comment

I will have another look next week as well. One quick question from my first pass: if findDueScheduledCheckpoints returns multiple leases, have we thought about batching them all into one Raft request?

lease/lessor.go Outdated
@@ -57,6 +70,10 @@ type TxnDelete interface {
// RangeDeleter is a TxnDelete constructor.
type RangeDeleter func() TxnDelete

// Checkpointer permits checkpointing of lease remaining TTLs to the concensus log. Defined here to
Contributor

s/concensus/consensus/?

Contributor Author

Fixed, thanks!

lease/lessor.go Outdated
}

// checkpointScheduledLeases finds all scheduled lease checkpoints that are due and
// submits them to the checkpointer to persist them to the concensus log.
Contributor

s/concensus/consensus/ :)

@jpbetz
Contributor Author

jpbetz commented Jul 16, 2018

@gyuho Batching only briefly crossed my mind, but it's something we should clearly do. I'll add it shortly.
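
As a rough illustration of the batching idea, the following sketch groups due checkpoints so that each group could be proposed as a single request. The `leaseCheckpoint` type and the `batchCheckpoints` helper are hypothetical names chosen for this example, not the API added by the PR.

```go
package main

import "fmt"

// leaseCheckpoint is a hypothetical record of one lease's remaining TTL.
type leaseCheckpoint struct {
	ID           int64
	RemainingTTL int64
}

// batchCheckpoints groups due checkpoints into slices of at most batchSize,
// so that each slice can be proposed as a single request.
func batchCheckpoints(due []leaseCheckpoint, batchSize int) [][]leaseCheckpoint {
	if batchSize <= 0 {
		return nil
	}
	var batches [][]leaseCheckpoint
	for len(due) > 0 {
		n := batchSize
		if n > len(due) {
			n = len(due)
		}
		batches = append(batches, due[:n])
		due = due[n:]
	}
	return batches
}

func main() {
	due := make([]leaseCheckpoint, 2500)
	for i := range due {
		due[i] = leaseCheckpoint{ID: int64(i + 1), RemainingTTL: 300}
	}
	batches := batchCheckpoints(due, 1000)
	fmt.Println(len(batches), len(batches[2])) // 3 500
}
```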

lease/lessor.go Outdated
return l.remainingTTL
} else {
return l.ttl
}
Contributor

No need for the else? Just return l.ttl, following Go idioms? :)

Contributor Author

Sounds good! This is one of the hardest idioms to unlearn from other languages that do the opposite :)
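
For readers unfamiliar with the idiom under discussion, here is a sketch of the early-return form. The outdated snippet above shows only the two return statements, so the `remainingTTL > 0` condition, the `lease` type and the method name are assumptions made for this example.

```go
package main

import "fmt"

// lease is a hypothetical, reduced version of the type under review.
type lease struct {
	ttl          int64
	remainingTTL int64
}

// RemainingTTL returns the checkpointed TTL when it is set and otherwise
// falls through to the original TTL, with no else branch.
func (l *lease) RemainingTTL() int64 {
	if l.remainingTTL > 0 { // assumed condition; not visible in the diff excerpt
		return l.remainingTTL
	}
	return l.ttl
}

func main() {
	fmt.Println((&lease{ttl: 600}).RemainingTTL())                    // 600
	fmt.Println((&lease{ttl: 600, remainingTTL: 120}).RemainingTTL()) // 120
}
```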

@jpbetz jpbetz force-pushed the persist-lease-deadline branch 3 times, most recently from 9392bab to e463c07 Compare July 17, 2018 05:25
@jpbetz jpbetz changed the title from "[WIP] lease: Persist remainingTTL to prevent indefinite auto-renewal of long lived leases" to "lease: Persist remainingTTL to prevent indefinite auto-renewal of long lived leases" Jul 17, 2018
@jpbetz jpbetz removed the WIP label Jul 17, 2018
@jpbetz
Contributor Author

jpbetz commented Jul 17, 2018

Due to the size of this PR, I'll split it into three commits:

  • .proto change and resulting codegen
  • Lessor config and logging change
  • checkpointing mechanism

@jpbetz jpbetz force-pushed the persist-lease-deadline branch from e463c07 to ec26ef2 Compare July 17, 2018 20:22
@jpbetz
Contributor Author

jpbetz commented Jul 17, 2018

lease/lessor.go Outdated
return cps
}
heap.Pop(&le.leaseCheckpointHeap)
if l, ok := le.leaseMap[lt.id]; ok {
Contributor

we probably need to remove a few indentations here.

if l, ok := ...; !ok {
    continue
}
...

Contributor Author

Sounds good. I'll flatten this down.
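
A flattened loop along the lines of the suggestion might look like the sketch below. The `dueLeases` function and its slice-and-map inputs are hypothetical simplifications of the heap-based code under review. Note that in the shorthand `if l, ok := ...; !ok { ... }` the variable `l` is scoped to the if statement, so this sketch declares it on its own line to keep it usable afterwards.

```go
package main

import "fmt"

// lease is a hypothetical, reduced lease type.
type lease struct{ ttl int64 }

// dueLeases returns the leases for the given IDs, skipping IDs that are no
// longer present in the map (for example, because the lease was revoked).
func dueLeases(ids []int64, leaseMap map[int64]*lease) []*lease {
	var out []*lease
	for _, id := range ids {
		l, ok := leaseMap[id]
		if !ok {
			continue // lease is gone; nothing to checkpoint
		}
		// The main path stays at the loop's base indentation level.
		out = append(out, l)
	}
	return out
}

func main() {
	leaseMap := map[int64]*lease{1: {ttl: 600}, 3: {ttl: 900}}
	found := dueLeases([]int64{1, 2, 3}, leaseMap)
	fmt.Println(len(found)) // 2
}
```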

lease/lessor.go Outdated

// Limit the total number of scheduled checkpoints, checkpoint should be best effort and it is
// better to throttle checkpointing than to degrade performance.
maxScheduledCheckpoints = 10000
Contributor

how do we come up with these default values? have you done any benchmark?

would it be helpful if we make the checkpoint api accept multiple leases as a batch?

Contributor Author

@jpbetz jpbetz Jul 17, 2018

> how do we come up with these default values? have you done any benchmark?

Not yet, but I need to. I'm betting these numbers can be much higher. I'll do some benchmarking this week. For the scheduling, we just need to keep the heap size to some reasonable size, so I might look at typical etcd memory footprints and use that to help establish a limit that is based on the worst case memory utilization we're able to accept.

> would it be helpful if we make the checkpoint api accept multiple leases as a batch?

We just added the batching of lease checkpointing yesterday (proto change) per @gyuho's suggestion. Since this is not clear from how the leaseCheckpointRate constant is defined, I'll clear that up with some code changes. Maybe by defining a maxLeaseCheckpointBatchSize and using leaseCheckpointRate to define how many batched checkpoint operations can occur per second, which I might set quite low once we have batching.

@xiang90
Contributor

xiang90 commented Jul 17, 2018

The approach looks good to me. We need to have some benchmarks to show the overhead is acceptable in normal cases.

@jpbetz jpbetz force-pushed the persist-lease-deadline branch from ec26ef2 to 904b906 Compare July 17, 2018 22:07
@jpbetz
Contributor Author

jpbetz commented Jul 17, 2018

> The approach looks good to me. We need to have some benchmarks to show the overhead is acceptable in normal cases.

Thanks @xiang90. I'll post a full benchmark shortly.

@jpbetz jpbetz force-pushed the persist-lease-deadline branch from 904b906 to c939c0a Compare July 18, 2018 17:23
@codecov-io

codecov-io commented Jul 18, 2018

Codecov Report

Merging #9924 into master will increase coverage by 0.03%.
The diff coverage is 90%.

@@            Coverage Diff             @@
##           master    #9924      +/-   ##
==========================================
+ Coverage   68.99%   69.03%   +0.03%     
==========================================
  Files         386      386              
  Lines       35792    35891      +99     
==========================================
+ Hits        24695    24776      +81     
- Misses       9296     9300       +4     
- Partials     1801     1815      +14
| Impacted Files | Coverage Δ |
| --- | --- |
| etcdserver/config.go | 79.51% <ø> (ø) ⬆️ |
| lease/lease_queue.go | 100% <100%> (ø) ⬆️ |
| integration/cluster.go | 82.17% <100%> (+0.05%) ⬆️ |
| etcdserver/server.go | 73.6% <100%> (+0.05%) ⬆️ |
| clientv3/snapshot/v3_snapshot.go | 64.75% <100%> (ø) ⬆️ |
| etcdserver/apply.go | 88.87% <75%> (-0.19%) ⬇️ |
| lease/lessor.go | 87.62% <90.35%> (+0.83%) ⬆️ |
| client/keys.go | 73.86% <0%> (-17.59%) ⬇️ |
| pkg/tlsutil/tlsutil.go | 86.2% <0%> (-6.9%) ⬇️ |
| pkg/netutil/netutil.go | 63.11% <0%> (-6.56%) ⬇️ |
| ... and 20 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3f725e1...d1de41e.

index int
id LeaseID
// Unix nanos timestamp.
time int64
Contributor

@gyuho gyuho Jul 19, 2018

Can we comment this time field? It can be either expiration timestamp or checkpoint timestamp. Took me a while to find how time is used :)

Contributor Author

Looks like the field rename from expiration to time only got me from misleading to unclear. I'll add a comment and see if there is anything else I should do to make this more obvious.

Contributor Author

Added a couple comments to both lease_queue.go and the two places where the time field is used in lessor.go.

@jpbetz jpbetz force-pushed the persist-lease-deadline branch from c939c0a to 37b7484 Compare July 23, 2018 20:25
@jpbetz jpbetz force-pushed the persist-lease-deadline branch from 37b7484 to d1de41e Compare July 23, 2018 23:12
@jpbetz
Contributor Author

jpbetz commented Jul 23, 2018

@xiang90 @gyuho

Ran two benchmarks:

Checkpoint heap size Benchmark

Checked etcd server heap size up to 10,000,000 live leases.

  • With checkpointing 3.3GB
  • Without checkpointing 3.3GB

This makes sense given that the heap is a slice of structs that contain only three int64s, so the total memory usage for all the entries is only about 40MB or just a bit more than 1% of the total memory utilization. I've removed the limit on this heap as it does not seem to be needed.

Checkpoint rate limit Benchmark

Set leases to checkpoint every 1s, created 15k of them, and then checked server performance with benchmark put while the checkpointing is happening concurrently. This was with a 3-member etcd cluster on localhost.

  • Without checkpointing - write latency ~ 0.006ms
  • With checkpointing, maxLeaseCheckpointBatchSize=1 (no batching of checkpoints in RAFT log) - write latency ~0.015ms
  • With checkpointing, maxLeaseCheckpointBatchSize=1000 - write latency ~0.008ms
  • With checkpointing, maxLeaseCheckpointBatchSize=1000, leaseCheckpointRate=10000 - write latency ~0.008ms
  • With checkpointing, maxLeaseCheckpointBatchSize=1000, leaseCheckpointRate=1000 - write latency ~0.006ms

Since 1,000,000 checkpoints per sec seems sufficient, and the limits of maxLeaseCheckpointBatchSize=1000, leaseCheckpointRate=1000 appear to have negligible impact on performance, I've gone with those settings.
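
The arithmetic behind the chosen settings can be sketched as follows. The two constants use the names and values quoted above; the `entriesNeeded` helper is hypothetical and only illustrates how batching reduces the number of log entries for the 15k leases used in the benchmark.

```go
package main

import "fmt"

const (
	maxLeaseCheckpointBatchSize = 1000 // checkpoints per batched operation
	leaseCheckpointRate         = 1000 // batched checkpoint operations per second
)

// entriesNeeded returns how many batched log entries are needed to
// checkpoint n leases when each entry holds up to batchSize checkpoints.
func entriesNeeded(n, batchSize int) int {
	return (n + batchSize - 1) / batchSize // ceiling division
}

func main() {
	// Upper bound on checkpoints per second with the chosen settings.
	fmt.Println(maxLeaseCheckpointBatchSize * leaseCheckpointRate) // 1000000

	// Entries needed to checkpoint 15k leases, without and with batching.
	fmt.Println(entriesNeeded(15000, 1))                           // 15000
	fmt.Println(entriesNeeded(15000, maxLeaseCheckpointBatchSize)) // 15
}
```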

@gyuho
Contributor

gyuho commented Jul 23, 2018

> With checkpointing, maxLeaseCheckpointBatchSize=1000, leaseCheckpointRate=1000 - write latency ~0.006ms

Results look good to me. Thanks for benchmarks!

Contributor

@gyuho gyuho left a comment

lgtm /cc @xiang90

@gyuho
Contributor

gyuho commented Jul 23, 2018

@jpbetz Also, can you add this to CHANGELOG? Just separate commit or PR should be fine. Thanks.

@xiang90
Contributor

xiang90 commented Jul 24, 2018

LGTM

@keeplowkey

keeplowkey commented Dec 8, 2021

As a normal user, how do I use the "lease checkpointing" mechanism:
By upgrading the etcd server to a certain version, or by setting some specific parameters in the config file? I don't know.
Much thx for your help~ @jpbetz @xiang90 @gyuho

@serathius
Member

We have recently found that Lease Checkpointing doesn't work as intended in #13491. The fix is planned for release in v3.5.2. With this release you should be able to enable lease checkpointing by providing the --experimental-enable-lease-checkpoint and --experimental-enable-lease-checkpoint-persist flags.

@keeplowkey

> We have recently found that Lease Checkpointing doesn't work as intended in #13491. The fix is planned for release in v3.5.2. With this release you should be able to enable lease checkpointing by providing the --experimental-enable-lease-checkpoint and --experimental-enable-lease-checkpoint-persist flags.

@serathius If I want to enable lease checkpointing, will just starting the etcd server with the flag "--experimental-enable-lease-checkpoint true" do the trick? Current etcd version: 3.4.7. Looking forward to your reply.
