Poor performance and cluster stability when lots of leases are expiring. #9360
We were able to observe this behavior outside of our application code by utilizing the following simple Go program, pointed at a local etcd endpoint:

```go
package main

import (
	"context"
	"log"
	"sync"
	"time"

	"github.com/coreos/etcd/clientv3"
	uuid "github.com/nu7hatch/gouuid"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	log.Printf("Start")

	var wg sync.WaitGroup
	for k := 0; k < 200; k++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 3000; i++ {
				// Grant a 5-second lease for each key.
				resp, err := cli.Grant(context.TODO(), 5)
				if err != nil {
					log.Fatal(err)
				}
				u, err := uuid.NewV4()
				if err != nil {
					log.Fatal(err)
				}
				// Attach the lease so the key is removed when the lease expires.
				_, err = cli.Put(context.TODO(), u.String(), "bar", clientv3.WithLease(resp.ID))
				if err != nil {
					log.Fatal(err)
				}
			}
		}()
	}
	wg.Wait()
	log.Printf("End")
}
```
What is the version of your etcd server?
@mgates How do all these leases expire at the same time? Did you restart the machine?
We've been using 3.2, but we've tested on 3.3 and had the same issue.
The leases mostly expire one minute after creation, so at different times but with the same TTL. We've certainly done rolling restarts of etcd, but we weren't doing one during the times when we've seen this.
How many leases are there in total? Reproducible steps would be helpful.
In our application usage, keys being set are the result of many concurrent API requests. An analogy would be a reservation system where locks are taken out on a resource, and if that resource is not released, we rely on lease expiration to enforce a TTL on the lock. In this way we naturally have many lease TTLs that expire concurrently without the attached keys being related.
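For illustration, here is a minimal sketch of that reservation pattern, assuming a connected clientv3 client; the key prefix, value, and helper names are hypothetical placeholders, not the reporters' actual code:

```go
package reservation

import (
	"context"

	"github.com/coreos/etcd/clientv3"
)

// reserve writes a reservation key guarded by a lease TTL. If the holder never
// releases it, the server removes the key when the lease expires. The key
// prefix, value, and TTL handling are illustrative placeholders.
func reserve(ctx context.Context, cli *clientv3.Client, resource string, ttl int64) (clientv3.LeaseID, error) {
	lease, err := cli.Grant(ctx, ttl)
	if err != nil {
		return 0, err
	}
	_, err = cli.Put(ctx, "reservations/"+resource, "held", clientv3.WithLease(lease.ID))
	if err != nil {
		return 0, err
	}
	return lease.ID, nil
}

// release actively revokes the lease, deleting the reservation key immediately
// instead of waiting for server-side expiration.
func release(ctx context.Context, cli *clientv3.Client, id clientv3.LeaseID) error {
	_, err := cli.Revoke(ctx, id)
	return err
}
```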
TL;DR: We notice much worse performance when we allow the server to expire leases than when we revoke them actively. We're trying to determine whether using leases as a way to expire long-lived keys is an anti-pattern or whether we have encountered an unexpected performance bug.

In our production system, we were doing approximately 1200 gRPC operations per second. Lease grants went from approximately 400 to 620 per second, with those additional 220 per second being leases that were not revoked. The lease durations ranged from 60 seconds to 8 hours. During this time our p99 request duration went through the roof.

We wanted to validate this by running a very minimal test outside of our application code and production infrastructure. We set up a cluster of 5 m3.large EC2 instances running 3.2.0, with the data directory backed by tmpfs. We initially had EBS volumes with 2000 provisioned IOPS and wanted to use the ramdisk to eliminate the possibility of disk being the source of the issue, but found no real difference.

The test ran 600,000 total loops, each granting a 5-second lease and then setting a random key with value "bar" attached to that lease. Total run time was 8 minutes. Lease grants peaked at 1893 per second and fell off to about 1300. Server-side lease expiration peaked at 830 per second, and server-side expirations continued for another 8 minutes after all operations sent to the cluster had stopped. 3240 requests failed, and we received several hundred "apply entry took too long" messages in the logs.

This was all done using the very simple Go application provided above as a model for our real-world application behavior. Hopefully this is useful information towards reproducing the issue; we will try to provide something more turnkey as a reproduction test tomorrow.
@jcalvert Do you have any concurrent Lease API workloads (e.g. Lease Lookup, Lease List)? Or just overlapping Grant and Revoke?
@gyuho No lookup, list, or heartbeat.
The v3 API is not measured by that metric.
@xiang90 We saw quite a bit of movement on that metric. What would be a more appropriate metric to measure request time, then? We saw an equivalent rise in our application-side measurement, where we wrap each request to etcd in a statsd timer.
@jcalvert For v3, we have gRPC request metrics that measure this instead.
Looking at the comment here, it seems there is a hard limit on the lease revoke rate, which may be related to the behavior we're seeing.
@jcalvert That limit is for restricting lease revoke spikes when a node restarts. This issue seems to be about something else; I need to investigate more to find where the bottlenecks are. Is this #9360 (comment) a good way to reproduce your issue?
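For intuition, a rough sketch (an illustration only, not etcd's internal lessor code) of what capping revocations per tick does: a large backlog of expired leases is worked off gradually rather than in one spike.

```go
package revokecap

import "time"

// drainWithCap revokes at most perTick lease IDs per tick, so a backlog of
// expired leases accumulated while a node was down is worked off gradually
// instead of in one spike. revoke stands in for whatever performs the actual
// revocation; everything here is hypothetical.
func drainWithCap(expired []int64, perTick int, tick time.Duration, revoke func(int64)) {
	t := time.NewTicker(tick)
	defer t.Stop()
	for len(expired) > 0 {
		n := perTick
		if n > len(expired) {
			n = len(expired)
		}
		for _, id := range expired[:n] {
			revoke(id)
		}
		expired = expired[n:]
		if len(expired) > 0 {
			<-t.C // wait before revoking the next batch
		}
	}
}
```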
@gyuho Yes, although I believe it is more pronounced if the lease time is extended to 5 minutes.
@jcalvert Ok, thanks. I will double-check our code base.
Hey folks, I got some numbers. These are all on a one-node cluster with a ramdisk backing the data directory, and with the test running on the same VPS, so we shouldn't see network or disk latency impacting them. I restarted the node and cleared out the data dir before each run. The first run revoked each lease right away (https://gist.github.com/mgates/f79cbccf9f61ae5fd9d5c85d0b984b41). With a 5-second lease, where we deleted the key (https://gist.github.com/mgates/b866ab74625f4dd83eb6ebc20d988eff) but let the lease expire on its own, it took nine and a half minutes and we had 225 of the "apply entry took too long" messages. Increasing the lease to 60 seconds took 10.5 minutes and produced 441 of those messages. I can get you the full test logs if you want, or get more detail about how long the long-running actions took if it would help.
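For reference, the two variants being benchmarked look roughly like this; a sketch assuming the same clientv3 setup as the program above, with the linked gists being the authoritative versions:

```go
package variants

import (
	"context"

	"github.com/coreos/etcd/clientv3"
)

// Variant 1: revoke the lease as soon as the key is no longer needed.
func putThenRevoke(ctx context.Context, cli *clientv3.Client, key string, ttl int64) error {
	lease, err := cli.Grant(ctx, ttl)
	if err != nil {
		return err
	}
	if _, err := cli.Put(ctx, key, "bar", clientv3.WithLease(lease.ID)); err != nil {
		return err
	}
	// Active revocation: the key and the lease are gone immediately.
	_, err = cli.Revoke(ctx, lease.ID)
	return err
}

// Variant 2: delete the key but leave the lease to expire on the server.
func putThenLetExpire(ctx context.Context, cli *clientv3.Client, key string, ttl int64) error {
	lease, err := cli.Grant(ctx, ttl)
	if err != nil {
		return err
	}
	if _, err := cli.Put(ctx, key, "bar", clientv3.WithLease(lease.ID)); err != nil {
		return err
	}
	// The key is removed now, but the lease lingers until its TTL elapses
	// and the server expires it.
	_, err = cli.Delete(ctx, key)
	return err
}
```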
Sure, full server logs plus the test program's output would be helpful.
Here you go; let me know if you need more data or anything: https://gist.github.com/mgates/bcf10617030922980048dfa08f6f208f
@mgates Thanks for the logs. Will take a look. /cc @ximenzaoshi
Hey folks, we were thinking of taking a stab at implementing the backing heap suggested by the comment in https://github.com/coreos/etcd/blob/master/lease/lessor.go#L123. We just wanted to check whether you had made any headway in a different direction before we dive in.
Hey folks, we took a stab at some profiling and exploratory coding. It looks like a lot of CPU time is spent scanning the leases map to find expired leases.
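A simplified illustration of why that scan is costly (not etcd's actual lessor code): with a flat map, every expiry check walks all leases, so each pass is O(total leases) even when only a few are due.

```go
package scanexample

import "time"

type lease struct {
	id     int64
	expiry time.Time
}

// findExpired walks the entire map on every check, so the cost of each pass is
// proportional to the total number of leases, even when only a handful are due.
func findExpired(leases map[int64]*lease, now time.Time) []*lease {
	var expired []*lease
	for _, l := range leases { // full scan on every expiry check
		if now.After(l.expiry) {
			expired = append(expired, l)
		}
	}
	return expired
}
```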
We went ahead and implemented a heap to serve as a priority queue of leases to expire (as was suggested by the comment left in the code), and our performance for setting 200,000 keys with leases went up 25%; all of our slow-request warnings also disappeared from the logs. We're going to get some more benchmarks tomorrow, and we'll hopefully have a pull request later this week, but we understand if you want to let it sit while the bug hunting continues.
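A minimal sketch of the heap-based approach described above (illustrative only, not the actual patch): keeping leases in a min-heap ordered by expiry means an expiry pass only touches the items that are actually due.

```go
package heapexample

import (
	"container/heap"
	"time"
)

type item struct {
	id     int64
	expiry time.Time
}

// expiryHeap is a min-heap ordered by expiry time.
type expiryHeap []item

func (h expiryHeap) Len() int            { return len(h) }
func (h expiryHeap) Less(i, j int) bool  { return h[i].expiry.Before(h[j].expiry) }
func (h expiryHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *expiryHeap) Push(x interface{}) { *h = append(*h, x.(item)) }
func (h *expiryHeap) Pop() interface{} {
	old := *h
	n := len(old)
	it := old[n-1]
	*h = old[:n-1]
	return it
}

// popExpired removes and returns the IDs of every lease whose expiry is at or
// before now; it stops as soon as it sees a lease that is still live, so it
// never inspects leases that are not due.
func popExpired(h *expiryHeap, now time.Time) []int64 {
	var ids []int64
	for h.Len() > 0 && !(*h)[0].expiry.After(now) {
		ids = append(ids, heap.Pop(h).(item).id)
	}
	return ids
}
```

Leases would be added with heap.Push(h, item{id, expiry}) at grant time; renewals would need to update or re-push the corresponding entry.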
@mgates Sorry for the delay. And thanks for working on the improvements!
We're seeing similar issues on etcd clusters serving Kubernetes events, where the lease count is in the 200k+ range. xref kubernetes/kubernetes#47532
@jpbetz Just a side note: we should really try not to create so many leases in Kubernetes, since it is not necessary.
@xiang90 I completely agree. At the very least we could improve this by using a lease per time bucket (e.g. one minute) and attaching that lease to all events that should expire in that bucket. This would reduce the default lease count to a fixed number (e.g. 60 total if we did 1-minute buckets for a 1-hour event TTL).
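A sketch of that bucketing idea, assuming a clientv3 client; the type and method names are hypothetical, and stale buckets would need pruning in a real implementation.

```go
package bucketed

import (
	"context"
	"sync"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// leaseBucket hands out one lease per expiration minute, so many keys whose
// TTLs round to the same minute share a single lease instead of each getting
// their own. Stale buckets are never pruned here; a real implementation would
// need to clean them up.
type leaseBucket struct {
	mu     sync.Mutex
	cli    *clientv3.Client
	leases map[int64]clientv3.LeaseID // unix time of the bucket -> lease ID
}

func newLeaseBucket(cli *clientv3.Client) *leaseBucket {
	return &leaseBucket{cli: cli, leases: make(map[int64]clientv3.LeaseID)}
}

// leaseFor returns the shared lease for the minute bucket the given TTL falls
// into, granting it on first use.
func (b *leaseBucket) leaseFor(ctx context.Context, ttl time.Duration) (clientv3.LeaseID, error) {
	expiry := time.Now().Add(ttl).Truncate(time.Minute)
	bucket := expiry.Unix()

	b.mu.Lock()
	defer b.mu.Unlock()
	if id, ok := b.leases[bucket]; ok {
		return id, nil // reuse the lease already granted for this bucket
	}
	// Grant a lease long enough to cover the whole bucket, with a small buffer.
	resp, err := b.cli.Grant(ctx, int64(time.Until(expiry)/time.Second)+60)
	if err != nil {
		return 0, err
	}
	b.leases[bucket] = resp.ID
	return resp.ID, nil
}
```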
Hello @mgates! I'm wondering if there is a chance to get a pull request from your side? :)
Let's close it, and revisit if any other lease performance issue arises.
We've been using etcd v3 leases that we let expire as a way to add a TTL to keys, and we've been noticing very poor performance with lease expiration versus manual key deletion. This is true both on our main 5-node cluster and on a single-node cluster we set up for testing. In particular, when lots of leases are expiring, request time, especially p99 request time, goes through the roof, and we end up with a lot of request timeouts.
It also seems that the cluster gets behind and leases don't expire on time either; keys are still there but have a negative TTL.
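One way to observe that lag, as a sketch assuming the key was written with an attached lease: read the key, then ask the server for the remaining TTL of its lease; while the cluster is behind, the key is still returned even though the reported TTL has gone negative.

```go
package ttlcheck

import (
	"context"
	"log"

	"github.com/coreos/etcd/clientv3"
)

// checkKeyTTL fetches a key and reports the remaining TTL of the lease it is
// attached to. Per the report above, during heavy expiration the key can still
// be returned while the reported remaining TTL is already negative.
func checkKeyTTL(ctx context.Context, cli *clientv3.Client, key string) {
	resp, err := cli.Get(ctx, key)
	if err != nil {
		log.Fatal(err)
	}
	if len(resp.Kvs) == 0 {
		log.Printf("key %q not found", key)
		return
	}
	leaseID := clientv3.LeaseID(resp.Kvs[0].Lease)
	ttlResp, err := cli.TimeToLive(ctx, leaseID)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("key %q still present, lease %x remaining TTL: %ds", key, leaseID, ttlResp.TTL)
}
```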
We saw this in our production cluster (5 8-CPU VMs), where we paralyzed the cluster with about 300-400 lease expirations per second. It's able to handle well over that in set and delete requests.
We'd be happy to run any tests or gather any data that might be helpful.