clientv3: KeepAlive() sends LeaseKeepAliveRequest every 500 ms when not consuming returned channel #7446
Comments
This is intentional. If the consumer misses the keepalive then it'll also miss the TTL. The client will send another keepalive so that it can deliver an up-to-date TTL to the consumer once it gets a response. Another option is to cache the TTL and decrement it by the amount of time the client waited until it could post to the channel.
Yeah, it should probably be clearer about that.
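(As a hypothetical sketch of that caching idea — the names here are invented for illustration, not taken from the etcd code base:)

```go
// cachedKeepAlive remembers the most recent server-reported TTL so that a
// slow consumer can still be handed a roughly accurate value.
type cachedKeepAlive struct {
	lastTTL    int64     // seconds, from the most recent keepalive response
	receivedAt time.Time // when that response arrived
}

// effectiveTTL returns the cached TTL minus the time the response has been
// waiting undelivered, clamped at zero.
func (c *cachedKeepAlive) effectiveTTL(now time.Time) int64 {
	waited := int64(now.Sub(c.receivedAt) / time.Second)
	if ttl := c.lastTTL - waited; ttl > 0 {
		return ttl
	}
	return 0
}
```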
@heyitsanthony Thank you for the reply. I understand the intention. If possible, I think it would be better to have another way to deliver up-to-date TTLs to consumers' channels, because when you have 10k+ clients holding leases, the 500 ms iteration can kill your cluster (I did, with code like the example above). The current behavior could be a small but critical trap for client users. As you suggested, handling this on the client side sounds better to me. Alternatively, I feel we could simply let consumers miss responses when they are busy or not consuming the channel.
You might want to try the gRPC proxy, which will effectively batch lease renew requests to reduce the load on the actual core servers. But, yeah, 500 ms might still be a problem.
Issue Observed
When user code issues client.KeepAlive() and does not consume the returned channel, *lessor.sendKeepAliveLoop() sends a &etcdserverpb.LeaseKeepAliveRequest every 500 milliseconds, not respecting its TTL (and eventually overloading the cluster).

Example code:
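(The example code from the original report is not preserved here; below is a minimal reproduction sketch, assuming an etcd endpoint at localhost:2379 and the v3.1 clientv3 API.)

```go
package main

import (
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
	"golang.org/x/net/context"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Grant a lease with a 60-second TTL.
	grant, err := cli.Grant(context.Background(), 60)
	if err != nil {
		log.Fatal(err)
	}

	// Start keepalives, but never read from the returned channel.
	// With the behavior described above, the client ends up sending a
	// LeaseKeepAliveRequest every 500 ms instead of roughly every TTL/3.
	if _, err := cli.KeepAlive(context.Background(), grant.ID); err != nil {
		log.Fatal(err)
	}

	select {} // block forever without consuming keepalive responses
}
```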
Possible Cause
In *lessor.recvKeepAlive(), the logic does not update ka.nextKeepAlive when there are no writable channels attached to the ka, which means nobody is consuming the channels, or the consumer goroutines are just too slow.
https://github.com/coreos/etcd/blob/master/clientv3/lease.go#L418
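Paraphrased from the linked lines (not a verbatim copy), the relevant logic looks roughly like this:

```go
// Inside *lessor.recvKeepAlive(), after a keepalive response arrives:
nextKeepAlive := time.Now().Add(time.Duration(karesp.TTL/3) * time.Second)
for _, ch := range ka.chs {
	select {
	case ch <- karesp:
		// nextKeepAlive is recorded only when a consumer accepts the response.
		ka.nextKeepAlive = nextKeepAlive
	default:
		// Channel buffer is full (nobody is reading): ka.nextKeepAlive
		// keeps its old value and remains in the past.
	}
}
```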
However, *lessor.sendKeepAliveLoop() uses ka.nextKeepAlive.Before(now) to pick the KeepAlives to be handled in each iteration. Therefore, if the nextKeepAlive of a KeepAlive is never updated, that KeepAlive is picked up on every iteration, ignoring its TTL value.
https://github.com/coreos/etcd/blame/master/clientv3/lease.go#L454
Possible Resolution
I moved the line ka.nextKeepAlive = nextKeepAlive out of the loop so that it is always evaluated, because I think there is no reason not to update nextKeepAlive even if the channel is not writable. The issue seems to disappear with this small change.

If the current behavior is intended, I think it's better to document that users must consume the channel, because requests issued every 500 ms can be a critical performance issue in a large environment.
etcd server version: v3.1.2
etcd client version: v3.1.2
OS: Linux (Ubuntu 16.04)