clientv3: KeepAlive() sends LeaseKeepAliveRequest every 500 ms when not consuming returned channel #7446
Comments
This is intentional. If the consumer misses the keepalive then it'll also miss the TTL. The client will send another keepalive so that it can deliver an up-to-date TTL to the consumer once it gets a response. Another option is to cache the TTL and decrement it by the amount of time the client waited until it could post to the channel.
Yeah, it should probably be clearer about that.
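(As a hypothetical sketch of that caching idea — the names here are invented for illustration, not taken from the etcd code base:)

```go
// cachedKeepAlive remembers the most recent server-reported TTL so that a
// slow consumer can still be handed a roughly accurate value.
type cachedKeepAlive struct {
	lastTTL    int64     // seconds, from the most recent keepalive response
	receivedAt time.Time // when that response arrived
}

// effectiveTTL returns the cached TTL minus the time the response has been
// waiting undelivered, clamped at zero.
func (c *cachedKeepAlive) effectiveTTL(now time.Time) int64 {
	waited := int64(now.Sub(c.receivedAt) / time.Second)
	if ttl := c.lastTTL - waited; ttl > 0 {
		return ttl
	}
	return 0
}
```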
@heyitsanthony Thank you for the reply. I understand the intention. If possible, I think it would be better to have another way to deliver up-to-date TTLs to consumers' channels, because when you have 10k+ clients holding leases, the 500 ms iteration can kill your cluster (I did, with code like the example above). The current behavior could be a small but critical trap for client users. As you suggested, handling this on the client side sounds better to me. Alternatively, I feel we could simply let consumers miss responses when they are busy or not consuming the channel.
You might want to try the gRPC proxy, which will effectively batch lease renew requests to reduce the load on the actual core servers. But, yeah, 500 ms might still be a problem.
Issue Observed
When user code issues client.KeepAlive() and does not consume the returned channel, *lessor.sendKeepAliveLoop() sends a &etcdserverpb.LeaseKeepAliveRequest every 500 milliseconds, not respecting its TTL (and eventually overloading the cluster).

Example code:
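(The example code from the original report is not preserved here; below is a minimal reproduction sketch, assuming an etcd endpoint at localhost:2379 and the v3.1 clientv3 API.)

```go
package main

import (
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
	"golang.org/x/net/context"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Grant a lease with a 60-second TTL.
	grant, err := cli.Grant(context.Background(), 60)
	if err != nil {
		log.Fatal(err)
	}

	// Start keepalives, but never read from the returned channel.
	// With the behavior described above, the client ends up sending a
	// LeaseKeepAliveRequest every 500 ms instead of roughly every TTL/3.
	if _, err := cli.KeepAlive(context.Background(), grant.ID); err != nil {
		log.Fatal(err)
	}

	select {} // block forever without consuming keepalive responses
}
```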
Possible Cause
In *lessor.recvKeepAlive(), the logic does not update ka.nextKeepAlive when there are no writable channels attached to the ka, which means nobody is consuming the channels, or the consumer goroutines are just too slow.
https://github.com/coreos/etcd/blob/master/clientv3/lease.go#L418
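Paraphrased from the linked lines (not a verbatim copy), the relevant logic looks roughly like this:

```go
// Inside *lessor.recvKeepAlive(), after a keepalive response arrives:
nextKeepAlive := time.Now().Add(time.Duration(karesp.TTL/3) * time.Second)
for _, ch := range ka.chs {
	select {
	case ch <- karesp:
		// nextKeepAlive is recorded only when a consumer accepts the response.
		ka.nextKeepAlive = nextKeepAlive
	default:
		// Channel buffer is full (nobody is reading): ka.nextKeepAlive
		// keeps its old value and remains in the past.
	}
}
```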
However, *lessor.sendKeepAliveLoop() uses ka.nextKeepAlive.Before(now) to pick the KeepAlives to be handled in each iteration. Therefore, if the nextKeepAlive of a KeepAlive is never updated, that KeepAlive is picked up on every iteration, ignoring its TTL value.
https://github.com/coreos/etcd/blame/master/clientv3/lease.go#L454
Possible Resolution
I moved the line ka.nextKeepAlive = nextKeepAlive out of the loop so that it is always evaluated, because I think there is no reason not to update nextKeepAlive even if the channel is not writable. The issue seems to disappear with this small change.

If the current behavior is intended, I think it's better to document that users must consume the channel, because requests issued every 500 ms can be a critical performance issue in a large environment.
etcd server version: v3.1.2
etcd client version: v3.1.2
OS: Linux (Ubuntu 16.04)