Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clientv3: KeepAlive() sends LeaseKeepAliveRequest every 500 ms when not consuming returned channel #7446

Closed
yudai opened this issue Mar 7, 2017 · 3 comments

Comments

@yudai
Copy link
Contributor

yudai commented Mar 7, 2017

Issue Observed

When user code issues client.KeepAlive() and does not consume the returned channel, *lessor.sendKeepAliveLoop() sends &etcdserverpb.LeaseKeepAliveRequest every 500 milli seconds, not respecting its TTL (and eventually overload the cluster).

Example Code:

grant, err := etcd.Grant(ctx, 60)
...snip...
 _, err = etcd.KeepAlive(ctx, grant.ID) // just ignoring the returned channel

Possible Cause

In *lessor.recvKeepAlive(), the logic does not update ka.nextKeepAlive when there is no writable channels attached to the ka, which means no body is consuming the channels, or the consumer goroutines are just too slow.

https://github.com/coreos/etcd/blob/master/clientv3/lease.go#L418

However, *lessor.sendKeepAliveLoop() uses ka.nextKeepAlive.Before(now) to pick up KeepAlives to be handled in each iteration. Therefore if the nextKeepAlive of a KeepAlive is not being updated, the KeepAlive is picked up every iteration, ignoring its TTL value.

https://github.com/coreos/etcd/blame/master/clientv3/lease.go#L454

Possible Resolution

I just moved the line of ka.nextKeepAlive = nextKeepAlive out of the loop and make it always evaluated, because I think there is no reason not to update the nextKeepAlive even if the channel is not writable. It seems the issue disappeared by this small change.

If the current behavior is intended, I think it's better to mention that users must consume the channel, because requests issued every 500ms can be a critical performance issue in a large environment.

etcd server version: v3.1.2
etcd client version: v3.1.2
OS: Linux (Ubuntu 16.04)

@heyitsanthony
Copy link
Contributor

This is intentional. If the consumer misses the keepalive then it'll also miss the TTL. The client will send another keepalive so that it can try to send the client an up-to-date TTL once it gets a response.

Another option is to cache the TTL and decrement the amount of time it waited until it could post to the channel.

I think it's better to mention that users must consume the channel, because requests issued every 500ms can be a critical performance issue in a large environment.

Yeah, it should probably be more clear about that.

@yudai
Copy link
Contributor Author

yudai commented Mar 9, 2017

@heyitsanthony Thank you for the reply. I understand the intention.

If possible, I think it would be better to have another way to deliver up-to-date TTLs to consumers' channels, because when you have 10k+ clients with some leases, the 500ms iteration can kill your cluster (I did with code like above). The current behavior could be a small, but critical trap for client users. As you suggested, doing something in the client side sounds better to me. Or, I feel like we can simply let consumers miss responses when it's busy or not consuming the channel.

@yudai yudai closed this as completed Mar 9, 2017
@xiang90
Copy link
Contributor

xiang90 commented Mar 9, 2017

If possible, I think it would be better to have another way to deliver up-to-date TTLs to consumers' channels, because when you have 10k+ clients with some leases, the 500ms iteration can kill your cluster (I did with code like above).

You might want to try the gRPC-proxy, which will effectively batch lease renew requests to reduce the load on the actual core servers. But, yea, 500ms might still be a problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants