Multiple parallel Get requests can cause heartbeat sends to fail, causing KeepAlive to close. #8114

Closed
SuhasAnand opened this issue Jun 16, 2017 · 11 comments

Comments

@SuhasAnand

We have producer(s) producing ~3 million key-value pairs (with a TTL attached), with the intention that if a producer dies, all the keys associated with it expire and are automatically removed. Consumers, on the other hand, do a Get request on a prefix when they come up and then start watching at revision+1 after the Get.

In this setup, if a Get with a prefix matches ~2 million keys and multiple consumers request them at the same time, the cluster gets so busy that the producer starts seeing the error "etcdserver: request timed out" (which is very strange to me: why should a bad GET request prevent a PUT from happening?), and this is very soon followed by the KeepAlive (TTL) getting closed, which is the bigger problem.

Also, IMO, there should be a way to issue a GET like an Elasticsearch scroll, i.e. it should be possible to say "GET me 10000 keys that match this prefix", followed by "get me the next 10000 keys that match the prefix", and so on. Today there is a way to limit to N responses, but as far as I know there is no way to say "give me the next N responses".
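For reference, the v3 client can already emulate a scroll by combining a limit with a continuation key. Below is a minimal sketch, assuming the 3.2-era Go clientv3 API, a hypothetical endpoint localhost:2379, and an illustrative prefix /producers/; it pins the revision from the first page so later pages see a consistent view.

```go
// Sketch only: pages through all keys under a prefix, `limit` at a time,
// pinned to the revision of the first response so the view stays consistent.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // hypothetical endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	const prefix = "/producers/" // illustrative prefix
	const limit = 10000

	key := prefix
	end := clientv3.GetPrefixRangeEnd(prefix)
	rev := int64(0) // 0 = latest; pinned after the first page

	for {
		opts := []clientv3.OpOption{
			clientv3.WithRange(end),
			clientv3.WithLimit(limit),
			clientv3.WithSort(clientv3.SortByKey, clientv3.SortAscend),
		}
		if rev != 0 {
			opts = append(opts, clientv3.WithRev(rev))
		}
		resp, err := cli.Get(context.Background(), key, opts...)
		if err != nil {
			log.Fatal(err)
		}
		if rev == 0 {
			rev = resp.Header.Revision
		}
		for _, kv := range resp.Kvs {
			fmt.Printf("%s\n", kv.Key)
		}
		if !resp.More {
			break
		}
		// the next page starts just after the last key returned
		key = string(resp.Kvs[len(resp.Kvs)-1].Key) + "\x00"
	}
}
```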

The setup is a 5 node dedicated baremetal etcd cluster.
These are the leader logs when it happens:
apply entries took too long [20.677294594s for 1 entries]
Jun 15 16:06:31 etcd[5797]: avoid queries with large range/delete range!
Jun 15 16:06:32 etcd[5797]: failed to send out heartbeat on time (exceeded the 100ms timeout for 32.54762ms)
Jun 15 16:06:32 etcd[5797]: server is likely overloaded
Jun 15 16:06:32 etcd[5797]: failed to send out heartbeat on time (exceeded the 100ms timeout for 49.415013ms)
Jun 15 16:06:32 etcd[5797]: server is likely overloaded
Jun 15 16:06:32 etcd[5797]: failed to send out heartbeat on time (exceeded the 100ms timeout for 49.560233ms)
Jun 15 16:06:32 etcd[5797]: server is likely overloaded
Jun 15 16:06:32 etcd[5797]: failed to send out heartbeat on time (exceeded the 100ms timeout for 56.4179ms)
Jun 15 16:06:32 etcd[5797]: server is likely overloaded
Jun 15 16:06:32 etcd[5797]: apply entries took too long [236.105178ms for 1 entries]
Jun 15 16:06:32 etcd[5797]: avoid queries with large range/delete range!
Jun 15 16:07:42 etcd[5797]: failed to send out heartbeat on time (exceeded the 100ms timeout for 169.795873ms)
Jun 15 16:07:42 etcd[5797]: server is likely overloaded
Jun 15 16:07:42 etcd[5797]: failed to send out heartbeat on time (exceeded the 100ms timeout for 169.926856ms)
Jun 15 16:07:42 etcd[5797]: server is likely overloaded
Jun 15 16:07:42 etcd[5797]: failed to send out heartbeat on time (exceeded the 100ms timeout for 169.959775ms)
Jun 15 16:07:42 etcd[5797]: server is likely overloaded
Jun 15 16:07:42 etcd[5797]: failed to send out heartbeat on time (exceeded the 100ms timeout for 169.99141ms)
Jun 15 16:07:42 etcd[5797]: server is likely overloaded
Jun 15 16:07:45 etcd[5797]: apply entries took too long [21.500452938s for 1 entries]
Jun 15 16:07:45 etcd[5797]: avoid queries with large range/delete range!
Jun 15 16:10:00 etcd[5797]: start to snapshot (applied: 27403129, lastsnap: 27398128)
Jun 15 16:10:00 etcd[5797]: saved snapshot at index 27403129
Jun 15 16:10:00 etcd[5797]: compacted raft log at 27398129
Jun 15 16:10:15 etcd[5797]: purged file /var/lib/etcd/member/snap/0000000000000009-0000000001a1c1bf.snap successfully
Jun 15 16:14:17 etcd[5797]: grpc: Server.processUnaryRPC failed to write status: stream error: code = Canceled desc = "context canceled"

@heyitsanthony
Contributor

@SuhasAnand
Author

@heyitsanthony I am aware that etcd supports pagination; what I was looking for is some kind of cached pagination to reduce the load on the cluster. What is happening even using the current pagination approach is that multiple consumers doing a Get request can push the cluster to the point where it stops accepting PUTs or refreshing TTL/KeepAlive. All I am saying is that it is better to throttle or delay Get requests than to be unable to refresh TTLs. For watch, there is etcd grpc-proxy; should there not be one for Range GET as well?

@heyitsanthony
Contributor

@SuhasAnand What's in the Range RPC messages that are being sent to etcd, and what options are enabled? Also try upgrading to 3.2; it has some backend concurrency improvements that may help.

@xiang90
Contributor

xiang90 commented Jun 16, 2017

What is happening even using the current pagination approach

have you tried it?

there is etcd grpc-proxy, should there not be one for Range GET as well ?

grpc-proxy also caches ranges if you do it correctly.

as @heyitsanthony mentioned, try pagination, try 3.2, try the proxy, and make sure you do a serializable read (s-read), not a linearizable read (l-read), with a specific revision when recovering.
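For readers following along, a minimal sketch of what the s-read advice might look like with the Go clientv3 API; the package, function name, and parameters are illustrative. A serializable read is answered from the local member's (or a grpc-proxy's) store without a quorum round trip, and pinning the revision keeps every page of a recovery consistent.

```go
// Sketch only: a serializable range read pinned to a known revision.
package recovery

import (
	"context"

	"github.com/coreos/etcd/clientv3"
)

// pageAt fetches one page of keys under prefix at a fixed revision using a
// serializable read, so recovery traffic does not go through raft quorum.
func pageAt(ctx context.Context, cli *clientv3.Client, prefix string, rev, limit int64) (*clientv3.GetResponse, error) {
	return cli.Get(ctx, prefix,
		clientv3.WithPrefix(),
		clientv3.WithSerializable(), // s-read: no linearizability round trip
		clientv3.WithRev(rev),       // pin to the revision the consumer is syncing to
		clientv3.WithLimit(limit),
		clientv3.WithSort(clientv3.SortByKey, clientv3.SortAscend),
	)
}
```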

@SuhasAnand
Author

SuhasAnand commented Jun 16, 2017

have you tried it?

Yes; otherwise the effect is instantaneous. However, in the example @heyitsanthony pointed to, the batch limit is 1000; we are currently using a range limit of 100K. In the worst case, when recovering, if all the consumers (say, hundreds of them) start requesting at once, we start seeing this issue.

grpc-proxy also caches ranges if you do it correctly.
and make sure you do a serializable read (s-read), not a linearizable read (l-read), with a specific revision when recovering.

Will try it out...

@xiang90
Contributor

xiang90 commented Jun 16, 2017

100K

Reduce that to 10k, I guess... 100k is still too aggressive.

@xiang90
Contributor

xiang90 commented Jun 29, 2017

@SuhasAnand kindly ping. any result?

@SuhasAnand
Author

@xiang90 still running stress tests, I will update this by tomorrow evening at the latest

@SuhasAnand
Author

@xiang90, upgraded both the etcd client and server to the latest 3.2.1.
grpc-proxy is still not production ready; I came across lots of issues with it (will create separate issues for them later), so I initially ran with grpc-proxy and later without it.

I see the same with a 10K limit. Reducing to 1k is still not an option because the consumers are watching a range of ~2M keys, and limiting to 1k would increase the sync time from ~5-6 minutes to >33 minutes. And as we increase the number of consumers, even in the 1k case we will start seeing the same issue.
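To tie the suggestions together, here is a minimal sketch of the recover-then-watch pattern under discussion, again assuming the Go clientv3 API with illustrative names: once a consumer has paged through the prefix at revision syncRev (for example via pageAt above), it resumes watching from syncRev+1 so no events between the sync and the watch are lost.

```go
// Sketch only: after paginating the prefix at revision syncRev, resume
// watching from syncRev+1 so no intervening events are missed.
package recovery

import (
	"context"
	"fmt"

	"github.com/coreos/etcd/clientv3"
)

func watchFrom(ctx context.Context, cli *clientv3.Client, prefix string, syncRev int64) {
	wch := cli.Watch(ctx, prefix,
		clientv3.WithPrefix(),
		clientv3.WithRev(syncRev+1), // first event strictly after the synced revision
	)
	for wresp := range wch {
		if err := wresp.Err(); err != nil {
			fmt.Println("watch error:", err)
			return
		}
		for _, ev := range wresp.Events {
			fmt.Printf("%s %q -> %q\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}
```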

@xiang90
Contributor

xiang90 commented Jul 5, 2017

upgraded both the etcd client and server to the latest 3.2.1.
grpc-proxy is still not production ready; I came across lots of issues with it (will create separate issues for them later), so I initially ran with grpc-proxy and later without it.

If you want to scale reads, the proxy is the way to go; we will not try to optimize core etcd too much for reads. For your use case, we should collaborate to make grpc-proxy work better.
The core etcd team has not put a lot of effort into improving grpc-proxy simply because most of our users do not need it. If you need it, please contribute to it.

@stale

stale bot commented Apr 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 7, 2020
@stale stale bot closed this as completed Apr 28, 2020