Multiple parallel Get requests can cause heartbeat sends to fail, causing KeepAlive to close. #8114

Closed
SuhasAnand opened this issue Jun 16, 2017 · 11 comments

Comments

@SuhasAnand

We have producer(s) producing ~3 million key-value pairs (with a TTL attached), with the intention that if a producer dies, all the keys associated with it expire and are automatically removed. Consumers, on the other hand, do a Get request on a prefix when they come up and then start watching at revision+1 after the Get.

In this setup, if a Get with a prefix matches ~2 million keys and multiple consumers request them at the same time, the cluster gets so busy that the producer starts seeing the error "etcdserver: request timed out" (which is very strange to me: why should a bad GET request prevent a PUT from happening?), and this is very soon followed by the KeepAlive (TTL) getting closed, which is the bigger problem.

Also, IMO, there should be a way to issue a GET like an Elasticsearch scroll, i.e. it should be possible to say "GET me 10000 keys that match this prefix", followed by "get me the next 10000 keys that match the prefix", and so on. Today there is a way to limit to N responses, but as far as I know there is no way to say "give me the next N responses".
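For reference, the v3 client can already emulate a scroll by combining a limit with a continuation key. Below is a minimal sketch, assuming the 3.2-era Go clientv3 API, a hypothetical endpoint localhost:2379, and an illustrative prefix /producers/; it pins the revision from the first page so later pages see a consistent view.

```go
// Sketch only: pages through all keys under a prefix, `limit` at a time,
// pinned to the revision of the first response so the view stays consistent.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // hypothetical endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	const prefix = "/producers/" // illustrative prefix
	const limit = 10000

	key := prefix
	end := clientv3.GetPrefixRangeEnd(prefix)
	rev := int64(0) // 0 = latest; pinned after the first page

	for {
		opts := []clientv3.OpOption{
			clientv3.WithRange(end),
			clientv3.WithLimit(limit),
			clientv3.WithSort(clientv3.SortByKey, clientv3.SortAscend),
		}
		if rev != 0 {
			opts = append(opts, clientv3.WithRev(rev))
		}
		resp, err := cli.Get(context.Background(), key, opts...)
		if err != nil {
			log.Fatal(err)
		}
		if rev == 0 {
			rev = resp.Header.Revision
		}
		for _, kv := range resp.Kvs {
			fmt.Printf("%s\n", kv.Key)
		}
		if !resp.More {
			break
		}
		// the next page starts just after the last key returned
		key = string(resp.Kvs[len(resp.Kvs)-1].Key) + "\x00"
	}
}
```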

The setup is a 5 node dedicated baremetal etcd cluster.
These are the leader logs when it happens:
apply entries took too long [20.677294594s for 1 entries]
Jun 15 16:06:31 etcd[5797]: avoid queries with large range/delete range!
Jun 15 16:06:32 etcd[5797]: failed to send out heartbeat on time (exceeded the 100ms timeout for 32.54762ms)
Jun 15 16:06:32 etcd[5797]: server is likely overloaded
Jun 15 16:06:32 etcd[5797]: failed to send out heartbeat on time (exceeded the 100ms timeout for 49.415013ms)
Jun 15 16:06:32 etcd[5797]: server is likely overloaded
Jun 15 16:06:32 etcd[5797]: failed to send out heartbeat on time (exceeded the 100ms timeout for 49.560233ms)
Jun 15 16:06:32 etcd[5797]: server is likely overloaded
Jun 15 16:06:32 etcd[5797]: failed to send out heartbeat on time (exceeded the 100ms timeout for 56.4179ms)
Jun 15 16:06:32 etcd[5797]: server is likely overloaded
Jun 15 16:06:32 etcd[5797]: apply entries took too long [236.105178ms for 1 entries]
Jun 15 16:06:32 etcd[5797]: avoid queries with large range/delete range!
Jun 15 16:07:42 etcd[5797]: failed to send out heartbeat on time (exceeded the 100ms timeout for 169.795873ms)
Jun 15 16:07:42 etcd[5797]: server is likely overloaded
Jun 15 16:07:42 etcd[5797]: failed to send out heartbeat on time (exceeded the 100ms timeout for 169.926856ms)
Jun 15 16:07:42 etcd[5797]: server is likely overloaded
Jun 15 16:07:42 etcd[5797]: failed to send out heartbeat on time (exceeded the 100ms timeout for 169.959775ms)
Jun 15 16:07:42 etcd[5797]: server is likely overloaded
Jun 15 16:07:42 etcd[5797]: failed to send out heartbeat on time (exceeded the 100ms timeout for 169.99141ms)
Jun 15 16:07:42 etcd[5797]: server is likely overloaded
Jun 15 16:07:45 etcd[5797]: apply entries took too long [21.500452938s for 1 entries]
Jun 15 16:07:45 etcd[5797]: avoid queries with large range/delete range!
Jun 15 16:10:00 etcd[5797]: start to snapshot (applied: 27403129, lastsnap: 27398128)
Jun 15 16:10:00 etcd[5797]: saved snapshot at index 27403129
Jun 15 16:10:00 etcd[5797]: compacted raft log at 27398129
Jun 15 16:10:15 etcd[5797]: purged file /var/lib/etcd/member/snap/0000000000000009-0000000001a1c1bf.snap successfully
Jun 15 16:14:17 etcd[5797]: grpc: Server.processUnaryRPC failed to write status: stream error: code = Canceled desc = "context canceled"

@heyitsanthony
Contributor

@SuhasAnand
Author

@heyitsanthony I am aware that etcd supports pagination; what I was looking for is some kind of cached pagination to reduce the load on the cluster. What is happening even using the current pagination approach is that multiple consumers doing a Get request can push the cluster to the point where it stops accepting PUTs or refreshing TTL/KeepAlive. All I am saying is that it is better to throttle or delay Get requests than to be unable to refresh TTLs. For watch, there is etcd grpc-proxy; should there not be one for Range GET as well?

@heyitsanthony
Contributor

@SuhasAnand What's in the Range RPC messages that are being sent to etcd, and what options are enabled? Also try upgrading to 3.2; it has some backend concurrency improvements that may help.

@xiang90
Contributor

xiang90 commented Jun 16, 2017

What is happening even using the current pagination approach

have you tried it?

there is etcd grpc-proxy, should there not be one for Range GET as well ?

grpc-proxy also caches ranges if you do it correctly.

as @heyitsanthony mentioned, try pagination, try 3.2, try the proxy, and make sure you do a serializable read (s-read), not a linearizable read (l-read), with a specific revision when recovering.
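For readers following along, a minimal sketch of what the s-read advice might look like with the Go clientv3 API; the package, function name, and parameters are illustrative. A serializable read is answered from the local member's (or a grpc-proxy's) store without a quorum round trip, and pinning the revision keeps every page of a recovery consistent.

```go
// Sketch only: a serializable range read pinned to a known revision.
package recovery

import (
	"context"

	"github.com/coreos/etcd/clientv3"
)

// pageAt fetches one page of keys under prefix at a fixed revision using a
// serializable read, so recovery traffic does not go through raft quorum.
func pageAt(ctx context.Context, cli *clientv3.Client, prefix string, rev, limit int64) (*clientv3.GetResponse, error) {
	return cli.Get(ctx, prefix,
		clientv3.WithPrefix(),
		clientv3.WithSerializable(), // s-read: no linearizability round trip
		clientv3.WithRev(rev),       // pin to the revision the consumer is syncing to
		clientv3.WithLimit(limit),
		clientv3.WithSort(clientv3.SortByKey, clientv3.SortAscend),
	)
}
```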

@SuhasAnand
Author

SuhasAnand commented Jun 16, 2017

have you tried it?

Yes; otherwise the effect is instantaneous. However, in the example @heyitsanthony pointed to, the batch limit is 1000; we are currently using a range limit of 100K. In the worst case, when recovering, if all the consumers (say, hundreds of them) start requesting at once, we start seeing this issue.

grpc-proxy also caches ranges if you do it correctly.
and make sure you do a serializable read (s-read), not a linearizable read (l-read), with a specific revision when recovering.

Will try it out...

@xiang90
Contributor

xiang90 commented Jun 16, 2017

100K

Reduce that to 10k, I guess... 100k is still too aggressive.

@xiang90
Contributor

xiang90 commented Jun 29, 2017

@SuhasAnand kindly ping. any result?

@SuhasAnand
Author

@xiang90 still running stress tests, I will update this by tomorrow evening at the latest

@SuhasAnand
Author

@xiang90, upgraded both the etcd client and server to the latest 3.2.1.
grpc-proxy is still not production ready; I came across lots of issues with it (will create separate issues for them later), so I initially ran with grpc-proxy and later without it.

I see the same with a 10K limit. Reducing to 1k is still not an option because the consumers are watching a range of ~2M keys, and limiting to 1k would increase the sync time from ~5-6 minutes to >33 minutes. And as we increase the number of consumers, even in the 1k case we will start seeing the same issue.
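To tie the suggestions together, here is a minimal sketch of the recover-then-watch pattern under discussion, again assuming the Go clientv3 API with illustrative names: once a consumer has paged through the prefix at revision syncRev (for example via pageAt above), it resumes watching from syncRev+1 so no events between the sync and the watch are lost.

```go
// Sketch only: after paginating the prefix at revision syncRev, resume
// watching from syncRev+1 so no intervening events are missed.
package recovery

import (
	"context"
	"fmt"

	"github.com/coreos/etcd/clientv3"
)

func watchFrom(ctx context.Context, cli *clientv3.Client, prefix string, syncRev int64) {
	wch := cli.Watch(ctx, prefix,
		clientv3.WithPrefix(),
		clientv3.WithRev(syncRev+1), // first event strictly after the synced revision
	)
	for wresp := range wch {
		if err := wresp.Err(); err != nil {
			fmt.Println("watch error:", err)
			return
		}
		for _, ev := range wresp.Events {
			fmt.Printf("%s %q -> %q\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}
```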

@xiang90
Contributor

xiang90 commented Jul 5, 2017

upgraded both the etcd client and server to the latest 3.2.1.
grpc-proxy is still not production ready; I came across lots of issues with it (will create separate issues for them later), so I initially ran with grpc-proxy and later without it.

If you want to scale reads, the proxy is the way to go; we will not try to optimize core etcd too much for reads. For your use case, we should collaborate to make grpc-proxy work better.
The core etcd team has not put a lot of effort into improving grpc-proxy simply because most of our users do not need it. If you need it, please contribute to it.

@stale

stale bot commented Apr 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 7, 2020
@stale stale bot closed this as completed Apr 28, 2020