ReliableFetch performs poorly when Redis server has many connections #2431
Wow, 3.5 yrs of heavy Redis usage and I learn something new.
I think … The … WDYT?
BTW Sidekiq's default server connection pool sizing is very liberal. If you've got 25 worker threads, you can get away with …
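For illustration, a minimal sketch of trimming the server-side pool, assuming Sidekiq's :size redis option; the value 27 is purely illustrative, not a figure from this thread:

```ruby
require "sidekiq"

# Hypothetical example: give the Sidekiq server a right-sized Redis pool
# instead of an oversized one (e.g. 25 worker threads plus a little headroom).
Sidekiq.configure_server do |config|
  config.redis = { url: ENV["REDIS_URL"], size: 27 }
end
```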
Hmm, I get confused: do I want hz to be higher when I have more clients? The … Do you think it'd be possible to add this logging right after a fetch exception timeout? It'd make the current error much more helpful, if people happen to find it. Logging it on boot may be misleading if the number of connected clients is still climbing after restarting the background cluster. If this fetch timeout could be detected, I'd also be very interested in checking the private queues. Some combination of visibility (log messages), prevention (tuning hz), and mitigation (requeueing faster) could go a long way.
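For illustration, a rough sketch of the logging being proposed, assuming a hand-rolled rescue around a blocking fetch; the queue name and placement are hypothetical, not Sidekiq's actual fetch code:

```ruby
require "sidekiq"

# Hypothetical sketch: after a fetch timeout, log how many clients the Redis
# server currently has, so the error is easier to diagnose in the wild.
begin
  job = Sidekiq.redis { |conn| conn.brpop("queue:default", timeout: 1) }
rescue Redis::TimeoutError => e
  clients = Sidekiq.redis { |conn| conn.info("clients")["connected_clients"] }
  Sidekiq.logger.warn("fetch timed out (#{e.message}); connected_clients=#{clients}")
  raise
end
```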
Yeah, our client count is very overblown. It's partly a reflection of not splitting off a dedicated Redis host for our background queues, and partly my fault for going with a conservative …
Hah, classic. 💥 🍩
Oh, one more thought: tuning … Dare I ask: is the blocking operation worthwhile in this scenario? Pausing for a second would be more reliable, at the cost of possibly idling longer than necessary. I just can't imagine a single right answer here.
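To make the trade-off concrete, a small sketch contrasting the two idling strategies; the queue name is made up and neither snippet is ReliableFetch's real code:

```ruby
require "sidekiq"

# Blocking idle: wakes as soon as a job arrives, but is subject to the
# server-side timeout behavior discussed in this issue.
job = Sidekiq.redis { |conn| conn.brpop("queue:critical", timeout: 1) }

# Polling idle: pause client-side, then do a non-blocking pop. Predictable
# behavior on the Redis side, at the cost of up to a second of extra idle time.
sleep 1
job = Sidekiq.redis { |conn| conn.rpop("queue:critical") }
```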
I really can't understand the Redis internal logic for hz; it's too complicated. The docs say we should increase …
I think this bit is the key: https://github.com/antirez/redis/blob/b96af595a5fddbbdcbf78ed3c51acd60976416f4/src/redis.c#L984-L985. If I'm reading it right, Redis will only check fewer than 50 clients per cron call when fewer than 50 clients exist.
Added this to the wiki, thoughts? https://github.com/mperham/sidekiq/wiki/Using-Redis#tuning
I'm curious if we should expose the ability to not use … We have a lot of connected clients as well. Taking a look at our …
👍 maybe link back to discussion for folks who want to understand more?
Keep in mind, this is still a problem for non-Pro people. Sidekiq uses BRPOP which has the same timeout issue. If you have lots of clients and use blocking operations, you need to increase hz. |
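For reference, one way to apply that hz tuning from Ruby (equivalent to hz 50 in redis.conf or CONFIG SET hz 50 in redis-cli); the value 50 is only an example:

```ruby
require "sidekiq"

# Raise hz so the client-timeout sweep covers more connections per second.
# Pick the value based on your connected client count.
Sidekiq.redis do |conn|
  conn.config(:set, "hz", "50")
  puts conn.config(:get, "hz").inspect
end
```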
Oh, wow, yeah. And that algorithm doesn't have private queues to fall back on. |
Hello, I'm on it; news in the coming days. Thanks for tracking this issue!
🎉 |
First feedback:
Even if the timeout can get longer, if the client is blocked and a message arrives on a list, the client is correctly unblocked and the message is sent to it. If you are seeing something different, maybe it's the client library implementation: when it detects the connection is broken, it doesn't bother delivering the last (received) reply to the caller? Otherwise, if I'm wrong, could you please send me more details? For now I'm fixing the non-deterministic timeout behavior, which should not depend on HZ so much: it may make a small rounding-error difference, but the return time should be the same, and HZ should only tune how much work is done per iteration, so that with a higher HZ you get the same work per second, just more evenly distributed across the second. As it is today, HZ dramatically changes the total work done, which is bad. News ASAP. Thanks.
I just pushed a fix for the timeout into the 3.0 branch (and others). It will be part of Redis 3.0.3 that will be released in the next 24h. Please could you check how things improved for you? I did my testings and everything looks fine now. More information in the commit message here: |
@antirez that was quick! The commit looks great. I'll aim to test it out today.
The problem I observed was when the message arrived on the list after the client timeout (5 seconds) but before the server timeout (1-10 seconds):
I think redis/redis@25e1cb3 fixes this perfectly. Now it should never block longer than 2 seconds total.
Thanks @cainlevy, so it's the Ruby side that actively disconnects the client on client-side timeout. Now I understand. Probably (but it's just a long shot) what happens is that the socket waits for the Ruby garbage collector in order to be reclaimed (and actually closed)? If possible, the Ruby client should, on timeout, call the appropriate method to close the underlying TCP socket ASAP, so that the closed connection is sensed by Redis (which does not need to check the client: an event is raised in the event loop immediately). Messages can still be lost, of course; there is no way with TCP to guarantee a reliable channel when disconnections happen (short of resending & acknowledgement), but the window is greatly diminished. Thanks for checking the commit, it should go live tomorrow. Unfortunately I'm not fixing 2.8 as well, since I can't just cherry-pick and the details are different. Given that we have started abandoning the 2.8 release except for critical issues, I'm not going to change it. EDIT: to be clear, if Ruby closes the connection ASAP, this is what should happen:
As you can see there is still a window for race conditions but much shorter.
Thank you @antirez! Is there no way to perform client timeout processing in a separate thread, to reduce latency? |
You are welcome @mperham. There is no computationally intensive task adding latency here: checking even 10k clients per second does not introduce any latency, especially when the work is spread incrementally across the HZ function calls. The problem is that currently we brute-force it by rotating through the list of clients searching for timed-out ones. This is technically dumb, so when I tried to fix this issue I thought about ordering clients by timeout in a skiplist (or any other data structure with O(log N) insert and O(1) MIN/MAX query). That way you only process clients that are actually in timeout, and unblock/release them. However, following the rule "try the simplest thing that works", I checked whether I could measure any cost from processing 300/400 clients per iteration (with 3000/4000 clients connected) and was not able to measure any difference. So for now I went with the simplest solution, not involving any skiplist. It helps that it's just a matter of raising HZ to keep it down to 300/400 clients processed per function call if there are many more clients, like 40k connected at the same time. If I go for the skiplist, I'll also tune the event loop to exit right when the next client is going to expire, to get 10ms granularity on expires. Doing that with the current internals would just make the code more complex, with some noticeable gain in precision but not to the extent I would love to see.
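A simplified Ruby sketch of the idea described above, ordering blocked clients by their timeout deadline so only expired clients are examined; a sorted array stands in for the skiplist, so inserts here are not actually O(log N):

```ruby
# Keep blocked clients indexed by deadline instead of scanning the whole
# client list on every cron call.
class TimeoutIndex
  def initialize
    @entries = [] # [deadline, client] pairs, kept sorted by deadline
  end

  def add(client, deadline)
    idx = @entries.bsearch_index { |(d, _)| d >= deadline } || @entries.size
    @entries.insert(idx, [deadline, client])
  end

  # Pop and return every client whose deadline has passed.
  def expired(now)
    timed_out = []
    timed_out << @entries.shift[1] while @entries.any? && @entries.first[0] <= now
    timed_out
  end
end

index = TimeoutIndex.new
index.add(:client_a, Time.now + 1)
index.add(:client_b, Time.now + 5)
sleep 2
p index.expired(Time.now) # => [:client_a]
```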
@antirez I found a moment to try redis/redis@25e1cb3, and it works beautifully. With 2040 connected clients, …
Works for me. We'll tune … @mperham Your call on when to close this issue. I'm happy!
🎉 Thank you everyone. If you are on Redis < 3.0.3, increase hz above the default of 10, or upgrade to 3.0.3.
Thanks everyone! |
Blocking Operation Timeouts
When Redis has many connections, blocking operations can take unexpectedly long to complete [https://github.com//issues/2311]. Redis enforces timeouts on the server by walking through the client list in chunks, guaranteeing a full pass only once every 10 seconds.
It looks like Redis checks 50 clients per cron call (500 per second with the default hz=10) unless the total number of clients exceeds 5000. So with 500 clients or fewer, Redis will return from a blocking operation of N seconds within N + 1 seconds. With 2500 clients, for example, that goes up to N + 5 seconds.
The Redis client has a default timeout of 5.0 seconds. So if the server has 2500 connections, and returns from a blocking operation of 1 second within 1-6 seconds, we have a 20% chance of the client timing out.
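A back-of-the-envelope version of that math, based on the pre-3.0.3 behavior described above (50 clients checked per cron call, hz calls per second, valid up to roughly 5000 clients):

```ruby
# Worst-case extra delay added to a blocking operation: the time for one
# full sweep over all connected clients.
def worst_case_extra_delay(clients, hz: 10)
  checked_per_second = 50 * hz
  (clients.to_f / checked_per_second).ceil
end

worst_case_extra_delay(500)   # => 1 second
worst_case_extra_delay(2500)  # => 5 seconds
worst_case_extra_delay(4300)  # => 9 seconds
```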
We have a fairly ridiculous ~4300 connections right now, so blocking operations take up to ~9 seconds and have nearly a 50% chance of timeout. Here's to you, concurrency: 1. 🍹
Reliable Fetch:
When ReliableFetch idles with a blocking operation, jobs may be pushed into the queue after the client has timed out but before the server has checked the connection. The blocking operation succeeds, but no one is listening.
The good news is that the job still exists in the private queue. The bad news is that it will only be recovered when the process restarts. With stable processes, this could be a while. To make matters worse, since ReliableFetch idles by blocking on the most important queue, it will only misplace the most important jobs.
Possible Solution:
Maybe check the private queue after recovering from a fetch exception?
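A rough sketch of what that mitigation could look like; the private-queue key name is hypothetical, since ReliableFetch's actual naming is internal to Sidekiq Pro:

```ruby
require "sidekiq"

# After recovering from a fetch exception, push any jobs stranded in this
# process's private queue back onto the public queue so another worker can
# pick them up. Key names are made up for illustration.
def recover_private_queue(private_queue, public_queue)
  Sidekiq.redis do |conn|
    while conn.rpoplpush(private_queue, public_queue)
      Sidekiq.logger.info("requeued an orphaned job from #{private_queue}")
    end
  end
end

recover_private_queue("queue:critical_worker1:private", "queue:critical")
```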