ReliableFetch performs poorly when Redis server has many connections #2431

Closed
cainlevy opened this issue Jul 12, 2015 · 23 comments

@cainlevy
Contributor

Blocking Operation Timeouts

When Redis has many connections, blocking operations can take unexpectedly long to complete (see #2311). Redis enforces timeouts on the server by walking through the client list in chunks, guaranteeing only one full pass every 10 seconds.

It looks like Redis will check 50 clients per hz tick (500 per second with the default hz=10) unless the total number of clients exceeds 5000 (again at hz=10). So with 500 clients or fewer, Redis will return from a blocking operation of N seconds within N + 1 seconds. With 2500 clients, for example, that goes up to N + 5 seconds.

The Redis client has a default timeout of 5.0 seconds. So if the server has 2500 connections and a 1-second blocking operation returns anywhere within 1-6 seconds, we have a 20% chance of the client timing out.

We have a fairly ridiculous ~4300 connections right now, so blocking operations take up to ~9 seconds and have nearly a 50% chance of timeout. Here's to you, concurrency: 1. 🍹
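
To make the arithmetic concrete, here is a back-of-the-envelope sketch of the numbers above (a model of the behavior described in this report, not Redis's actual code; it assumes the server-side check lands uniformly within the sweep window):

```ruby
# Model of the extra latency described above: ~50 clients checked per hz tick,
# so 500 clients/sec at the default hz=10.
connected_clients = 2500
checks_per_second = 50 * 10
sweep_seconds     = connected_clients.fdiv(checks_per_second)   # => 5.0

block_timeout  = 1.0   # timeout passed to BRPOPLPUSH
client_timeout = 5.0   # redis-rb default read timeout

# The server unblocks us somewhere in [block_timeout, block_timeout + sweep_seconds];
# the client errors out whenever that lands after client_timeout.
overrun   = (block_timeout + sweep_seconds) - client_timeout
p_timeout = [overrun, 0].max / sweep_seconds                     # => 0.2

puts "sweep: #{sweep_seconds}s, chance of client timeout: #{(p_timeout * 100).round}%"
```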

Reliable Fetch:

When ReliableFetch idles with a blocking operation, jobs may be pushed into the queue after the client has timed out but before the server has checked the connection. The blocking operation succeeds, but no one is listening.

The good news is that the job still exists in the private queue. The bad news is that it will only be recovered when the process restarts. With stable processes, this could be a while. To make matters worse, since ReliableFetch idles by blocking on the most important queue, it will only misplace the most important jobs.

Possible Solution:

Maybe check the private queue after recovering from a fetch exception?
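
A rough sketch of that idea, for illustration only (this is not Sidekiq Pro's actual ReliableFetch code; the queue keys are hypothetical, and it assumes nothing is in flight while the fetcher is blocked, e.g. the concurrency: 1 case above, since anything found in the private queue gets requeued):

```ruby
require "sidekiq"
require "socket"

# Hypothetical queue keys; Sidekiq Pro's real private-queue naming may differ.
PUBLIC_QUEUE  = "queue:default"
PRIVATE_QUEUE = "queue:default_#{Socket.gethostname}"

def retrieve_work
  Sidekiq.redis do |conn|
    conn.brpoplpush(PUBLIC_QUEUE, PRIVATE_QUEUE, timeout: 1)
  end
rescue Redis::TimeoutError
  # The server may still complete the BRPOPLPUSH after the client gives up,
  # stranding a job in the private queue until the next restart. Push any
  # such job back onto the public queue instead of leaving it there.
  Sidekiq.redis do |conn|
    while conn.rpoplpush(PRIVATE_QUEUE, PUBLIC_QUEUE)
      Sidekiq.logger.warn "requeued a job orphaned by a fetch timeout"
    end
  end
  nil
end
```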

@mperham
Collaborator

mperham commented Jul 12, 2015

Wow, 3.5 yrs of heavy Redis usage and I learn something new.

@mperham
Collaborator

mperham commented Jul 12, 2015

I think redis-cli info gives us all the info we need to warn the user that this is something they need to reconfigure.

hz:10
config_file:/usr/local/etc/redis.conf
connected_clients:1

The hz number is super confusing, but it sounds like the net outcome is that hz is the upper bound on the number of seconds it takes Redis to check all connections for timeout. With large numbers of connections, hz needs to be lowered to 3-5. Something like this logic is conservative but wouldn't annoy smaller Sidekiq clusters:

puts "change #{config_file}" if connected_clients > 1000 and hz >= 10

WDYT?
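
A runnable version of that check might look like this (a sketch using redis-rb's info; the 1000-client threshold is just the number proposed above, and the hz/config_file fields are the ones shown in this comment):

```ruby
require "redis"

redis = Redis.new
info  = redis.info   # INFO output as a Hash of strings

clients = info["connected_clients"].to_i
hz      = info["hz"].to_i

if clients > 1000 && hz >= 10
  puts "#{clients} connected clients with hz=#{hz}; " \
       "consider changing hz in #{info['config_file']}"
end
```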

@mperham
Collaborator

mperham commented Jul 12, 2015

BTW Sidekiq's default server connection pool sizing is very liberal. If you've got 25 worker threads, you can get away with size: 15 or even 10, but going any smaller risks turning the pool into a contention point. This will reduce your overall connected_clients count.
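
For reference, the server pool can be shrunk with something like this (a sketch; size: 10 is just the example figure above, and the right number depends on your concurrency setting):

```ruby
# config/initializers/sidekiq.rb
Sidekiq.configure_server do |config|
  # A smaller pool lowers connected_clients; too small and the pool becomes
  # a contention point for busy workers.
  config.redis = { url: ENV["REDIS_URL"], size: 10 }
end
```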

@cainlevy
Contributor Author

Hmm, I'm confused: do I want hz to be higher when I have more clients? The numclients/(hz*10) logic seems to guarantee a 10-second maximum per loop, while the iterations < 50 branch seems to guarantee a minimum of 50 clients per hz tick. I was thinking about setting hz=25, so that Redis would check 1250 clients per second and only add 3-4 seconds of latency to blocking operations.

Do you think it'd be possible to add this logging right after a fetch exception timeout? It'd make the current error much more helpful, if people happen to find it. Logging it on boot may be misleading if the number of connected clients is still climbing after restarting the background cluster.

If this fetch timeout could be detected, I'd also be very interested in checking the private queues. Some combination of visibility (log messages), prevention (tuning hz), and mitigation (requeueing faster) could go a long ways.

This will reduce your overall connected_clients count.

Yeah, our client count is very overblown. It's partly a reflection of not splitting off a dedicated Redis host for our background queues, and partly my fault for going with a conservative concurrency: 1.

want_a_hertz_donut?

Hah, classic. 💥 🍩

@cainlevy
Contributor Author

Oh, one more thought: tuning hz to prevent this could be tricky with auto-scaling clusters. Seems easy to forget if you've added another 10% capacity.

Dare I ask: is the blocking operation worthwhile in this scenario? Pausing for a second would be more reliable, at the cost of possibly idling longer than necessary. I just can't imagine a single right answer here.

@mperham
Collaborator

mperham commented Jul 12, 2015

I really can't understand the Redis internal logic for hz; it's too complicated. The docs say we should increase hz to make it more precise, but it looks like that means it will check fewer connections on each hz tick. So if we have 1000 connections and 10 hz, it will check 10 connections every 100 ms; all 1000 are checked in 10 sec. If we change to 20 hz, it will check 5 connections every 50 ms and all 1000 connections will still take 10 sec. What am I missing?

@cainlevy
Contributor Author

I think this bit is the key: https://github.com/antirez/redis/blob/b96af595a5fddbbdcbf78ed3c51acd60976416f4/src/redis.c#L984-L985. If I'm reading it right, Redis will only check fewer than 50 clients per hz tick when fewer than 50 clients exist.
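
Paraphrasing that snippet in Ruby (a rough model of the linked clientsCron() budget, not the actual implementation):

```ruby
# Per-call budget: numclients/(hz*10) checks, with a floor of 50 (or all
# clients when fewer than 50 exist).
MIN_ITERATIONS = 50

def clients_checked_per_call(numclients, hz)
  iterations = numclients / (hz * 10)
  iterations < MIN_ITERATIONS ? [numclients, MIN_ITERATIONS].min : iterations
end

def full_sweep_seconds(numclients, hz)
  numclients.fdiv(clients_checked_per_call(numclients, hz) * hz)
end

puts full_sweep_seconds(1000, 10)    # => 2.0  (50 per call, 10 calls/sec)
puts full_sweep_seconds(4300, 10)    # => 8.6  (the sweep behind the ~9s above)
puts full_sweep_seconds(4300, 25)    # => 3.44 (the hz=25 idea above)
puts full_sweep_seconds(10_000, 10)  # => 10.0 (floor no longer applies)
```

In other words, raising hz only helps because of the 50-client floor; once numclients/(hz*10) exceeds 50, the full sweep settles back to roughly 10 seconds.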

@mperham
Collaborator

mperham commented Jul 15, 2015

Added this to the wiki, thoughts? https://github.com/mperham/sidekiq/wiki/Using-Redis#tuning

@jonhyman
Contributor

I'm curious whether we should expose the ability to skip brpoplpush and instead use an rpoplpush followed by a short sleep, as @cainlevy suggested. Sidekiq Pro could either detect this itself using info or offer a config option.

We have a lot of connected clients as well. Looking at our brpoplpush timings in New Relic right now, the fact that the standard deviation is more than 1 second (and the max is 5 seconds) suggests we may get better performance doing this. We use a custom Fetcher that subclasses ReliableFetch and we already make minimal use of brpoplpush in it, but I'm thinking of changing it.

[screenshot: New Relic timing chart for brpoplpush]
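
A non-blocking variant of that fetch might look roughly like this (a sketch only, not Sidekiq Pro's ReliableFetch; the queue keys and the 1-second sleep are placeholders):

```ruby
require "sidekiq"

def retrieve_work
  job = Sidekiq.redis do |conn|
    # Non-blocking: returns nil immediately when the queue is empty, so the
    # server's client-timeout sweep never comes into play.
    conn.rpoplpush("queue:default", "queue:default_private")
  end
  return job if job

  sleep 1   # idle briefly instead of blocking server-side
  nil
end
```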

@cainlevy
Contributor Author

Added this to the wiki, thoughts? https://github.com/mperham/sidekiq/wiki/Using-Redis#tuning

👍 maybe link back to discussion for folks who want to understand more?

@mperham
Collaborator

mperham commented Jul 15, 2015

Keep in mind, this is still a problem for non-Pro people. Sidekiq uses BRPOP which has the same timeout issue. If you have lots of clients and use blocking operations, you need to increase hz.

@cainlevy
Contributor Author

Keep in mind, this is still a problem for non-Pro people. Sidekiq uses BRPOP which has the same timeout issue.

Oh, wow, yeah. And that algorithm doesn't have private queues to fall back on.

@antirez

antirez commented Jul 15, 2015

Hello, I'm on it, news in the next few days. Thanks for tracking this issue!

@ryansch
Contributor

ryansch commented Jul 15, 2015

🎉

@antirez

antirez commented Jul 16, 2015

First feedback:

  1. Yes, the timeout changes a lot depending on the number of clients. Tuning hz is useful since on each call the function will try to process at least 50 clients, so with HZ=100 it processes 50*100 clients each second. With 5000 clients the average delay in returning is 0.5 seconds. However this is not good; we need something more deterministic, and I'm fixing it.
  2. I don't understand this part of @cainlevy's report:

When ReliableFetch idles with a blocking operation, jobs may be pushed into the queue after the client has timed out but before the server has checked the connection. The blocking operation succeeds, but no one is listening

Even if the timeout can get longer, if the client is blocked and a message arrives in a list, the client is correctly unblocked and the message is sent to the client. If you are seeing something different, maybe it's the client library implementation that, on detecting the connection is broken, doesn't bother delivering the last (received) reply to the caller? Otherwise, if I'm wrong, could you please send me more details?

For now I'm fixing the non-deterministic timeout behavior, which should not depend on HZ so much: there may be a small rounding-error difference, but the return time should be the same, and HZ should only tune how much work is done per iteration, so that with a higher HZ you get the same work per second, just more evenly distributed across the second. As it is today, HZ dramatically changes the total amount of work done, which is bad.

News ASAP. Thanks.

@antirez

antirez commented Jul 16, 2015

I just pushed a fix for the timeout into the 3.0 branch (and others). It will be part of Redis 3.0.3, which will be released in the next 24h. Could you please check how things improve for you? I did my testing and everything looks fine now. More information in the commit message here:

redis/redis@25e1cb3

@cainlevy
Contributor Author

@antirez that was quick! The commit looks great. I'll aim to test it out today.

Even if the timeout can get longer, if the client is blocked and a message arrives into a list, the client is correctly unblocked and the message sent to the client.

The problem I observed was when the message arrived on the list after the client timeout (5 seconds) but before the server timeout (1-10 seconds):

  • 0 seconds - client sends BRPOPLPUSH with 1-second timeout
  • 1 sec - waiting
  • 2 sec - waiting
  • 3 sec - waiting
  • 4 sec - waiting
  • 5 sec - client raises timeout error
  • 6 sec - message arrives on list, client is unblocked
  • 7 sec - server loop checks client

I think redis/redis@25e1cb3 fixes this perfectly. Now it should never block longer than 2 seconds total.

@antirez

antirez commented Jul 16, 2015

Thanks @cainlevy, so it's the Ruby side that actively disconnects the client on a client-side timeout. Now I understand. Probably (but it's just a long shot) what happens is that the socket waits for the Ruby garbage collector to be reclaimed (and actually closed)? If possible the Ruby client should, on timeout, call the appropriate method to close the underlying TCP socket ASAP, so that the closed connection is sensed by Redis (which then does not need to check the client: an event is raised in the event loop immediately). Messages can still be lost of course; there is no way with TCP to guarantee a reliable channel when disconnections happen (short of resending & acknowledgement), but the window is greatly diminished.

Thanks for checking the commit, it should go live tomorrow. Unfortunately I'm not fixing 2.8 as well, since I can't just cherry-pick and the details are different. Given that we've started the process of abandoning the 2.8 release except for critical issues, I'm not going to change it.

EDIT: to be clear, if Ruby closes the connection ASAP this is what should happen:

  • 0 seconds - client sends BRPOPLPUSH with 1-second timeout
  • 1 sec - waiting
  • 2 sec - waiting
  • 3 sec - waiting
  • 4 sec - waiting
  • 5 sec - client raises timeout error and closes the socket
  • 5.01 sec - Redis traps the socket error and disconnects the client, unregistering it from the clients waiting for list operations.
  • 6 sec - message arrives on list, and gets accumulated for the next client.

As you can see there is still a window for race conditions but much shorter.
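
On the Ruby side, the handling antirez describes might look roughly like this (a sketch assuming redis-rb, where close shuts the underlying socket; it is not necessarily what redis-rb or Sidekiq does today):

```ruby
require "redis"

redis = Redis.new(timeout: 5)   # the 5-second client read timeout from above

begin
  job = redis.brpoplpush("queue:default", "queue:default_private", timeout: 1)
rescue Redis::TimeoutError
  # Close the socket immediately so Redis sees the disconnect through its
  # event loop (the 5.01 sec step above) instead of on its next timeout sweep.
  redis.close
  job = nil
end
```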

@mperham
Collaborator

mperham commented Jul 16, 2015

Thank you @antirez!

Is there no way to perform client timeout processing in a separate thread, to reduce latency?

@antirez

antirez commented Jul 16, 2015

You are welcome @mperham. There is no computationally intensive task adding latency here; checking even 10k clients per second does not introduce any latency, especially when the work is spread evenly and incrementally across HZ function calls. The problem is that currently we brute-force the problem by rotating through the list of clients searching for timed-out ones. This is technically dumb, so when I tried to fix this issue I considered ordering clients by timeout in a skiplist (or any other data structure with O(log N) insert and O(1) query for MIN/MAX). That way you only process clients that have actually timed out, and unblock/release them.

However, following the rule "try the simplest thing that works", I checked whether I could measure any cost to processing 300/400 clients per iteration (in the case of 3000/4000 connected clients) and was not able to measure any difference. So I went, at least for now, with the simplest solution, not involving any skiplist.

It helps that it's just a matter of raising HZ to keep it to 300/400 clients processed per function call if there are many more clients, like 40k connected at the same time.

If I go for the skiplist, I'll also tune the event loop to wake at the right time when the next client is going to expire, to get 10ms granularity on expires. Doing that with the current internals would just make the code more complex, with some noticeable gain in precision but not to the extent I would love to see.

@cainlevy
Contributor Author

@antirez I found a moment to try redis/redis@25e1cb3, and it works beautifully. With 2040 connected clients, time redis-cli blpop test 1 takes ~1.1s on Redis 3.1.999, compared to ~7s on Redis 3.0.2.

Unfortunately I'm not fixing 2.8 as well since I can't just cherry pick and the details are different.

Works for me. We'll tune hz until we can upgrade.

@mperham Your call on when to close this issue. I'm happy!

@mperham
Collaborator

mperham commented Jul 16, 2015

🎉 Thank you everyone.

If you are on Redis <3.0.3, increase hz from 10 or upgrade to 3.0.3.

@mperham mperham closed this as completed Jul 16, 2015
@antirez

antirez commented Jul 16, 2015

Thanks everyone!
