ReliableFetch performs poorly when Redis server has many connections #2431
Wow, 3.5 yrs of heavy Redis usage and I learn something new.
I think … The … WDYT?
BTW Sidekiq's default server connection pool sizing is very liberal. If you've got 25 worker threads, you can get away with …
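For illustration, a minimal sketch of trimming the server-side pool, assuming Sidekiq's :size redis option; the value 27 is purely illustrative, not a figure from this thread:

```ruby
require "sidekiq"

# Hypothetical example: give the Sidekiq server a right-sized Redis pool
# instead of an oversized one (e.g. 25 worker threads plus a little headroom).
Sidekiq.configure_server do |config|
  config.redis = { url: ENV["REDIS_URL"], size: 27 }
end
```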
Hmm, I get confused: do I want hz to be higher when I have more clients? The … Do you think it'd be possible to add this logging right after a fetch exception timeout? It'd make the current error much more helpful, if people happen to find it. Logging it on boot may be misleading if the number of connected clients is still climbing after restarting the background cluster. If this fetch timeout could be detected, I'd also be very interested in checking the private queues. Some combination of visibility (log messages), prevention (tuning hz), and mitigation (requeueing faster) could go a long way.
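For illustration, a rough sketch of the logging being proposed, assuming a hand-rolled rescue around a blocking fetch; the queue name and placement are hypothetical, not Sidekiq's actual fetch code:

```ruby
require "sidekiq"

# Hypothetical sketch: after a fetch timeout, log how many clients the Redis
# server currently has, so the error is easier to diagnose in the wild.
begin
  job = Sidekiq.redis { |conn| conn.brpop("queue:default", timeout: 1) }
rescue Redis::TimeoutError => e
  clients = Sidekiq.redis { |conn| conn.info("clients")["connected_clients"] }
  Sidekiq.logger.warn("fetch timed out (#{e.message}); connected_clients=#{clients}")
  raise
end
```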
Yeah, our client count is very overblown. It's partly a reflection of not splitting off a dedicated Redis host for our background queues, and partly my fault for going with a conservative …
Hah, classic. 💥 🍩
Oh, one more thought: tuning … Dare I ask: is the blocking operation worthwhile in this scenario? Pausing for a second would be more reliable, at the cost of possibly idling longer than necessary. I just can't imagine a single right answer here.
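To make the trade-off concrete, a small sketch contrasting the two idling strategies; the queue name is made up and neither snippet is ReliableFetch's real code:

```ruby
require "sidekiq"

# Blocking idle: wakes as soon as a job arrives, but is subject to the
# server-side timeout behavior discussed in this issue.
job = Sidekiq.redis { |conn| conn.brpop("queue:critical", timeout: 1) }

# Polling idle: pause client-side, then do a non-blocking pop. Predictable
# behavior on the Redis side, at the cost of up to a second of extra idle time.
sleep 1
job = Sidekiq.redis { |conn| conn.rpop("queue:critical") }
```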
I really can't understand the Redis internal logic for hz; it's too complicated. The docs say we should increase …
I think this bit is the key: https://github.com/antirez/redis/blob/b96af595a5fddbbdcbf78ed3c51acd60976416f4/src/redis.c#L984-L985. If I'm reading it right, Redis will only check fewer than 50 clients per cron call when fewer than 50 clients exist.
Added this to the wiki, thoughts? https://github.com/mperham/sidekiq/wiki/Using-Redis#tuning
I'm curious if we should expose the ability to not use … We have a lot of connected clients as well. Taking a look at our …
👍 maybe link back to discussion for folks who want to understand more?
Keep in mind, this is still a problem for non-Pro people. Sidekiq uses BRPOP which has the same timeout issue. If you have lots of clients and use blocking operations, you need to increase hz. |
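For reference, one way to apply that hz tuning from Ruby (equivalent to hz 50 in redis.conf or CONFIG SET hz 50 in redis-cli); the value 50 is only an example:

```ruby
require "sidekiq"

# Raise hz so the client-timeout sweep covers more connections per second.
# Pick the value based on your connected client count.
Sidekiq.redis do |conn|
  conn.config(:set, "hz", "50")
  puts conn.config(:get, "hz").inspect
end
```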
Oh, wow, yeah. And that algorithm doesn't have private queues to fall back on. |
Hello, I'm on it; news in the coming days. Thanks for tracking this issue!
🎉 |
First feedback:
Even if the timeout can get longer, if the client is blocked and a message arrives on a list, the client is correctly unblocked and the message is sent to it. If you are seeing something different, maybe it's the client library implementation: when it detects the connection is broken, it doesn't bother delivering the last (received) reply to the caller? Otherwise, if I'm wrong, could you please send me more details? For now I'm fixing the non-deterministic timeout behavior, which should not depend on HZ so much: it may make a small rounding-error difference, but the return time should be the same, and HZ should only tune how much work is done per iteration, so that with a higher HZ you get the same work per second, just more evenly distributed across the second. As it is today, HZ dramatically changes the total work done, which is bad. News ASAP. Thanks.
I just pushed a fix for the timeout into the 3.0 branch (and others). It will be part of Redis 3.0.3 that will be released in the next 24h. Please could you check how things improved for you? I did my testings and everything looks fine now. More information in the commit message here: |
@antirez that was quick! The commit looks great. I'll aim to test it out today.
The problem I observed was when the message arrived on the list after the client timeout (5 seconds) but before the server timeout (1-10 seconds):
I think redis/redis@25e1cb3 fixes this perfectly. Now it should never block longer than 2 seconds total.
Thanks @cainlevy, so it's the Ruby side that actively disconnects the client on client-side timeout. Now I understand. Probably (but it's just a long shot) what happens is that the socket waits for the Ruby garbage collector in order to be reclaimed (and actually closed)? If possible, the Ruby client should, on timeout, call the appropriate method to close the underlying TCP socket ASAP, so that the closed connection is sensed by Redis (which does not need to check the client: an event is raised in the event loop immediately). Messages can still be lost, of course; there is no way with TCP to guarantee a reliable channel when disconnections happen (short of resending & acknowledgement), but the window is greatly diminished. Thanks for checking the commit, it should go live tomorrow. Unfortunately I'm not fixing 2.8 as well, since I can't just cherry-pick and the details are different. Given that we have started abandoning the 2.8 release except for critical issues, I'm not going to change it. EDIT: to be clear, if Ruby closes the connection ASAP, this is what should happen:
As you can see there is still a window for race conditions but much shorter.
Thank you @antirez! Is there no way to perform client timeout processing in a separate thread, to reduce latency? |
You are welcome @mperham. There is no computationally intensive task adding latency here: checking even 10k clients per second does not introduce any latency, especially when the work is spread incrementally across the HZ function calls. The problem is that currently we brute-force it by rotating through the list of clients searching for timed-out ones. This is technically dumb, so when I tried to fix this issue I thought about ordering clients by timeout in a skiplist (or any other data structure with O(log N) insert and O(1) MIN/MAX query). That way you only process clients that are actually in timeout, and unblock/release them. However, following the rule "try the simplest thing that works", I checked whether I could measure any cost from processing 300/400 clients per iteration (with 3000/4000 clients connected) and was not able to measure any difference. So for now I went with the simplest solution, not involving any skiplist. It helps that it's just a matter of raising HZ to keep it down to 300/400 clients processed per function call if there are many more clients, like 40k connected at the same time. If I go for the skiplist, I'll also tune the event loop to exit right when the next client is going to expire, to get 10ms granularity on expires. Doing that with the current internals would just make the code more complex, with some noticeable gain in precision but not to the extent I would love to see.
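A simplified Ruby sketch of the idea described above, ordering blocked clients by their timeout deadline so only expired clients are examined; a sorted array stands in for the skiplist, so inserts here are not actually O(log N):

```ruby
# Keep blocked clients indexed by deadline instead of scanning the whole
# client list on every cron call.
class TimeoutIndex
  def initialize
    @entries = [] # [deadline, client] pairs, kept sorted by deadline
  end

  def add(client, deadline)
    idx = @entries.bsearch_index { |(d, _)| d >= deadline } || @entries.size
    @entries.insert(idx, [deadline, client])
  end

  # Pop and return every client whose deadline has passed.
  def expired(now)
    timed_out = []
    timed_out << @entries.shift[1] while @entries.any? && @entries.first[0] <= now
    timed_out
  end
end

index = TimeoutIndex.new
index.add(:client_a, Time.now + 1)
index.add(:client_b, Time.now + 5)
sleep 2
p index.expired(Time.now) # => [:client_a]
```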
@antirez I found a moment to try redis/redis@25e1cb3, and it works beautifully. With 2040 connected clients, …
Works for me. We'll tune … @mperham Your call on when to close this issue. I'm happy!
🎉 Thank you everyone. If you are on Redis < 3.0.3, increase hz above the default of 10, or upgrade to 3.0.3.
Thanks everyone! |
Blocking Operation Timeouts
When Redis has many connections, blocking operations can take unexpectedly long to complete [https://github.com//issues/2311]. Redis enforces timeouts on the server by walking through the client list in chunks, guaranteeing a full pass only once every 10 seconds.
It looks like Redis checks 50 clients per cron call (500 per second with the default hz=10) unless the total number of clients exceeds 5000. So with 500 clients or fewer, Redis will return from a blocking operation of N seconds within N + 1 seconds. With 2500 clients, for example, that goes up to N + 5 seconds.
The Redis client has a default timeout of 5.0 seconds. So if the server has 2500 connections, and returns from a blocking operation of 1 second within 1-6 seconds, we have a 20% chance of the client timing out.
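A back-of-the-envelope version of that math, based on the pre-3.0.3 behavior described above (50 clients checked per cron call, hz calls per second, valid up to roughly 5000 clients):

```ruby
# Worst-case extra delay added to a blocking operation: the time for one
# full sweep over all connected clients.
def worst_case_extra_delay(clients, hz: 10)
  checked_per_second = 50 * hz
  (clients.to_f / checked_per_second).ceil
end

worst_case_extra_delay(500)   # => 1 second
worst_case_extra_delay(2500)  # => 5 seconds
worst_case_extra_delay(4300)  # => 9 seconds
```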
We have a fairly ridiculous ~4300 connections right now, so blocking operations take up to ~9 seconds and have nearly a 50% chance of timeout. Here's to you, concurrency: 1. 🍹
Reliable Fetch:
When ReliableFetch idles with a blocking operation, jobs may be pushed into the queue after the client has timed out but before the server has checked the connection. The blocking operation succeeds, but no one is listening.
The good news is that the job still exists in the private queue. The bad news is that it will only be recovered when the process restarts. With stable processes, this could be a while. To make matters worse, since ReliableFetch idles by blocking on the most important queue, it will only misplace the most important jobs.
Possible Solution:
Maybe check the private queue after recovering from a fetch exception?
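A rough sketch of what that mitigation could look like; the private-queue key name is hypothetical, since ReliableFetch's actual naming is internal to Sidekiq Pro:

```ruby
require "sidekiq"

# After recovering from a fetch exception, push any jobs stranded in this
# process's private queue back onto the public queue so another worker can
# pick them up. Key names are made up for illustration.
def recover_private_queue(private_queue, public_queue)
  Sidekiq.redis do |conn|
    while conn.rpoplpush(private_queue, public_queue)
      Sidekiq.logger.info("requeued an orphaned job from #{private_queue}")
    end
  end
end

recover_private_queue("queue:critical_worker1:private", "queue:critical")
```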