-
Notifications
You must be signed in to change notification settings - Fork 331
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alternative timeout mechanism worker operations (the work method) #343
Comments
Not having timeouts is not great. Most users will never "decide for themselves", they will run in production without any timeouts. What options are available in Ruby these days? |
Sorry for the delay in response; the notification had got hidden in the Junk folder for some reason. Not sure there is a clean solution that doesn't require some sort of cooperation from the client code. If one absolutely wants to specify a global timeout for jobs (and I'm not convinced that one does), one could provide some sort of method client code could call to check whether it should consider itself cancelled. Then, client code could cleanly cancel itself -- in particular, it could clean up connections or locks etc. The current implementation effectively causes an exception to be thrown at an arbitrary place in the client code, making it impossible for client code to do any clean up. As a result it is unsafe. The article I link to in my original post makes a very clear case for this problem. Perhaps the cleanest solution is to defer this responsibility to clients. The developers of jobs should know if there are parts that are potentially long running (and potentially too long). They could then implement dedicated timeouts for these, which are well understood by the relevant low-level infrastructure---for example using network or connection timeouts. |
https://jvns.ca/blog/2015/11/27/why-rubys-timeout-is-dangerous-and-thread-dot-raise-is-terrifying/ There is also cool repo about timeouts in the major Ruby gems |
Thanks, @victorhazbun and @Tensho for adding your view. @michaelklishin have these comments helped making this issue more relevant? |
@victorhazbun I cannot access that video even after I enter my email. Is it on YouTube? And please stop with the the "thank me later" part. It's in everyone's interest to improve open source software and the maintainers very rarely get thanked for it. @szschaler there is no disagreement that finding an alternative would be great because I have seen many places in many distributed systems (tools. libraries, apps, anything) where not having a timeout meant very real operational problems. I am not a fan of "no timeouts, let the user decide" approach. They user won't decide, many will copy paste some code from the Internet and go into production with it, and then claim that the project "has issues", "hangs" and so on. Ruby doesn't have the thread interruption mechanism of the JVM. Are there existing ways to emulate it? It's basically a flag threads check periodically but there's plenty of ways to get it wrong. |
@Tensho the repo is definitely useful but the only specific action from it I can think of is "inspect what a bunch of projects use and evaluate those options". Was that what you were getting at? |
One way for Sneakers to communicate a timeout without enforcing it would be to set a boolean property on the worker that the worker code will then supposedly check in the
This is similar to what's suggested in the JRuby lead maintainer's post:
We don't want to kill any threads here but we do want to stop execution of a method. It can happen eventually. |
Take a closer look at how Timeout.timeout is used in Sneakers: Sneakers will handle the exception (so, user code is not expected to) and then call an error callback. The only disruptive part here is that the code will stop executing. No exceptions will be thrown in the user's Julia's post mentions that there are scenarios where this is not unreasonable:
Of course, Sneakers cannot possible know what the worker does. If it's in the middle of settling an order or a financial transaction the timing can be very bad. |
The easiest solution that can be implemented in minutes is to significantly increase the default. A timeout of, say, 10 minutes is objectively well above of what most workers do. Would that be an improvement? It would prevent workers that deadlock/livelock/cannot make any progress otherwise from consuming the Bunny consumer dispatch thread pool threads forever (until the process restarts, at least). It also likely won't kick in for all but the most long-running operations. It also will not satisfy the "OMG Timeout.timeout is D-A-N-G-E-R-O-U-S and must be cleansed with fire, details be damned!" crowd. |
See #343 for background. 5 seconds is an objectively low default. An alternative timeout mechanism is still under consideration.
That's certainly a sensible start. It's still fundamentally broken, though: the real problem is that, because this is user code you're interrupting, you cannot know you're not interrupting a sensitive operation---say, one that has a resource opened, but not closed yet, leading to a resource leak downstream. If users aren't aware that this may happen, interrupting their code may actually be at least as bad as letting it hang. The exception still is thrown inside user code, so may happen to be caught in there (that's how I originally found the issue). I don't necessarily think that that's a major issue, however: the problem really is in the code that just got interrupted and may have left things in an incomplete state. Bumping the default timeout is certainly a sensible first step. Would it perhaps also make sense to at least flag this issue up prominently in the documentation and perhaps even link it to one of the pages linked above so that future users have an easier way of finding out and deciding whether this is appropriate for them and how to handle timeouts should they occur? |
@michaelklishin Appreciate your contribution to entire RMQ area and Ruby client particularly, you are strict and fair, I like it ^_^
Yes
Maybe I'm wrong, but I guess the problem shines if some library code inside the I agree with you, most users don't care much about the edge cases until the total fail. I see the only proper way to outline timeouts handling in the documentation and try to spread the knowledge like Sidekiq did. Also I'm too young to understand the kung-fu of
Actually Ruby has Thread.handle_interrupt method, but it's too complex to use properly. Even author says:
As you said, we are all interested in improving open source, so I just want to clarify the situtation and get better understanding of the problem. |
The docs will be updated (in fact, they are primarily in the repo wiki which should be editable by non-collaborators). Bunny limits its use We can get past this impasse by removing the timeout and letting the user do it in a more controlled manner. It's hard but at least not impossible to do this saferly. Honestly, a 10 minute timeout in practice is comparable to not having any. @Tensho do you agree to the complete timeout removal? FTR, I rest my case that systems that do not have timeouts are fundamentally broken. But we have to work within certain limits here. |
@michaelklishin I'm OK with removal, I even have a separate local branch with removed timeout, but stuck with tests, because Sneakers uses Anyway I'm happy we have consolidated decision, I guess @szschaler is about the same. Thank you 🙇 |
Many thanks for the good discussion. This does indeed sound like a good outcome. I'm happy to help review code etc, but may not be much good in contributing tests (though I may be able to review and revise existing tests) as I am also not really a |
See #343 for the background. Worker implementations now must handle timeouts in a way they see fit. This is unfortunate but there is no way to enforce timeouts in a safe way without at least some Worker implementation cooperation and currently there is no decision on what that should look like. We previously increased the timeout to 10 minutes. This is about as good as not having any, so let's rip off the bandaid. Closes #343.
#354 is ready to go but I'd like to get a bit more feedback (especially from long time contributors), so lets wait for a few days. @szschaler @Tensho you are welcome to review it as soon as you have a moment, of course. |
Hi,
Could you please remove the use of
Timeout::timeout
from line 55 ofworker.rb
? This function is inherently unsafe and will potentially 'randomly' kill workers.At the very least, would it be possible to default
:timeout_job_after
to 0 so that timeouts are switched off by default? Then people can decide for themselves whether they want to run the risks ofTimeout::timeout
in their code.I've been hit by this a few times now and finally understood what's going on. Given that the timeout exceptions spring up in random places in my worker code (in fact, deep in some random Rails library) it took a long time to understand that it was actually Sneakers breaking stuff and not my own code. The ability to time out is fairly well hidden in the documentation, so I only found it when I specifically searched for timeout in connection to Sneakers.
Many thanks,
Steffen
The text was updated successfully, but these errors were encountered: