Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternative timeout mechanism worker operations (the work method) #343

Closed
szschaler opened this issue Feb 23, 2018 · 15 comments
Closed

Alternative timeout mechanism worker operations (the work method) #343

szschaler opened this issue Feb 23, 2018 · 15 comments

Comments

@szschaler
Copy link

Hi,

Could you please remove the use of Timeout::timeout from line 55 of worker.rb? This function is inherently unsafe and will potentially 'randomly' kill workers.

At the very least, would it be possible to default :timeout_job_after to 0 so that timeouts are switched off by default? Then people can decide for themselves whether they want to run the risks of Timeout::timeout in their code.

I've been hit by this a few times now and finally understood what's going on. Given that the timeout exceptions spring up in random places in my worker code (in fact, deep in some random Rails library) it took a long time to understand that it was actually Sneakers breaking stuff and not my own code. The ability to time out is fairly well hidden in the documentation, so I only found it when I specifically searched for timeout in connection to Sneakers.

Many thanks,

Steffen

@michaelklishin
Copy link
Collaborator

Not having timeouts is not great. Most users will never "decide for themselves", they will run in production without any timeouts. What options are available in Ruby these days?

@szschaler
Copy link
Author

Sorry for the delay in response; the notification had got hidden in the Junk folder for some reason.

Not sure there is a clean solution that doesn't require some sort of cooperation from the client code. If one absolutely wants to specify a global timeout for jobs (and I'm not convinced that one does), one could provide some sort of method client code could call to check whether it should consider itself cancelled. Then, client code could cleanly cancel itself -- in particular, it could clean up connections or locks etc.

The current implementation effectively causes an exception to be thrown at an arbitrary place in the client code, making it impossible for client code to do any clean up. As a result it is unsafe. The article I link to in my original post makes a very clear case for this problem.

Perhaps the cleanest solution is to defer this responsibility to clients. The developers of jobs should know if there are parts that are potentially long running (and potentially too long). They could then implement dedicated timeouts for these, which are well understood by the relevant low-level infrastructure---for example using network or connection timeouts.

@Tensho
Copy link
Contributor

Tensho commented Apr 16, 2018

https://jvns.ca/blog/2015/11/27/why-rubys-timeout-is-dangerous-and-thread-dot-raise-is-terrifying/

There is also cool repo about timeouts in the major Ruby gems
https://github.com/ankane/the-ultimate-guide-to-ruby-timeouts

@szschaler
Copy link
Author

Thanks, @victorhazbun and @Tensho for adding your view. @michaelklishin have these comments helped making this issue more relevant?

@michaelklishin
Copy link
Collaborator

michaelklishin commented May 30, 2018

@victorhazbun I cannot access that video even after I enter my email. Is it on YouTube? And please stop with the the "thank me later" part. It's in everyone's interest to improve open source software and the maintainers very rarely get thanked for it.

@szschaler there is no disagreement that finding an alternative would be great because Thread#raise on user-provided code makes it unpredictable for, well, the user. I am not a fan of blanket statements such as "the function is inherently dangerous" (Bunny uses it internally in a reasonable way that either doesn't affect user code or is predictable) and I don't see a whole lot of specific suggestions about what could replace it, though. This post outlines the problem well but then doesn't seem to mention any solutions available in Ruby.

I have seen many places in many distributed systems (tools. libraries, apps, anything) where not having a timeout meant very real operational problems. I am not a fan of "no timeouts, let the user decide" approach. They user won't decide, many will copy paste some code from the Internet and go into production with it, and then claim that the project "has issues", "hangs" and so on.

Ruby doesn't have the thread interruption mechanism of the JVM. Are there existing ways to emulate it? It's basically a flag threads check periodically but there's plenty of ways to get it wrong.

@michaelklishin
Copy link
Collaborator

@Tensho the repo is definitely useful but the only specific action from it I can think of is "inspect what a bunch of projects use and evaluate those options". Was that what you were getting at?

@michaelklishin
Copy link
Collaborator

michaelklishin commented May 30, 2018

One way for Sneakers to communicate a timeout without enforcing it would be to set a boolean property on the worker that the worker code will then supposedly check in the #work method. It has
a couple of things that I don't like:

  • Most users will never do it
  • It has to be reset before every message, so we are adding code that messes with user-provided objects on the hot path

This is similar to what's suggested in the JRuby lead maintainer's post:

If you want to be able to kill a thread, write its code such that it periodically checks whether it should terminate.

We don't want to kill any threads here but we do want to stop execution of a method. It can happen eventually.

@michaelklishin
Copy link
Collaborator

michaelklishin commented May 30, 2018

Take a closer look at how Timeout.timeout is used in Sneakers: Sneakers will handle the exception (so, user code is not expected to) and then call an error callback.

The only disruptive part here is that the code will stop executing. No exceptions will be thrown in the user's #work implementation.

Julia's post mentions that there are scenarios where this is not unreasonable:

during a network request (ok, as long as the surrounding code is prepared to catch Timeout::Error)

Of course, Sneakers cannot possible know what the worker does. If it's in the middle of settling an order or a financial transaction the timing can be very bad.

@michaelklishin michaelklishin changed the title Avoid use of Timeout::timeout when calling work method Alternative timeout mechanism worker operations (the work method) May 30, 2018
@michaelklishin
Copy link
Collaborator

michaelklishin commented May 30, 2018

The easiest solution that can be implemented in minutes is to significantly increase the default. A timeout of, say, 10 minutes is objectively well above of what most workers do. Would that be an improvement? It would prevent workers that deadlock/livelock/cannot make any progress otherwise from consuming the Bunny consumer dispatch thread pool threads forever (until the process restarts, at least). It also likely won't kick in for all but the most long-running operations.

It also will not satisfy the "OMG Timeout.timeout is D-A-N-G-E-R-O-U-S and must be cleansed with fire, details be damned!" crowd.

michaelklishin added a commit that referenced this issue May 30, 2018
See #343 for background. 5 seconds is an objectively low default. An alternative timeout
mechanism is still under consideration.
@szschaler
Copy link
Author

That's certainly a sensible start. It's still fundamentally broken, though: the real problem is that, because this is user code you're interrupting, you cannot know you're not interrupting a sensitive operation---say, one that has a resource opened, but not closed yet, leading to a resource leak downstream. If users aren't aware that this may happen, interrupting their code may actually be at least as bad as letting it hang.

The exception still is thrown inside user code, so may happen to be caught in there (that's how I originally found the issue). I don't necessarily think that that's a major issue, however: the problem really is in the code that just got interrupted and may have left things in an incomplete state.

Bumping the default timeout is certainly a sensible first step. Would it perhaps also make sense to at least flag this issue up prominently in the documentation and perhaps even link it to one of the pages linked above so that future users have an easier way of finding out and deciding whether this is appropriate for them and how to handle timeouts should they occur?

michaelklishin added a commit that referenced this issue May 30, 2018
@Tensho
Copy link
Contributor

Tensho commented May 31, 2018

@michaelklishin Appreciate your contribution to entire RMQ area and Ruby client particularly, you are strict and fair, I like it ^_^

... "inspect what a bunch of projects use and evaluate those options". Was that what you were getting at?

Yes

Sneakers will handle the exception (so, user code is not expected to) and then call an error callback.

Maybe I'm wrong, but I guess the problem shines if some library code inside the #work method is not ready to rescue Timeout::Error. For example, user calls ActiveRecord::Base#save! inside #work and ActiveRecord receives Timeout::Error in the middle of object preparation to be persisted. It's not prepared to rescue timeout exception. As you mentioned Sneakers worker will rescue timeout exception, but in an unpredictable way. User doesn't know exactly what should she do in an error handler – rollback transaction, reset counter, etc. Do you think user should analyze an exception backtrace to find the place of the explosion and react correspondingly?

I agree with you, most users don't care much about the edge cases until the total fail. I see the only proper way to outline timeouts handling in the documentation and try to spread the knowledge like Sidekiq did.

Also I'm too young to understand the kung-fu of Timeout.timeout usage in the bunny library ^_^ From my modest attempt to realise how does Bunny leverage it – there is a mix of Timeout.timeout and regular connection timeout. Correct or extend my flight of thought if I'm wrong.

Ruby doesn't have the thread interruption mechanism of the JVM

Actually Ruby has Thread.handle_interrupt method, but it's too complex to use properly. Even author says:

Asynchronous interrupts are difficult to use.
If you need to communicate between threads, please consider to use another way such as Queue.
Or use them with deep understanding about this method.

As you said, we are all interested in improving open source, so I just want to clarify the situtation and get better understanding of the problem.

@michaelklishin
Copy link
Collaborator

The docs will be updated (in fact, they are primarily in the repo wiki which should be editable by non-collaborators).

Bunny limits its use Timeout.timeout for its own code. User-submitted code (e.g. consumer blocks or instances) is not guarded by timeouts AFAIR. That's reasonable because a truly stuck app will eventually be treated as such due to the heartbeats mechanism or TCP keepalives. The same logic should apply to Sneakers.

We can get past this impasse by removing the timeout and letting the user do it in a more controlled manner. It's hard but at least not impossible to do this saferly. Honestly, a 10 minute timeout in practice is comparable to not having any. @Tensho do you agree to the complete timeout removal?

FTR, I rest my case that systems that do not have timeouts are fundamentally broken. But we have to work within certain limits here.

@Tensho
Copy link
Contributor

Tensho commented May 31, 2018

@michaelklishin I'm OK with removal, I even have a separate local branch with removed timeout, but stuck with tests, because Sneakers uses minitest and rr, which I'm not much familiar with (I'm rspec and rspec-mock fanboy). Unfortunately, I had to switch to Rails app development and stay aside of our async messaging system right now, so I'm not sure I will find the time to fix the subject and adjust docs in the nearest future. I may push my changes and maybe someone could pick up the work.

Anyway I'm happy we have consolidated decision, I guess @szschaler is about the same. Thank you 🙇

@szschaler
Copy link
Author

Many thanks for the good discussion. This does indeed sound like a good outcome. I'm happy to help review code etc, but may not be much good in contributing tests (though I may be able to review and revise existing tests) as I am also not really a minitest or rr person.

michaelklishin added a commit that referenced this issue Jun 3, 2018
See #343 for the background. Worker implementations now must handle
timeouts in a way they see fit. This is unfortunate but there is no
way to enforce timeouts in a safe way without at least some Worker implementation
cooperation and currently there is no decision on what that should look like.

We previously increased the timeout to 10 minutes. This is about as good
as not having any, so let's rip off the bandaid.

Closes #343.
@michaelklishin
Copy link
Collaborator

#354 is ready to go but I'd like to get a bit more feedback (especially from long time contributors), so lets wait for a few days. @szschaler @Tensho you are welcome to review it as soon as you have a moment, of course.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants