Supervisor restart backoff #1287
Conversation
I've been arguing against such stuff before, because it feels to me that supervisors that require that kind of backoff are not supervising processes correctly. I've written It's About The Guarantees on this topic before. The gist of it is that, for me, the supervision structure should encode the guarantees of its workers. Supervised processes provide guarantees in their initialization phase, not a best effort. This means that when you're writing a client for a database or service, you shouldn't need a connection to be established as part of the initialization phase unless you're ready to say it will always be available no matter what happens. In the case of connections you can expect to fail, the only guarantee you can make in the client process is that your client will be able to handle requests, but not that it will communicate with the database. It could, for example, return {error, not_connected} on all calls during failures. So what I gather when I see demands for such supervisors is usually that someone has encoded a guarantee that the client will be connected as part of their supervision structure, while fully understanding that they can't guarantee the connection.
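For illustration, a minimal sketch of the pattern described above, assuming a hypothetical db_client module with placeholder try_connect/0 and do_query/2 functions (none of these names come from the thread): init/1 only guarantees that the process can take requests, the connection is attempted and retried from inside the process, and calls return {error, not_connected} until it exists.

-module(db_client).
-behaviour(gen_server).
-export([start_link/0, query/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

query(Q) ->
    gen_server:call(?MODULE, {query, Q}).

init([]) ->
    self() ! connect,                          %% defer connecting; init/1 never blocks or fails on it
    {ok, #{conn => undefined}}.

handle_call({query, _Q}, _From, #{conn := undefined} = State) ->
    {reply, {error, not_connected}, State};    %% degrade instead of crashing
handle_call({query, Q}, _From, #{conn := Conn} = State) ->
    {reply, do_query(Conn, Q), State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info(connect, State) ->
    case try_connect() of
        {ok, Conn} ->
            {noreply, State#{conn := Conn}};
        {error, _Reason} ->
            erlang:send_after(1000, self(), connect),   %% retry later, inside the process itself
            {noreply, State}
    end.

%% Placeholders standing in for a real driver; assumptions for this sketch only.
try_connect() -> {error, econnrefused}.
do_query(_Conn, _Q) -> {ok, []}.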
If immediate restart of a child fails, delay further restart attempts in exponential increments from min_delay (at least 1 ms) up to max_delay. The maximum number of restarts is still decided by the intensity/period parameters, but this makes it possible to control how fast restarts will happen. Setting max_delay to zero disables delays. Before this change, it could easily happen that a supervisor would use up all restart attempts within less than a millisecond, not giving transient errors any time to be cleared. This could bring down a whole system due to something as simple as a TCP port being held briefly by the OS before being reusable again, or even Erlang's own process registry not freeing a name quickly enough due to multicore CPU scheduling.
Please read my answers in that old mail thread first. In cases like the ones I describe, it is simply not possible to handle this in a reasonable way from outside the supervisor. Neither is it reasonable to tell someone that yes, sure, they spent time setting up a supervision hierarchy, but it can actually fall over completely and bring down the system due to a small lag in resource management at the OS level. If there is to be any point at all to repeated supervision restarts, there needs to be a backoff.
I'll add that in the case of connection management, while the current best practice is to not do it in init and instead retry in your own process until you finally manage to connect, this implies that you have to reimplement this retry logic in all your processes (with or without a third-party library), for something that could be done easily at the supervisor level. I also don't think Erlang developers are warned enough about the possible issues. Something as simple as opening a file in init/2 could trigger many fast supervisor restarts if you run out of fds or there's a temporary disk error. Are you always that careful about what you call in init? I know I'm not. Sometimes you call third-party library functions without knowing whether they are safe to call; a quick glance into the OTP source shows me at least one such issue. I know that in an ideal world everyone is extra careful about every possible resource issue, but in the real world software is flawed and you can't account for every possible issue. I don't have a strong opinion about whether this should be the default (I think it should, with reasonable values), but it should at the very least be available when I flick a switch.
There are a few issues, which may not be solved or tracked elsewhere, that make me wonder whether Erlang/OTP would benefit from a new supervisor module, to avoid legacy problems. This list is not meant to be exhaustive, just things that seem like they will be a constant annoyance:
I know (1) above can't be fixed due to legacy usage of the current supervisor module; that has already been discussed on the bugs mailing list. (2) should be fixable, but is generally avoided when people know what they are doing; it is just something good to enforce to avoid broken source code. (3) may seem like it requires global state (ets), but ideally the parent supervisor could provide this information to the child, so the limit is enforced without global state.
@richcarl Also, one of the reasons I was thinking about the contents of the comment above (#1287 (comment)) is that we might also want a constant delay, to avoid the ambiguity of exponential growth (even though a low max_delay can obviously suppress that exponential growth). I'm not saying that it should be added now, since I am not trying to distract from this discussion or the changes as-is, just hoping that there is nothing preventing its addition later, if there is proper support for doing so.
@ferd Note that this is first and foremost about surviving brief transient problems, not 15 minutes of "NFS server not responding still trying". It's also about acquiring resources that are absolutely necessary for the process: a TCP server is no good if it can't get hold of the port it's supposed to listen to, and an Erlang registered process that cannot register its name cannot be used. In such cases, they are right to crash.

I'm also sure you're not advocating that all Erlang programmers should get in the habit of writing wait-and-retry loops around every call to register/2, or add calls to timer:sleep() at the start of each gen_server init() to make it more likely that their resources are available. That is not the Erlang/OTP way - we should be writing code for the happy path and rely on the OTP framework for the difficult parts of error handling.

The thing is that resources like these are managed outside the scope of the code; by the VM, or the OS. Suppose you have a supervised service that was working but encountered an obscure bug and crashed. The supervisor restarts it, but because of delays elsewhere, it is not able to re-acquire its critical resources just yet (even though you specified REUSEADDRESS, there can be a brief delay). Without backoff, the whole supervision tree will reach its restart limit within a moment (modern CPUs are fast...), and the whole thing gets escalated to a VM restart or even a full host reboot. All because nobody tried to slow down all that restarting for a millisecond or two.
A slightly more convoluted situation that I've also seen bring down a node is when one child (ch1) crashes repeatedly, causing the immediate supervisor (sup2) to give up, terminating the remaining children ch2-chN and escalating to the next supervisor (sup1), which tries to restart sup2 and everything under it. The shutdown of children is, however, not completely monitored and synchronous in a transitive way. If one of ch2-chN has linked subprocesses, those processes will have received exit signals but may still be alive when sup1 tries to restart sup2, and that may prevent the new instances of ch1-chN from starting up, for example because they are still holding a registered name, named ets table, or similar global resource. This kills the system.
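To make the failure mode concrete, here is a small hypothetical init/1 for one of the ch1-chN workers (the names ch1_worker and ch1_table are made up for this sketch): it claims global resources during initialization, so if the previous instance has been sent an exit signal but has not yet actually died, the restart itself fails.

init([]) ->
    %% Both calls raise badarg if the old, not-yet-dead instance still holds the resource.
    true = register(ch1_worker, self()),
    _Tab = ets:new(ch1_table, [named_table, public]),
    {ok, #{}}.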
Right, the use case of held names or tables is a very interesting one that is likely going to show up more often as more and more cores (and possibly more NIFs!) get to see widespread use; these race conditions could show up more and more. As much as I'd like to go "well, just do it in the terminate function", the argument breaks down easily there, since not all resources offer this possibility, and it would not be clean to unregister a name that was registered through the start options.

At this point, I'm still not convinced exponential or increasing backoffs are a good strategy to put in a supervisor (and backoff shows many strategies of increasing backoffs could be preferred). The way I see it, to work fine without impacting the ability of a supervisor to shut itself down in case of repeated failures, the max delay would have to be a function of the current intensity/period configuration. So if I have a supervisor whose tolerance is 3 restarts per 10 seconds (or 10,000 ms), I should be able to run all of my attempts fast enough to cause its death, in order to preserve the "guarantees" approach without making a general exponential backoff mechanism:

1> (timer:seconds(10) div 3).
3333

meaning that my dynamically chosen max delay should be 3333 ms in order to ensure an ability to kill my supervisor no matter what (a restart frequency of 0 should not even be a thing to look at in there). To avoid wide changes in behaviour, I still think that the restarts should be short; whether exponential or fibonacci growth is best I'm not sure, but that MaxR and MaxT value is a surefire top restart value to avoid messing up most of the supervisor semantics.

There's an interesting question of whether this should have an effect on apps that accidentally expect restarts to take place as fast as possible. I'm kind of wondering if leaving a default max restart delay value of 0 would be good (or just 1 ms to help alleviate a bit?), but then you could let the user override it with logic a bit like:

MaxRestartDelay = min(timer:seconds(MaxT) div MaxR, proplists:get_value(max_restart, L, Default)),

... which would still let the user pick a lower value while never exceeding the cap derived from MaxT and MaxR. In all cases, this requires some interesting thinking. We're entering what feels like a very uncomfortable non-deterministic solution to an already non-deterministic problem, which feels wrong with supervisors.
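As a sketch of the cap idea above (assuming a proplist of supervisor options and an illustrative max_restart option name, neither of which exists in the real supervisor module): the delay can never exceed the supervisor's period divided by its allowed restarts, so repeated failures can still trip the intensity/period limit.

%% MaxR restarts are tolerated per MaxT seconds; Opts is a proplist with an
%% optional user-supplied maximum delay (option name is illustrative).
max_restart_delay(MaxR, MaxT, Opts) when MaxR > 0 ->
    Cap = timer:seconds(MaxT) div MaxR,                     %% e.g. 10 s / 3 -> 3333 ms
    min(Cap, proplists:get_value(max_restart, Opts, Cap)).

For the 3-restarts-per-10-seconds example above, max_restart_delay(3, 10, []) yields 3333 ms, matching the shell calculation.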
To correct some old/false information: Erlang resources like registered names and ETS tables are cleaned up before any monitor (or link) 'DOWN' signal is sent. This was fixed in OTP R11.
…nstance to delay a restart (attempting to avoid spurious failures when the failure cause requires a short amount of time to be resolved) based on the suggestions of Richard Carlsson in erlang/otp#1287 . Fix service restarts to ensure all terminate functions have completed (or a kill exit has been sent) before restarting the service instance. Fix service configuration option timeout_terminate value output in log output of the service configuration.
@dgud That's not what I was referring to, but rather the case where a process that owns a resource has been sent an exit signal to kill it off, but hasn't yet been scheduled to process that signal, and is therefore still alive when someone tries to start a new process to take ownership of the same resource.
@ferd Yes, as runtime systems and OSes get more and more adapted to multicore, they will look more and more like clouds of services in separate threads on possibly separate cores. This means that all requests to the system like "release this resource" will turn into "ask the system to release this resource at its leisure" rather than being atomic and synchronous operations that don't return control until the operation has completed. You'll thus start to see more and more of these kinds of race situations at the platform level. Our code needs to get rid of assumptions about things happening immediately.

And I do think that the backoff needs to be part of the code that handles the retries, much like in a transaction manager, in Ethernet, or similar. It's a simple mechanism to ensure that if you restart more than once, you don't blow your entire restart quota within a microsecond - which undermines the whole idea of a supervision tree. Not having a backoff there is, well, a bug. There is no reason to think that if one restart attempt failed, another restart with the same parameters within the same fraction of time will magically work.

I see no need to overengineer this; I doubt very much that for this use case, you'll ever see any advantage of a fibonacci, square, or other sequence. With binary exponential backoff, you quickly home in on the necessary wait time without using too many restart attempts, when you have no other clues to guide you. (In practice, in the crashing cases that I have described, 1-2 iterations have been enough to make the problem just go away, so you hardly notice the repeated attempts.)

And by default, the restarts are short. It will try a few really short delays first: 0 ms, 1 ms, 2, 4, ... but if it keeps failing it also fairly quickly grows up to more "human-level" delays: 128 ms, 256 ms, before the default max limit of 500 ms. Even an app that assumes very fast restarts would need to burn through 5 failed restart attempts before even reaching 10 ms delays.

Using MaxT/MaxR as a hard limit might be a good idea though, for safety. If you, as you say, only want to tolerate 3 restarts per 10 seconds, you'd hit that intensity within about 3 ms (0 + 1 + 2) with the default delay settings. As I said in a previous comment, by default you'll use up your first 10 attempts within the first second. In 4 seconds, you'll be on the 12th restart and in 16 seconds on the 14th restart. This is well within the typical intensity limits usually used in supervisors, so yes, they will keep terminating as they should - that part I had already considered. But if you have kind-of-weird settings like 15+ restarts per 30 seconds, you'd end up with too few restarts in the time window to trigger the intensity limit. By respecting the max delay setting but also not allowing any delays longer than MaxT/MaxR (for 15 restarts per 30 seconds, that's 2000 ms per restart), not even those apps will have to be modified. I'll amend my patch.

By the way, do look at #1289 and see what you think about the way I have described how to get a grip on intensity/period settings.
Hello,

I agree with Richard's idea. I had similar problems recently. Process restarts were failing until the system crashed, because process keys were not unregistered before new processes tried to register themselves again. Any crash on the system was taking the entire node down with probability close to 1. The processes were registered using {via, gproc, Key} and the supervisors were configured with a shutdown timeout instead of brutal_kill.

Before this case, I tried to handle backoff while initializing worker processes. That was manageable for processes like connections to a database etc. But name registration is such a generic function that I would need to handle the delays in almost all the processes. At the time, I came up with a workaround like this:
start_link(RegName, Args) ->
    gen_fsm:start_link(wait_if_running(RegName, 10), ?MODULE, Args, []).

wait_if_running({via, Registry, Name} = RegName, WaitMS) ->
    case Registry:whereis_name(Name) of
        undefined ->
            RegName;
        Pid when is_pid(Pid) ->
            %% The old instance is still registered: give it a moment to go away.
            ok = timer:sleep(WaitMS),
            RegName
    end.
It solved the problem at the time, although it is weird. I think a supervisor with backoff support would be a much more adequate solution.
Karolis
This is just to let you know that we are working on this, it is not forgotten. There is an old OTP Technical Board decision on this functionality which says "allow it, but don't make it default" - but that was in 2012, and it is probably a good idea to refresh the memory and have another Technical Board meeting about it.
I've been updating the calculation of the delay cap as Fred suggested, adding a couple more tests, and modifying how the delay is done so as to handle all cases. Work in progress (not pushed yet).
That's fine. Please add commits only, i.e. do not amend. We can squash commits later if needed. Thanks!
@richcarl, in this comment, do you mean a supervisor tree like this:
@sirihansen I think that matches what I was describing.
@richcarl Ok, would it be possible to explain this design choice a bit more? It is normally not a recommended structure. Maybe if the
That structure is not something I'd recommend either; it's just an example of the kind of situation I've seen in real-world code bringing down a system. The more general problem is with any kind of resource that is outside the control of the supervised processes themselves (or even outside ERTS), like a TCP port managed by the OS. It could even be hardware that has some short delay before it allows you to reuse it.
Ok, thanks. Just wanted to have all available info before the technical board meeting next week.
First of all, we do not have a decision on this case yet. But we have discussed it a bit, and here are our comments and concerns so far.

We feel that there is a risk that the proposed solution will cause new problems in other situations. Although it solves one particular problem, it will affect some other scenarios, and we cannot foresee all the consequences.

We do agree that the current behaviour on failed restarts (using up all restart attempts in a very short time) may be a bad thing when facing transient problems of allocating resources. It may also be the case that such problems occur more and more often due to fast machines and multicore. And we need to consider whether we should have a standardized way of handling problems of this character. There are many problems related to waiting for some resource. No doubt, other projects will find the backoff mechanism useful in solving their problem, possibly needing some small change or addition.

So we are also discussing whether we need to aim for a bigger set of problems, and whether we should really implement something, or if it is design rules that we need - and pointers to solutions and libraries.

Some comments and questions on the implementation/solution:
In summary, we are a bit reluctant to include this solution since it does not eliminate the possibility of the problem occurring, it only reduces the probability of it happening. And as I said, we haven't yet decided whether or not to include the PR. We would really like more discussion around it to help us make the correct decision. If possible, we would like more use cases.
Wouldn't it then make more sense to release the implementation in this PR as a library, to iron out some of the possible issues? As of Erlang/OTP 19, any supervisor module can be used at the root of the application tree, so there shouldn't be any blocker to building such a supervisor as a separate library.
I did write such a library. I think the backoff strategy treats the symptoms and not the cause in most situations: a supervisor restarts a child without guaranteeing that the neighbours of the dead child have finished exiting/cleaning up. However, I don't think there is a reliable way to inform a supervisor of a child's neighbours.

One issue I have seen is when one application is slow to restart a child (perhaps because the error bubbled up its supervision tree and the tree is blocked restarting earlier children), while a child in another application quickly churns through its restart intensity.

I think several, but not all, of these situations could be solved with await_registered/down functions in a library.
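As an illustration of the kind of helper being suggested (the name await_unregistered, its arguments, and the 10 ms poll interval are all hypothetical, not an existing API): wait, with a deadline, for a locally registered name to become free before the new instance tries to claim it.

%% Poll until Name is no longer locally registered, or give up after TimeoutMs.
await_unregistered(Name, TimeoutMs) ->
    Deadline = erlang:monotonic_time(millisecond) + TimeoutMs,
    await_unregistered_loop(Name, Deadline).

await_unregistered_loop(Name, Deadline) ->
    case whereis(Name) of
        undefined ->
            ok;
        Pid when is_pid(Pid) ->
            case erlang:monotonic_time(millisecond) >= Deadline of
                true ->
                    {error, {still_registered, Pid}};
                false ->
                    timer:sleep(10),               %% short poll interval before checking again
                    await_unregistered_loop(Name, Deadline)
            end
    end.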
You can use Director too. When a child process crashes, Director calls a callback function which should tell it how to deal with the crash. The possible options are: restart the child, restart the child after a time interval, delete the child from the children, do nothing, or terminate the supervisor itself.
I'm closing this for now, as we don't think this is important enough to spend time on at the moment and there are too many unknowns to make a quick decision.
We have encountered situations in production where a supervisor would use up all restart attempts within less than a millisecond, not giving transient errors any time to be cleared. This can bring down a whole system due to something as simple as a TCP port being held briefly by the OS before being reusable again, or even due to Erlang's own process registry not freeing a name quickly enough because of multicore CPU scheduling.
To fix this problem, this patch adds delayed child restart with incremental back-off. (This has been reworked from an older version, see http://erlang.org/pipermail/erlang-patches/2012-January/002575.html, and now piggy-backs on the more recent mechanism in supervisor.erl (e6e3179) which already ensures that each new restart is passed via the message dispatch.)
If immediate restart of a child fails, this delays further restart attempts in exponential increments from min_delay (at least 1 ms) up to max_delay. The maximum number of restarts is still decided by the intensity/period parameters, but this makes it possible to control how fast restarts will happen. Setting max_delay to zero disables delays.
The default value of max_delay has been set to 500 ms, i.e., restart attempts never come slower than two per second, which seems a reasonable balance between restarting as soon as possible and not retrying too quickly. Note, though, that unless you raise min_delay, the delay only reaches that level if restart attempts after (approx.) 1 ms, 2 ms, 4 ms, 8 ms, 16 ms, etc. have all failed; this takes about 1 s and up to 10 restart attempts, if the intensity/period setting allows it.
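For illustration only, a sketch of the delay progression this description implies (the actual patch may differ in detail): the first restart is immediate, the first failed retry starts at min_delay (at least 1 ms), and each further failure doubles the delay up to max_delay; max_delay = 0 turns delays off entirely.

%% PrevDelay is the delay used before the last failed restart (0 for the
%% first, immediate attempt); all values are in milliseconds.
next_delay(_PrevDelay, _MinDelay, 0) ->
    0;                                        %% max_delay = 0 disables delays entirely
next_delay(0, MinDelay, MaxDelay) ->
    min(max(MinDelay, 1), MaxDelay);          %% first failed retry: start at min_delay (>= 1 ms)
next_delay(PrevDelay, _MinDelay, MaxDelay) ->
    min(PrevDelay * 2, MaxDelay).             %% then double, capped at max_delay

With the defaults (min_delay 1 ms, max_delay 500 ms) this gives 1, 2, 4, ..., 256, 500 ms, i.e. roughly one second and up to ten attempts before settling at the 500 ms cap, which matches the description above.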
Because we consider the current behaviour to be a bug, one that makes supervision trees completely useless in these naturally occurring cases where a transient problem blocks a restart, we think that the backoff mechanism should be enabled by default and not have to be explicitly enabled.