
Supervisor restart backoff #1287

Closed
richcarl wants to merge 5 commits

Conversation

richcarl (Contributor)

We have encountered situations in production where a supervisor would use up all its restart attempts within less than a millisecond, giving transient errors no time to clear. This can bring down a whole system due to something as simple as a TCP port being held briefly by the OS before it becomes reusable again, or even Erlang's own process registry not freeing a name quickly enough because of multicore CPU scheduling.

To fix this problem, this patch adds delayed child restart with incremental back-off. (This has been reworked from an older version, see http://erlang.org/pipermail/erlang-patches/2012-January/002575.html, and now piggy-backs on the more recent mechanism in supervisor.erl (e6e3179) which already ensures that each new restart is passed via the message dispatch.)

If immediate restart of a child fails, this delays further restart attempts in exponential increments from min_delay (at least 1 ms) up to max_delay. The maximum number of restarts is still decided by the intensity/period parameters, but this makes it possible to control how fast restarts will happen. Setting max_delay to zero disables delays.

The default value of max_delay has been set to 500 ms, i.e., restarts never get slower than two attempts per second, which seems a reasonable balance between restarting as soon as possible and not retrying too quickly. Note though that unless you raise min_delay, it only reaches that level after restart attempts at (approximately) 1 ms, 2 ms, 4 ms, 8 ms, 16 ms, etc. have all failed; this takes about 1 s and up to 10 restart attempts, if the intensity/period setting allows it.
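
As an illustration of that progression, here is a minimal sketch only; next_delay/3 is an invented helper showing the described behaviour, not the patch's actual code:

%% Doubles the previous delay and caps it at MaxDelay; a MaxDelay of 0
%% disables delays entirely.  With the defaults (MinDelay = 1, MaxDelay = 500)
%% this yields 1, 2, 4, ..., 256, 500, 500, ... milliseconds.
next_delay(_PrevDelay, _MinDelay, 0) -> 0;
next_delay(0, MinDelay, MaxDelay) -> min(max(MinDelay, 1), MaxDelay);
next_delay(PrevDelay, _MinDelay, MaxDelay) -> min(PrevDelay * 2, MaxDelay).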

Because we consider the current behaviour to be a bug - it makes supervision trees completely useless in these naturally occurring cases of transient problems blocking a restart - we think that the backoff mechanism should be enabled by default, rather than having to be explicitly enabled.

@ferd (Contributor) commented Dec 15, 2016

I've argued against this kind of thing before, because it feels to me that supervisors that require that kind of backoff are not supervising their processes correctly.

I've written It's About The Guarantees on this topic before. The gist of it is that, to me, the supervision structure should encode the guarantees of its workers.

Supervised processes provide guarantees in their initialization phase, not a best effort. This means that when you're writing a client for a database or service, you shouldn't need a connection to be established as part of the initialization phase unless you're ready to say it will always be available no matter what happens.

In the case of connections you can expect to fail, the only guarantee you can make in the client process is that your client will be able to handle requests, but not that it will communicate to the database. It could return {error, not_connected} on all calls during failures for example.
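
For illustration, a minimal sketch of that pattern (the module name db_client and the helpers try_connect/0 and do_query/2 are invented for the example): the client comes up without a connection, keeps retrying in the background, and answers {error, not_connected} in the meantime:

-module(db_client).
-behaviour(gen_server).
-export([start_link/0, query/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() -> gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).
query(Q) -> gen_server:call(?MODULE, {query, Q}).

init([]) ->
    self() ! connect,  %% connect only after init/1 has returned
    {ok, #{conn => undefined}}.

handle_call({query, _Q}, _From, #{conn := undefined} = State) ->
    %% the guarantee is that we answer, not that we are connected
    {reply, {error, not_connected}, State};
handle_call({query, Q}, _From, #{conn := Conn} = State) ->
    {reply, do_query(Conn, Q), State}.

handle_cast(_Msg, State) -> {noreply, State}.

handle_info(connect, State) ->
    case try_connect() of
        {ok, Conn} -> {noreply, State#{conn := Conn}};
        {error, _} -> erlang:send_after(1000, self(), connect),
                      {noreply, State}
    end;
handle_info(_Msg, State) ->
    {noreply, State}.

%% Stand-ins for the real driver calls.
try_connect() -> {error, econnrefused}.
do_query(_Conn, _Q) -> ok.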

So what I gather when I see demands for such supervisors is usually that someone has encoded a guarantee that the client will be connected as part of their supervision structure, while fully understanding that they can't guarantee the connection.

If immediate restart of a child fails, delay further restart attempts in
exponential increments from min_delay (at least 1 ms) up to max_delay. The
maximum number of restarts is still decided by the intensity/period
parameters, but this makes it possible to control how fast restarts will
happen. Setting max_delay to zero disables delays.

Before this change, it could easily happen that a supervisor would use up
all restart attempts within less than a millisecond, not giving transient
errors any time to be cleared. This could bring down a whole system due to
something as simple as a TCP port being held briefly by the OS before being
reusable again, or even Erlang's own process registry not freeing a name
quickly enough due to multicore CPU scheduling.
richcarl force-pushed the supervisor-restart-backoff branch from 8c91036 to 8f0ad40 on December 15, 2016, 17:27
@richcarl (Contributor, Author)

Please read my answers in that old mail thread first. In cases like the ones I describe, it is simply not possible to handle this in a reasonable way from outside the supervisor. Nor is it reasonable to tell someone that yes, sure, they spent time setting up a supervision hierarchy, but it can actually fall over completely and bring down the system due to a small lag in resource management at the OS level. If there is to be any point at all in repeated supervision restarts, there needs to be a backoff.

@essen (Contributor) commented Dec 15, 2016

I'll add that in the case of connection management, while currently the best practice is to not do it in init and retry in your own process until you finally manage to connect, this implies that you have to reimplement this retry logic in all your processes (with or without a third party library), for something that could be done easily at the supervisor level.

I also don't think Erlang developers are warned enough about possible issues. Something as simple as opening a file in init/2 could trigger many fast supervisor restarts if you run out of fds or there's a temporary disk error. Are you always that careful about what you call in init? I know I'm not. Sometimes you call third party library functions without knowing whether they are safe to call. A quick glance into OTP source shows me at least one issue in the auth module; the cookie file is read on init.

I know that in an ideal world everyone is extra careful about every possible resource issue, but in the real world software is flawed and you can't account for every possible issue.

I don't have a strong opinion about whether this should be the default (I think it should be, with reasonable values), but it should at the very least be available when I flick a switch.

@okeuday (Contributor) commented Dec 15, 2016

There are a few issues which may not be solved or tracked elsewhere which make me wonder if Erlang/OTP would benefit from a new supervisor module, to avoid legacy problems. This list is not meant to be exhaustive, just things that seem like they will be a constant annoyance:

  1. child specs of transient children never get removed automatically (after normal termination), making them not really transient
  2. Shutdown > MaxT / MaxR will cause infinite restarts (I assume this hasn't been fixed yet)
  3. Nothing currently limits MaxT based on the supervision hierarchy structure, though there is always a limit imposed by the parent supervisor, which currently has to be managed manually (either way, it will always limit the height of a supervision hierarchy)

I know (1) above can't be fixed due to legacy usage of the current supervisor module, and that has already been discussed on the bugs mailing list. (2) should be fixable, but is generally avoided when people know what they are doing; it is just something good to enforce to avoid broken source code. (3) may seem like it requires global state (ets), but ideally the parent supervisor could provide this information to the child, so the limit is enforced without global state.

@okeuday (Contributor) commented Dec 15, 2016

@richcarl Also, one of the reasons I was thinking about the comment above (#1287 (comment)) is that we might also want a constant delay, to avoid the ambiguity of exponential growth (even though a low max_delay can obviously suppress that growth). I'm not saying it should be added now, since I am not trying to distract from this discussion or the changes as-is; I'm just hoping that there is nothing preventing its addition later, if there is proper support for doing so.

@richcarl (Contributor, Author)

@ferd Note that this is first and foremost about surviving brief transient problems, not 15 minutes of "NFS server not responding, still trying". It's also about acquiring resources that are absolutely necessary for the process: a TCP server is no good if it can't get hold of the port it's supposed to listen on, and an Erlang registered process that cannot register its name cannot be used. In such cases, they are right to crash. I'm also sure you're not advocating that all Erlang programmers should get into the habit of writing wait-and-retry loops around every call to register/2, or adding calls to timer:sleep() at the start of each gen_server init() to make it more likely that their resources are available. That is not the Erlang/OTP way - we should be writing code for the happy path and relying on the OTP framework for the difficult parts of error handling.

The thing is that resources like these are managed outside the scope of the code: by the VM, or by the OS. Suppose you have a supervised service that was working but encountered an obscure bug and crashed. The supervisor restarts it, but because of delays elsewhere, it is not able to re-acquire its critical resources just yet (even though you specified reuseaddr, there can be a brief delay). Without backoff, the whole supervision tree will reach its restart limit within a moment (modern CPUs are fast...), and the whole thing gets escalated to a VM restart or even a full host reboot. All because nobody tried to slow down all that restarting for a millisecond or two.
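
For a concrete picture (an illustrative sketch, not code from the patch), this is the kind of init/1 that is perfectly reasonable yet dies instantly while the OS still holds the port:

init([Port]) ->
    %% If the OS has not released the port yet, this returns
    %% {error, eaddrinuse} and the child stops; without any backoff the
    %% supervisor retries immediately and burns through its restart quota.
    case gen_tcp:listen(Port, [binary, {reuseaddr, true}, {active, false}]) of
        {ok, LSock} -> {ok, #{lsock => LSock}};
        {error, Reason} -> {stop, Reason}
    end.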

@richcarl (Contributor, Author)

A slightly more convoluted situation that I've also seen bring down a node is when one child (ch1) crashes repeatedly, causing the immediate supervisor (sup2) to give up, terminating the remaining children ch2-chN and escalating to the next supervisor (sup1), which tries to restart sup2 and everything under it.

The shutdown of children is, however, not completely monitored and synchronous in a transitive way. If one of ch2-chN has linked subprocesses, those processes will have received exit signals but may still be alive when sup1 tries to restart sup2, and that may prevent the new instances of ch1-chN from starting up, for example because they are still holding a registered name, a named ETS table, or a similar global resource. This kills the system.

@ferd (Contributor) commented Dec 16, 2016

Right, the use case of held names or tables is a very interesting one that is likely to show up more often as more and more cores (and possibly more NIFs!) see widespread use; these race conditions could show up more and more.

As much as I'd like to say "well, just do it in the terminate function", that argument breaks down easily here, since not all resources allow for this, and it would not be clean to unregister a name that was registered through the gen module either. But maybe that is a bit of an oversight in the gen module, if that unregistration is not attempted as a cleanup phase and is instead left to asynchronous detection (which still doesn't get rid of the ETS issue, though).

At this point, I'm still not convinced exponential or increasing backoffs are a good strategy to put in a supervisor (and the backoff library shows that many strategies of increasing backoff could be preferred).

The way I see it, to work fine without impacting the ability of a supervisor to shut itself down in case of repeated failures, the max delay would have to be a function of the current MaxT and MaxR values, such that even at max delay, a child that keeps failing within the allowed range can still bring about the supervisor's demise.

So if I have a supervisor whose tolerance is 3 restarts per 10 seconds (or 10,000 ms), I should be able to run through all of my attempts fast enough to cause its death, in order to preserve the "guarantees" approach without making a general exponential backoff mechanism:

1> (timer:seconds(10) div 3).
3333

meaning that my dynamically chosen max delay should be 3333 ms in order to ensure an ability to kill my supervisor no matter what (a restart frequency of 0 should not even be a thing to consider here).

To avoid wide changes in behaviours, I still think that the restarts should be short; whether exponential or Fibonacci growth is best I'm not sure, but that MaxR/MaxT-derived value is a surefire upper bound for the restart delay that avoids messing up most of the supervisor semantics.

There's an interesting question on whether this should have an effect on apps that accidentally expect restarts to take place as fast as possible.

I'm kind of wondering if leaving the default max restart delay at 0 would be good (or just 1 ms to help alleviate things a bit?), but then you could let the user override it with logic a bit like:

%% L is assumed to be a proplist of user-supplied options; the requested
%% delay is capped by the supervisor's own restart window (MaxT per MaxR)
MaxRestartDelay = min(timer:seconds(MaxT) div MaxR, proplists:get_value(max_restart, L, Default)),
...

which would still let:
a.1) defaults remain unchanged if Default=0
a.2) defaults look similar, but the race conditions are caught more frequently with Default=1
b) people have the ability to increase the value as they see fit
c) people do not have the ability to increase the value in a way that destroys restart tolerance

In all cases, this requires some interesting thinking. We're entering what feels like a very uncomfortable non-deterministic solution to an already nondeterministic problem, which feels wrong with supervisors.

@dgud (Contributor) commented Dec 16, 2016

To correct some old/false information: Erlang resources like registered names and ETS tables are cleaned up before any monitor (or link) down signal is sent. This was fixed in OTP R11.

okeuday added commits to CloudI/CloudI and CloudI/cloudi_core that referenced this pull request Dec 16, 2016:
…nstance to delay a restart (attempting to avoid spurious failures when the failure cause requires a short amount of time to be resolved) based on the suggestions of Richard Carlsson in erlang/otp#1287 . Fix service restarts to ensure all terminate functions have completed (or a kill exit has been sent) before restarting the service instance. Fix service configuration option timeout_terminate value output in log output of the service configuration.
@richcarl (Contributor, Author)

@dgud That's not what I was referring to; I meant the case when a process that owns a resource has been sent an exit signal to kill it off but hasn't yet been scheduled to process that signal, and is therefore still alive when someone tries to start a new process to take ownership of the same resource.

@richcarl (Contributor, Author)

@ferd Yes, as runtime systems and OSes get more and more adapted to multicore, they will look more and more like clouds of services in separate threads on possibly separate cores. This means that requests to the system like "release this resource" will turn into "ask the system to release this resource at its leisure" rather than being atomic, synchronous operations that don't return control until the operation has completed. You'll thus start to see more and more of these kinds of race situations at the platform level. Our code needs to get rid of assumptions about things happening immediately.

And I do think that the backoff needs to be part of the code that handles the retries, much like in a transaction manager, in Ethernet, or similar. It's a simple mechanism to ensure that if you restart more than once, you don't blow your entire restart quota within a microsecond - which undermines the whole idea of a supervision tree. Not having a backoff there is, well, a bug. There is no reason to think that if one restart attempt failed, another restart with the same parameters within the same fraction of time will magically work.

I see no need to overengineer this; I doubt very much that for this use case you'll ever see any advantage of a Fibonacci, square, or other sequence. With binary exponential backoff, you quickly home in on the necessary wait time without using too many restart attempts, when you have no other clues to guide you. (In practice, in the crashing cases that I have described, 1-2 iterations have been enough to make the problem just go away, so you hardly notice the repeated attempts.)

And by default, the restarts are short. It will try a few really short delays first: 0 ms, 1 ms, 2, 4, ... but if it keeps failing it also fairly quickly grows up to more "human-level" delays: 128 ms, 256 ms, before the default max limit of 500 ms. Even an app that assumes very fast restarts would need to burn through 5 failed restart attempts before even reaching 10 ms delays.

Using MaxT/MaxR as a hard limit might be a good idea though, for safety. If you, as you say, only want to tolerate 3 restarts per 10 seconds, you'd hit that intensity within about 3 ms (0 + 1 + 2) with the default delay settings. As I said in a previous comment, by default you'll use up your first 10 attempts within the first second. In 4 seconds, you'll be on the 12th restart and in 16 seconds on the 14th restart. This is well within the typical intensity limits usually used in supervisors, so yes, they will keep terminating as they should - that part I had already considered. But if you have kind-of-weird settings like 15+ restarts per 30 seconds, you'd end up with too few restarts in the time window to trigger the intensity limit. By respecting the max delay setting but also not allowing any delays longer than MaxT/MaxR (for 15 restarts per 30 seconds, that's 2000 ms per restart), not even those apps will have to be modified. I'll amend my patch.
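
In code, that cap could look something like this (an assumed shape for illustration, not the final patch; effective_max_delay/3 is an invented name):

%% MaxT is the period in seconds, MaxR the allowed number of restarts;
%% never delay longer than the supervisor's own window per restart, so
%% the intensity/period limit can still trigger.
effective_max_delay(MaxDelay, MaxR, MaxT) ->
    min(MaxDelay, timer:seconds(MaxT) div MaxR).

%% effective_max_delay(500, 3, 10)   -> 500   (3333 ms window, default cap applies)
%% effective_max_delay(5000, 15, 30) -> 2000  (capped by the 2000 ms per-restart window)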

By the way, do look at #1289 and see what you think about the way I have described how to get a grip on intensity/period settings.

@kape1395 (Contributor) commented Dec 19, 2016 via email

@sirihansen (Contributor)

This is just to let you know that we are working on this; it is not forgotten. There is an old OTP Technical Board decision on this functionality that says "allow it, but don't make it default" - but that was in 2012, and it is probably a good idea to refresh the memory and have another Technical Board meeting about it.

@richcarl (Contributor, Author) commented Jan 4, 2017

I've been updating the calculation of the delay cap as Fred suggested, adding a couple more tests, and modifying how the delay is done so as to handle all cases. Work in progress (not pushed yet).

@sirihansen (Contributor)

That's fine. Please add commits only, i.e. do not amend. We can squash commits later if needed. Thanks!

@sirihansen (Contributor)

@richcarl, in this comment, do you mean a supervisor tree like this:

[image: supervisortree1 - supervision tree diagram]

@richcarl (Contributor, Author)

@sirihansen I think that matches what I was describing.

@sirihansen (Contributor)

@richcarl Ok, would it be possible to explain this design choice a bit more? It is normally not a recommended structure. Maybe if the subprocess is short-lived, for the purpose of executing a specific job, but not with any vital resources bound to it. The recommended way would be to add the subprocess as another child under sup2, or even with a third supervisor in between. But I guess you have your reasons?

@richcarl (Contributor, Author)

That structure is not something I'd recommend either, just an example of the kind of situation I've seen in real-world code bringing down a system. The more general problem is with any kind of resource that is outside the control of the supervised processes themselves (or even outside ERTS), like a TCP port managed by the OS. It could even be hardware that has some short delay before it allows you to reuse it.

@sirihansen (Contributor)

Ok, thanks. Just wanted to have all available info before the technical board meeting next week.

@sirihansen (Contributor)

First of all, we do not have a decision on this case yet. But we have discussed a bit, and here are our comments and concerns so far.

We feel that there is a risk that the proposed solution will cause new problems in other situations. Although it solves one particular problem, it will affect some other scenarios, and we cannot disregard them by just saying "you should not do this" - that does not work. We need more concrete examples of the problem domain to understand if there might be another solution, or a variant of the proposed one, that could be desirable to implement and that has clearer implications for all scenarios.

We do agree that the current behavior on failed restarts (using up all restart attempts in a very short time) may be a bad thing when facing transient problems of allocating resources.

It may also be the case that such problems occur more and more often due to fast machines and multicore. And we need to consider if we should have a standardized way of handling problems of this character.

There are many problems related to waiting for some resource. No doubt other projects will find the backoff mechanism useful in solving their problem, possibly needing some small change or addition (features like the ones in Fred's 'backoff' library, maybe). How should we meet such requests? Can we draw a clear line between cases where it is useful/intended and cases where it should not be used?

So we are also discussing whether we need to aim for a bigger set of problems, and whether we should really implement something, or if it is design rules that we need - along with pointers to solutions and libraries used for similar problems.

Some comments and questions on the implementation/solution:

  • The backoff mechanism will affect all crashes during restart, not only those caused by brief transient problems. This will cause a longer time before escalating due to a random crash during restart. Do we need to distinguish between cases where it is meaningful to back off and others?
  • If escalating to the next level of supervisor, you will not get into the backoff mechanism in the current supervisor again, so backoff must be configured on each supervisor level. Or should the mechanism be used for the first start of a child as well?
  • The same backoff parameters are used for all children under a supervisor. Children might be of very different characters, so we think that the parameters should belong to the child instead (see the sketch after this list).
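
One conceivable shape for that last point (purely hypothetical; the min_delay and max_delay child-spec keys below are not part of this PR or of OTP) would be to move the parameters into the child spec map:

ChildSpec = #{id        => db_worker,
              start     => {db_worker, start_link, []},
              restart   => permanent,
              shutdown  => 5000,
              min_delay => 1,    %% hypothetical per-child backoff floor (ms)
              max_delay => 500}. %% hypothetical per-child backoff cap (ms)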

In summary, we are a bit reluctant to include this solution since it does not eliminate the possibility of the problem occurring; it only reduces the probability of it happening.

And as I said, we haven't yet decided whether or not to include the PR. We would really like more discussion around it to help us make the correct decision. If possible, we would like more use cases, specific ones, with examples of what it would take to solve the problem in a different way. And I hope we can keep having a constructive dialogue to make things clearer and ensure a prudent solution.

@josevalim (Contributor)

We feel that there is a risk that the proposed solution will cause new problems in other situations. Although it solves one particular problem, it will affect some other scenarios, and we cannot disregard them by just saying "you should not do this" - that does not work.

Wouldn't it then make more sense to release the implementation in this PR as a library, to iron out some of the possible issues? As of Erlang/OTP 19, any supervisor module can be used at the root of the application tree, so there shouldn't be any blocker to building a backoff_supervisor as a third-party library.

@fishcakez (Contributor)

I did write a backoff_supervisor library a couple of years ago. To get consistent behaviour I added a number of limitations, listed in the readme: https://github.com/fishcakez/backoff_supervisor#backoff_supervisor. Unfortunately I never got comfortable using it, so I probably made sub-optimal design choices. I stopped using it within weeks and switched to manually handling it in the child processes.

I think the backoff strategy treats the symptoms and not the cause in most situations: a supervisor restarts a child without guaranteeing that the neighbours of the dead child have finished exiting/cleaning up. However I don't think there is a reliable way to inform a supervisor of a child's neighbours.

One issue I have seen is when one application is slow to restart a child (perhaps because the error bubbled up its supervision tree and the tree is blocked restarting earlier children), while a child in another application quickly churns through its restart intensity.

I think several, but not all, of these situations could be solved with await_registered/down functions in an init/1 call. For most of the other situations, the problems depend on factors outside the control of the VM, and these are best approached using the "It's About The Guarantees" post by @ferd, because the supervision tree is not able to do anything to return the application to a good state - the bad state is outside of the VM.
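
A minimal sketch of such a helper (the name await_down/2 is invented here, not an existing OTP function): block briefly in init/1 until the previous owner of a registered name is really gone, instead of crashing on the name clash:

await_down(Name, Timeout) ->
    case whereis(Name) of
        undefined ->
            ok;
        Pid ->
            %% monitoring an already-dead process delivers 'DOWN' immediately
            MRef = erlang:monitor(process, Pid),
            receive
                {'DOWN', MRef, process, Pid, _Reason} -> ok
            after Timeout ->
                erlang:demonitor(MRef, [flush]),
                {error, timeout}
            end
    end.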

sirihansen added the "waiting" label (waiting for changes/input from author) on Jul 7, 2017
@pouriya (Contributor) commented Feb 20, 2018

You can use Director too. When a child process crashes, Director calls a callback function which should tell it how to deal with the crash. Possible options are: restart the child, restart the child after a time interval, delete the child from the children, do nothing, or terminate the supervisor itself.

uabboli added the "team:VM" label (Assigned to OTP team VM) and removed the "team:MW" label on Aug 21, 2019
@garazdawi (Contributor)

I'm closing this for now as we don't think this is important enough to spend time on at the moment and there are too many unknowns to make a quick decision.

garazdawi closed this on Feb 25, 2020