[erlang-patches] delayed child restart with incremental back-off

Wed Jan 4 02:00:37 CET 2012

On 01/03/2012 02:55 PM, Richard Carlsson wrote:
> Asynchronous delayed child restart with incremental back-off
>
> Use a queue instead of a list to track restarts (avoiding linear time complexity), and delay restarting of children incrementally from min_delay (at least 1 ms) up to max_delay if an immediate restart fails. The maximum number of restarts is still decided by the intensity/period parameters, but this makes it possible to control how fast (or slow) restarts will happen.
>
>  git fetch git://github.com/richcarl/otp.git supervisor-restart-delay
>
> This fixes the problem when a supervisor tries to restart a crashed child but for various reasons (in particular, because modern Erlang on a multicore machine has way more concurrency going on) some resources that were held by the dead process have not been released quite yet - this can happen even with a simple registered name - and thus the restart fails, and immediately fails again, and again, until the supervisor gives up and shuts down. This can *bring down the whole node*, because the restart attempts get exhausted within an extremely short time period, even if you have multiple supervisors. Usually, waiting for just a millisecond is enough to let the resource be released so the process can restart, but sometimes a bit longer is needed. This patch implements an incremental back-off which starts at 1 ms and is capped at 1 second between restart attempts. These parameters are currently hard-coded, but that could change if my supervisor options patch is accepted. The delay is handled asynchronously, so the supervisor is not unresponsive while a restart is delayed.
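
To make sure I am reading the description correctly, here is a minimal sketch of the delay schedule as I understand it. This is not the code from the patch: the module name backoff_sketch and both function names are made up for illustration, and doubling the delay on each failure is my assumption, since the description only says "incremental". The 1 ms and 1000 ms bounds are the figures quoted above.

```erlang
-module(backoff_sketch).
-export([next_delay/1, schedule/1]).

%% Bounds in milliseconds, taken from the quoted description.
-define(MIN_DELAY, 1).
-define(MAX_DELAY, 1000).

%% Delay to wait before the next restart attempt, given the delay used
%% before the attempt that just failed. 0 means the immediate restart
%% failed. Doubling is an assumption, not something the patch states.
next_delay(0) ->
    ?MIN_DELAY;
next_delay(Prev) when is_integer(Prev), Prev > 0 ->
    erlang:min(Prev * 2, ?MAX_DELAY).

%% First N delays of the schedule, to see how quickly it reaches the cap.
schedule(N) when is_integer(N), N >= 0 ->
    schedule(N, 0, []).

schedule(0, _Prev, Acc) ->
    lists:reverse(Acc);
schedule(N, Prev, Acc) ->
    Delay = next_delay(Prev),
    schedule(N - 1, Delay, [Delay | Acc]).
```

Under that assumption, backoff_sketch:schedule(12) returns [1,2,4,8,16,32,64,128,256,512,1000,1000], so the cap is reached on the eleventh consecutive failure.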

Isn't this a feature that would hide errors, since all resources should be freed within the behavior's terminate function?  The supervisor controls how long the terminate function may take, and that puts an upper bound on the time between restarts.  However, if you allow the time between restarts to grow, it seems like you would just be hiding bugs in code that should have freed its resources but did not.  So this feature seems contrary to a "fail fast" mentality.
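
For reference, the timeout I have in mind is the shutdown value in the child specification. A minimal sketch, where shutdown_sketch_sup and my_worker are hypothetical module names and the numbers are arbitrary:

```erlang
-module(shutdown_sketch_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% Intensity/period: give up after more than 5 restarts in 10 seconds.
    RestartStrategy = {one_for_one, 5, 10},
    %% my_worker is a hypothetical gen_server. 5000 is the shutdown
    %% timeout in ms: when the supervisor stops the child, it sends
    %% exit(Child, shutdown) and waits this long before falling back
    %% to exit(Child, kill).
    Child = {my_worker, {my_worker, start_link, []},
             permanent, 5000, worker, [my_worker]},
    {ok, {RestartStrategy, [Child]}}.
```

Note that for a gen_server child, terminate/2 only runs on a supervisor-initiated shutdown if the process traps exits.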