[erlang-patches] delayed child restart with incremental back-off

Richard Carlsson carlsson.richard@REDACTED
Tue Jan 3 23:55:25 CET 2012


Asynchronous delayed child restart with incremental back-off

Use a queue instead of a list to track restarts (avoiding linear time 
complexity), and delay restarting of children incrementally from 
min_delay (at least 1 ms) up to max_delay if an immediate restart fails. 
The maximum number of restarts is still decided by the intensity/period 
parameters, but this makes it possible to control how fast (or slow) 
restarts will happen.
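
To illustrate the queue idea, here is a minimal sketch of how restart 
timestamps could be kept in a queue so that expired entries are dropped 
from the front in amortized constant time. It is not the code from the 
branch: the module name, the record, and its field names are made up for 
this example, and timestamps are assumed to be plain integers in 
milliseconds. Only the standard queue module is used.

```erlang
-module(restart_window).
-export([new/2, add_restart/2]).

%% Hypothetical state record; the names are illustrative only and are
%% not taken from supervisor.erl or from the patch.
-record(window, {intensity, period_ms, count, times}).

%% Intensity = max restarts allowed, PeriodSeconds = length of the window.
new(Intensity, PeriodSeconds) ->
    #window{intensity = Intensity,
            period_ms = PeriodSeconds * 1000,
            count = 0,
            times = queue:new()}.

%% Record a restart at time NowMs (assumed: integer milliseconds).
%% Returns {ok, Window} while within the limit, or {shutdown, Window}
%% once more than Intensity restarts fall inside the period.
add_restart(NowMs, W0 = #window{times = Q, count = N}) ->
    W1 = expire(NowMs, W0#window{times = queue:in(NowMs, Q),
                                 count = N + 1}),
    case W1#window.count > W1#window.intensity of
        true  -> {shutdown, W1};
        false -> {ok, W1}
    end.

%% Drop timestamps older than the period from the front of the queue.
%% Each timestamp is enqueued once and dequeued at most once, and the
%% count is kept separately so queue:len/1 is never needed.
expire(NowMs, W = #window{times = Q, count = N, period_ms = P}) ->
    case queue:peek(Q) of
        {value, T} when NowMs - T > P ->
            expire(NowMs, W#window{times = queue:drop(Q), count = N - 1});
        _ ->
            W
    end.
```

With intensity 2 and period 1 second, three restarts at 0, 10 and 20 ms 
exceed the limit, whereas a third restart at 1500 ms does not, because 
the two earlier timestamps have expired by then.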

  git fetch git://github.com/richcarl/otp.git supervisor-restart-delay

This fixes a problem that occurs when a supervisor tries to restart a 
crashed child before some resource that was held by the dead process has 
been released. This can happen even with a simple registered name, and 
it is more likely now that modern Erlang on a multicore machine has far 
more concurrency going on. The restart fails, then immediately fails 
again, and again, until the supervisor gives up and shuts down. This can 
*bring down the whole node*, because the restart attempts get exhausted 
within an extremely short time period, even if you have multiple 
supervisors.

Usually, waiting for just a millisecond is enough to let the resource be 
released so the process can restart, but sometimes a bit longer is 
needed. This patch implements an incremental back-off which starts at 
1 ms and is capped at 1 second between restart attempts. These 
parameters are currently hard-coded, but that could change if my 
supervisor options patch is accepted. The delay is handled 
asynchronously, so the supervisor stays responsive while a restart is 
delayed.
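
As a rough illustration of the back-off and the asynchronous delay (not 
the code from the branch), the sketch below assumes that the delay 
doubles on each failed attempt; the actual growth scheme in the patch may 
differ. The module name, the {try_again, ...} message, and the convention 
that StartFun returns {ok, Pid} or {error, Reason} are all made up for 
this example. The key point is that a failed attempt schedules a message 
with erlang:send_after/3 instead of sleeping, so the calling process can 
keep handling other messages in the meantime.

```erlang
-module(restart_backoff).
-export([next_delay/1, attempt/2, handle_retry/1]).

-define(MIN_DELAY, 1).     % ms, lower bound given in the description
-define(MAX_DELAY, 1000).  % ms, cap given in the description

%% Delay to use after a failed attempt. 0 means the immediate restart
%% has just failed. Doubling is an assumption made for this sketch.
next_delay(0) -> ?MIN_DELAY;
next_delay(D) when D > 0 -> min(D * 2, ?MAX_DELAY).

%% Try to start the child. On failure, do not block: schedule a
%% {try_again, ...} message to ourselves and return at once.
attempt(StartFun, Delay) ->
    case StartFun() of
        {ok, Pid} ->
            {ok, Pid};
        {error, _Reason} ->
            Next = next_delay(Delay),
            erlang:send_after(Next, self(), {try_again, StartFun, Next}),
            {delayed, Next}
    end.

%% Meant to be called from the owner's message loop (for example from a
%% gen_server handle_info/2 clause) when the timer message arrives.
handle_retry({try_again, StartFun, Delay}) ->
    attempt(StartFun, Delay).
```

Under the doubling assumption the delays run 1, 2, 4, ..., 512 ms and 
then stay at 1000 ms. The sketch leaves out the intensity/period check, 
which would still decide when to give up.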

    /Richard


