[erlang-patches] delayed child restart with incremental back-off
Richard Carlsson
carlsson.richard@REDACTED
Tue Jan 3 23:55:25 CET 2012
Asynchronous delayed child restart with incremental back-off
Use a queue instead of a list to track restarts (avoiding linear time
complexity), and delay restarting of children incrementally from
min_delay (at least 1 ms) up to max_delay if an immediate restart fails.
The maximum number of restarts is still decided by the intensity/period
parameters, but this makes it possible to control how fast (or slow)
restarts will happen.
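
As a rough illustration of the queue-based restart tracking described above (a sketch only, not the code in the patch): the module name restart_window, the function names, the explicit counter, and the use of millisecond timestamps are all assumptions made for this example.

```erlang
-module(restart_window).
-export([new/0, add_restart/4]).

%% Hypothetical helper, NOT taken from the patch.
%% State is {Count, Queue}: Queue holds restart timestamps (ms),
%% oldest first, and Count is kept alongside it so that we never
%% have to call queue:len/1, which is linear in the queue length.

new() ->
    {0, queue:new()}.

%% Record a restart at time Now (ms). Returns {ok, State} while at
%% most Intensity restarts fall within the last Period ms, and
%% {shutdown, State} once that limit is exceeded.
add_restart(Now, Intensity, Period, {N0, Q0}) ->
    {N1, Q1} = drop_expired(Now - Period, N0 + 1, queue:in(Now, Q0)),
    case N1 =< Intensity of
        true  -> {ok, {N1, Q1}};
        false -> {shutdown, {N1, Q1}}
    end.

%% Expired entries can only sit at the front of the queue, so pruning
%% touches just those entries instead of scanning every restart.
drop_expired(Cutoff, N, Q) ->
    case queue:peek(Q) of
        {value, T} when T < Cutoff ->
            drop_expired(Cutoff, N - 1, queue:drop(Q));
        _ ->
            {N, Q}
    end.
```

With a list, every new restart means filtering the whole list to discard old entries; with a queue, old entries are removed from the front and new ones added at the rear.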
git fetch git://github.com/richcarl/otp.git supervisor-restart-delay
This fixes a problem where a supervisor tries to restart a crashed child
but, for various reasons (in particular, because modern Erlang on a
multicore machine has far more concurrency going on), some resources
that were held by the dead process have not yet been released. This can
happen even with a simple registered name. The restart then fails, and
immediately fails again, and again, until the supervisor gives up and
shuts down. This can *bring down the whole node*, because the restart
attempts are exhausted within an extremely short time period, even if
you have multiple supervisors. Usually, waiting for just
a millisecond is enough to let the resource be released so the process
can restart, but sometimes a bit longer is needed. This patch implements
an incremental back-off which starts at 1 ms and is capped at 1 second
between restart attempts. These parameters are currently hard-coded, but
that could change if my supervisor options patch is accepted. The delay
is handled asynchronously, so the supervisor stays responsive while a
restart is delayed.
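
To make the back-off and the asynchronous delay concrete, here is a minimal sketch. It assumes the delay doubles on each failed attempt, which is only one possible way to increment it and may not be what the patch does. The module name backoff_sketch, the function names, and the {delayed_restart, ChildId} message are hypothetical; only the 1 ms and 1 second bounds come from the description above.

```erlang
-module(backoff_sketch).
-export([next_delay/1, schedule_restart/2]).

-define(MIN_DELAY, 1).     % ms: first delay after a failed immediate restart
-define(MAX_DELAY, 1000).  % ms: cap on the delay between attempts

%% A delay of 0 stands for the immediate restart attempt.
%% ASSUMPTION: the delay doubles each time; the real increment may differ.
next_delay(0) ->
    ?MIN_DELAY;
next_delay(Delay) ->
    min(Delay * 2, ?MAX_DELAY).

%% Instead of sleeping, ask the runtime to send a message to the
%% calling process after Delay ms. The caller keeps handling other
%% messages in the meantime and retries the restart when
%% {delayed_restart, ChildId} arrives. Returns the timer reference.
schedule_restart(ChildId, Delay) ->
    erlang:send_after(Delay, self(), {delayed_restart, ChildId}).
```

Under this doubling assumption the delays run 1, 2, 4, ..., 512, 1000 ms, and the process that called schedule_restart/2 is never blocked while waiting.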
/Richard