[erlang-patches] delayed child restart with incremental back-off
Richard Carlsson
carlsson.richard@REDACTED
Wed Jan 4 12:02:29 CET 2012
On 01/04/2012 02:00 AM, Michael Truog wrote:
> On 01/03/2012 02:55 PM, Richard Carlsson wrote:
>> Asynchronous delayed child restart with incremental back-off
>>
>> Use a queue instead of a list to track restarts (avoiding linear
>> time complexity), and delay restarting of children incrementally
>> from min_delay (at least 1 ms) up to max_delay if an immediate
>> restart fails. The maximum number of restarts is still decided by
>> the intensity/period parameters, but this makes it possible to
>> control how fast (or slow) restarts will happen.
>
> Isn't this a feature that would hide errors, since all resources
> should be freed within a behavior terminate function? The supervisor
> controls how long the terminate function could take, and that
> provides a maximum time in-between restarts. However, if you allow
> the time in-between restarts to grow, it seems like you would just be
> hiding errors with the resources that should have been freed, but
> were not due to buggy code. So, it seems like this feature would be
> contrary to a "fail fast" mentality.

Alas, the behaviour's terminate function is not always called; that
depends on the nature of the crash. Other things can also delay the
freeing of a resource: for example, the OS could hold on to a port for a
brief time, preventing a TCP server from restarting. I think that user
code should indeed fail fast, but supervisors are a different matter. A
supervisor that makes 1000 restart attempts in the blink of an eye and
then gives up and crashes the node certainly fails fast and
spectacularly, but that is probably not what the programmer expected
when asking it to "retry at most 1000 times".
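
To make the intensity/period rule concrete, here is a minimal sketch in Python (not the supervisor's actual Erlang code). The class name RestartBudget and its method are hypothetical; the only thing taken from the discussion is the idea of keeping restart times in a queue and giving up once more than `intensity` restarts fall within `period` seconds.

```python
from collections import deque


class RestartBudget:
    """Hypothetical model of a supervisor's intensity/period check."""

    def __init__(self, intensity, period_s):
        self.intensity = intensity    # max restarts tolerated per period
        self.period_s = period_s      # length of the sliding window, seconds
        self.restarts = deque()       # restart timestamps, oldest first

    def allow_restart(self, now_s):
        """Record a restart at now_s; return False if the budget is blown."""
        self.restarts.append(now_s)
        # Drop timestamps that have fallen out of the window. Each entry is
        # appended once and popped at most once, so no list scan is needed.
        while self.restarts and self.restarts[0] < now_s - self.period_s:
            self.restarts.popleft()
        return len(self.restarts) <= self.intensity


# Four back-to-back restarts against a budget of 3 per second:
fast = RestartBudget(intensity=3, period_s=1.0)
fast_results = [fast.allow_restart(t) for t in (0.0, 0.001, 0.002, 0.003)]

# The same four restarts spaced half a second apart:
slow = RestartBudget(intensity=3, period_s=1.0)
slow_results = [slow.allow_restart(t) for t in (0.0, 0.5, 1.0, 1.5)]
```

In this model the back-to-back restarts exhaust the budget on the fourth attempt, while the spaced-out ones never do, because old entries leave the window before new ones arrive.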

So I think the supervisor should support delays between restart
attempts. The question, of course, is what the defaults should be:
starting at 1 ms seems good, but where should the back-off be capped by
default? As high as 1 second between attempts, or perhaps as low as
10 ms? (Setting the cap to 0 gives the old behaviour.)
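
To get a feel for the two candidate caps, here is a rough Python sketch of a delay schedule. It assumes the delay doubles after each failed attempt, which is only an assumption for illustration; the growth rule is not stated above. The function name restart_delays and its parameters are likewise hypothetical.

```python
def restart_delays(attempts, min_delay_ms=1, max_delay_ms=1000, factor=2):
    """Return the delay in ms waited before each restart attempt.

    The first attempt is immediate. Later delays start at min_delay_ms
    (at least 1 ms) and grow by `factor` (assumed: doubling) until they
    reach max_delay_ms. A cap of 0 disables delays (the old behaviour).
    """
    delays = []
    delay = 0
    for _ in range(attempts):
        delays.append(delay)
        if max_delay_ms == 0:
            continue                      # no back-off at all
        if delay == 0:
            delay = min(max(min_delay_ms, 1), max_delay_ms)
        else:
            delay = min(delay * factor, max_delay_ms)
    return delays


first_six = restart_delays(6, max_delay_ms=10)         # [0, 1, 2, 4, 8, 10]
total_cap_10ms = sum(restart_delays(1000, max_delay_ms=10))
total_cap_1s = sum(restart_delays(1000, max_delay_ms=1000))
total_no_cap = sum(restart_delays(1000, max_delay_ms=0))
```

Under these assumptions, 1000 consecutive failed attempts take about 10 seconds of waiting with a 10 ms cap, and about 16.5 minutes with a 1 second cap. Whether all 1000 attempts actually happen still depends on the intensity/period settings.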
/Richard