[erlang-patches] delayed child restart with incremental back-off

Richard Carlsson carlsson.richard@REDACTED
Wed Jan 4 12:02:29 CET 2012


On 01/04/2012 02:00 AM, Michael Truog wrote:
> On 01/03/2012 02:55 PM, Richard Carlsson wrote:
>> Asynchronous delayed child restart with incremental back-off
>>
>> Use a queue instead of a list to track restarts (avoiding linear
>> time complexity), and delay restarting of children incrementally
>> from min_delay (at least 1 ms) up to max_delay if an immediate
>> restart fails. The maximum number of restarts is still decided by
>> the intensity/period parameters, but this makes it possible to
>> control how fast (or slow) restarts will happen.
>
> Isn't this a feature that would hide errors, since all resources
> should be freed within a behavior terminate function?  The supervisor
> controls how long the terminate function could take, and that
> provides a maximum time in-between restarts.  However, if you allow
> the time in-between restarts to grow, it seems like you would just be
> hiding errors with the resources that should have been freed, but
> were not due to buggy code.  So, it seems like this feature would be
> contrary to a "fail fast" mentality.

The behaviour's terminate function is, alas, not always called, depending on
the nature of the crash. There are also other things that can delay the
freeing of a resource; for example, the OS could hold on to a port for a 
brief time, preventing a TCP server from restarting. I think that user 
code should indeed fail fast, but supervisors are a different matter. A 
supervisor which does 1000 restart attempts in the blink of an eye and 
then gives up and crashes the node is indeed a way to fail fast and 
spectacularly, but is probably not what the programmer expected when he 
said "retry at most 1000 times".
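
To make the point concrete, here is a minimal sketch (in Python, not the actual supervisor code) of an intensity/period restart limit tracked with a queue. The names RestartWindow and simulate are hypothetical, and the rule "give up once more than `intensity` restarts fall within `period`" is an assumption about how the limit is applied.

```python
from collections import deque


class RestartWindow:
    """Sliding-window restart limit: at most `intensity` restarts per `period_ms`.

    Hypothetical illustration only; not the OTP supervisor implementation.
    """

    def __init__(self, intensity, period_ms):
        self.intensity = intensity
        self.period_ms = period_ms
        self.restarts = deque()  # restart timestamps in ms, oldest on the left

    def add_restart(self, now_ms):
        """Record a restart at `now_ms`; return False once the limit is exceeded."""
        self.restarts.append(now_ms)
        # Drop timestamps that have fallen out of the window. With a queue each
        # timestamp is appended once and popped at most once.
        while self.restarts and self.restarts[0] < now_ms - self.period_ms:
            self.restarts.popleft()
        return len(self.restarts) <= self.intensity


def simulate(intensity, period_ms, delay_ms, max_attempts=10_000):
    """Restart a child that always fails, pausing `delay_ms` between attempts.

    Returns (attempt, elapsed_ms) at the point where the limit is exceeded,
    or None if the limit is never reached within `max_attempts`.
    """
    window = RestartWindow(intensity, period_ms)
    now_ms = 0
    for attempt in range(1, max_attempts + 1):
        if not window.add_restart(now_ms):
            return attempt, now_ms
        now_ms += delay_ms
    return None
```

With an intensity of 1000 and a period of 60 s, simulate(1000, 60_000, 0) gives up on the 1001st attempt with no time having passed at all, whereas a fixed 100 ms pause keeps at most 601 restarts inside the window, so the limit is never reached.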

So I think that the supervisor should support delays between restart
attempts. The question is, of course, what the defaults should be: starting
at 1 ms seems good, but where should the back-off be capped by default?
As much as 1 second between attempts, or perhaps as low as 10 ms? (If 
you set the cap to 0, you get the old behaviour.)
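
The effect of the cap can be estimated with a small sketch, again in Python rather than Erlang. It assumes that the first restart is immediate, that the delay then starts at min_delay and doubles after each failed attempt, and that it stops growing at max_delay; the doubling factor and the name backoff_delays are assumptions made for illustration, not details taken from the patch.

```python
from itertools import islice


def backoff_delays(min_delay_ms=1, max_delay_ms=1000, factor=2):
    """Yield the pause (in ms) before each successive restart attempt.

    Hypothetical schedule: immediate first attempt, then min_delay_ms,
    multiplied by `factor` after every failure and capped at max_delay_ms.
    A cap of 0 disables the back-off entirely (every restart is immediate).
    """
    yield 0  # the first restart attempt is immediate
    if max_delay_ms <= 0:
        while True:
            yield 0
    delay = min(max(1, min_delay_ms), max_delay_ms)  # at least 1 ms
    while True:
        yield delay
        delay = min(delay * factor, max_delay_ms)


# First few pauses with a 10 ms cap.
print(list(islice(backoff_delays(1, 10), 8)))

# Total time spent waiting across 1000 restart attempts, for three caps.
for cap_ms in (0, 10, 1000):
    total_ms = sum(islice(backoff_delays(1, cap_ms), 1000))
    print(cap_ms, total_ms)
```

Under these assumptions, 1000 failed restarts take about 10 seconds in total with a 10 ms cap (9965 ms), about 16.5 minutes with a 1 second cap (990023 ms), and no time at all with a cap of 0.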

    /Richard


