[erlang-patches] delayed child restart with incremental back-off

Wed Jan 4 18:27:57 CET 2012

On 01/04/2012 03:02 AM, Richard Carlsson wrote:
> On 01/04/2012 02:00 AM, Michael Truog wrote:
>> On 01/03/2012 02:55 PM, Richard Carlsson wrote:
>>> Asynchronous delayed child restart with incremental back-off
>>>
>>> Use a queue instead of a list to track restarts (avoiding linear
>>> time complexity), and delay restarting of children incrementally
>>> from min_delay (at least 1 ms) up to max_delay if an immediate
>>> restart fails. The maximum number of restarts is still decided by
>>> the intensity/period parameters, but this makes it possible to
>>> control how fast (or slow) restarts will happen.
>>
>> Isn't this a feature that would hide errors, since all resources
>> should be freed within a behavior terminate function?  The supervisor
>> controls how long the terminate function could take, and that
>> provides a maximum time in-between restarts.  However, if you allow
>> the time in-between restarts to grow, it seems like you would just be
>> hiding errors with the resources that should have been freed, but
>> were not due to buggy code.  So, it seems like this feature would be
>> contrary to a "fail fast" mentality.
>
> The behaviour terminate function is alas not always called, depending on the nature of the crash. And there are other things that can delay the freeing of a resource; for example, the OS could hold on to a port for a brief time, preventing a TCP server from restarting. I think that user code should indeed fail fast, but supervisors are a different matter. A supervisor which does 1000 restart attempts in the blink of an eye and then gives up and crashes the node is indeed a way to fail fast and spectacularly, but is probably not what the programmer expected when he said "retry at most 1000 times".
>
> So I think that the supervisor should support delays between restart attempts. A question is of course what the defaults should be: starting at 1 ms seems good, but where should the backoff be capped by default? As much as 1 second between attempts, or perhaps as low as 10 ms? (If you set the cap to 0, you get the old behaviour.)
>

I am not a decision maker for Erlang/OTP, but this patch bothers me because it looks like solving a concurrency problem by inserting a sleep statement, which I have always considered the wrong approach to a synchronous/blocking problem.  I think it is better to fix the problems that prevent resources from being cleaned up quickly.  The only time a terminate function is not called is on a kill exit, and in that case it is skipped because of a higher-level failure, which makes it irrelevant.

I agree that a supervisor doing 1000 restarts immediately can be a concern for any programmer, but this does not seem (to me) like a good approach to the problem: it increases complexity in the middle of the critical supervision structure with what reduces to a sleep statement used to solve a concurrency problem.
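
For reference, here is a minimal Python sketch of the delay schedule described in the quoted patch summary, to make concrete what is being debated. It is not the patch's code. The doubling step and the names backoff_delays, min_delay_ms and max_delay_ms are assumptions for illustration only; the summary says just that the delay grows incrementally from min_delay (at least 1 ms) up to max_delay after an immediate restart fails, and that a cap of 0 gives the old behaviour.

```python
def backoff_delays(attempts, min_delay_ms=1, max_delay_ms=1000):
    """Return the delay in ms applied before each restart attempt.

    Assumption: the delay doubles after every failed attempt. The patch
    summary only says "incrementally", so the real step may differ.
    """
    if max_delay_ms <= 0:
        # A cap of 0 disables delays: every restart is immediate (old behaviour).
        return [0] * attempts

    delays = []
    delay = 0  # the first restart attempt is immediate
    for _ in range(attempts):
        delays.append(delay)
        if delay == 0:
            # First failure: start at min_delay, which is at least 1 ms.
            next_delay = max(1, min_delay_ms)
        else:
            # Later failures: double the previous delay (assumed step).
            next_delay = delay * 2
        # Never exceed the configured cap.
        delay = min(next_delay, max_delay_ms)
    return delays


if __name__ == "__main__":
    print(backoff_delays(6, min_delay_ms=1, max_delay_ms=10))
    print(sum(backoff_delays(1000, max_delay_ms=10)), "ms total with a 10 ms cap")
    print(sum(backoff_delays(1000, max_delay_ms=1000)), "ms total with a 1 s cap")
```

Under these assumptions, 1000 consecutive failed attempts accumulate roughly 10 seconds of delay with a 10 ms cap and roughly 16.5 minutes with a 1 second cap, which is the scale of the default being asked about above.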