[erlang-patches] delayed child restart with incremental back-off

Thu Jan 5 20:06:09 CET 2012

On 01/04/2012 06:27 PM, Michael Truog wrote:
> I am not a decision maker with Erlang/OTP, but this patch just
> bothered me since it seems like solving a concurrency problem by
> inserting a sleep statement, which I have always thought is a wrong
> approach (to a synchronous/blocking problem).  I think it is better
> to solve the problems which prevents any resources from being
> cleaned-up quickly, since the only time a terminate function will not
> be called is with the kill exit exception, so the terminate function
> would not be called because of a higher-level failure which makes it
> irrelevant.  I agree that having a supervisor doing 1000 restarts
> immediately can be a concern for any programmer, but this just
> doesn't seem like a good approach to the problem (to me), since you
> are increasing the complexity in the middle of the critical
> supervision structure with what reduces to a sleep statement used to
> solve a concurrency problem.

In general, yes, synchronization issues should not be solved naively by 
sleeping. But the restart problem depends on a race condition that the 
programmer has no control over. Regardless of whether the terminate 
callback gets called, shared resources by definition live outside the 
process that allocates them. (This includes locally registered names, 
even though their allocation is done by the gen_server code and the 
freeing is implicit when the process dies.) In a multicore environment, 
this means that the freeing is not instantaneous, but can be handled by 
a thread on one CPU (and might not even be scheduled yet, or is waiting 
on a lock) while the supervisor is running in another thread on another 
CPU and has already proceeded to try to restart the worker, which will 
fail since there is currently no mimimum delay at all between attemtps 
and the limit is reached before the freeing thread has had time to run. 
In some cases (ethernet, locks, database transactions, ...) a simple 
delay and retry is the best solution, and I think supervision is such a 
case. In order to solve it in the user code, you'd have to program very 
defensively and make sure to implement your own retry loops for each 
point in the code where you try to grab a resource. That's not the "let 
it crash" way (and you'd still be solving the problem by sleeping).

It's also nice to be able to see the occurrence of such repeated 
attempts in the logs, which lets you locate the issues and see if you 
can do something to avoid them. Definitely better than seeing a row of 
100 failed restarts all with the same timestamp, after which the 
supervisor gave up and the node died.

Of course it is sensitive to make changes in the supervisor code, so 
this change obviously needs proper reviewing and testing, but hopefully, 
it will then "just work" and solve this basic restart issue for all 
users forever. While it is being evaluated, one could set the default 
upper limit for delays to zero, so that supervisors in existing code 
behave exactly as before; with the supervisor proplists patch I posted, 
it would then be possible to activate delayed restarts on selected 
supervisors in new code. Eventually, you could change this to make 
delayed restarts (with some tried and trusted back-off parameters) be 
the default in OTP.

     /Richard