[erlang-patches] delayed child restart with incremental back-off
Richard Carlsson
carlsson.richard@REDACTED
Thu Jan 5 20:06:09 CET 2012
On 01/04/2012 06:27 PM, Michael Truog wrote:
> I am not a decision maker with Erlang/OTP, but this patch just
> bothered me since it seems like solving a concurrency problem by
> inserting a sleep statement, which I have always thought is a wrong
> approach (to a synchronous/blocking problem). I think it is better
> to solve the problems which prevents any resources from being
> cleaned-up quickly, since the only time a terminate function will not
> be called is with the kill exit exception, so the terminate function
> would not be called because of a higher-level failure which makes it
> irrelevant. I agree that having a supervisor doing 1000 restarts
> immediately can be a concern for any programmer, but this just
> doesn't seem like a good approach to the problem (to me), since you
> are increasing the complexity in the middle of the critical
> supervision structure with what reduces to a sleep statement used to
> solve a concurrency problem.
In general, yes, synchronization issues should not be solved naively by
sleeping. But the restart problem depends on a race condition that the
programmer has no control over. Regardless of whether the terminate
callback gets called, shared resources by definition live outside the
process that allocates them. (This includes locally registered names,
even though their allocation is done by the gen_server code and the
freeing is implicit when the process dies.) In a multicore environment,
this means that the freeing is not instantaneous, but can be handled by
a thread on one CPU (and might not even be scheduled yet, or is waiting
on a lock) while the supervisor is running in another thread on another
CPU and has already proceeded to try to restart the worker, which will
fail since there is currently no mimimum delay at all between attemtps
and the limit is reached before the freeing thread has had time to run.
In some cases (ethernet, locks, database transactions, ...) a simple
delay and retry is the best solution, and I think supervision is such a
case. In order to solve it in the user code, you'd have to program very
defensively and make sure to implement your own retry loops for each
point in the code where you try to grab a resource. That's not the "let
it crash" way (and you'd still be solving the problem by sleeping).
It's also nice to be able to see the occurrence of such repeated
attempts in the logs, which lets you locate the issues and see if you
can do something to avoid them. Definitely better than seeing a row of
100 failed restarts all with the same timestamp, after which the
supervisor gave up and the node died.
Of course it is sensitive to make changes in the supervisor code, so
this change obviously needs proper reviewing and testing, but hopefully,
it will then "just work" and solve this basic restart issue for all
users forever. While it is being evaluated, one could set the default
upper limit for delays to zero, so that supervisors in existing code
behave exactly as before; with the supervisor proplists patch I posted,
it would then be possible to activate delayed restarts on selected
supervisors in new code. Eventually, you could change this to make
delayed restarts (with some tried and trusted back-off parameters) be
the default in OTP.
/Richard
More information about the erlang-patches
mailing list