[erlang-questions] Incorrect supervisor behaviour and crash when mixing temporary and permanent children
OvermindDL1
overminddl1@REDACTED
Wed Aug 3 18:38:41 CEST 2011
>From my understanding, permanent and temporary processes should be under
different supervisors, generally the temp supervisor under the perm
supervisor.
On Aug 3, 2011 6:21 AM, "Frédéric Trottier-Hébert" <
fred.hebert@REDACTED> wrote:
>
>
> On 2011-08-02, at 21:13 PM, Sam Bobroff wrote:
>
>> Hello Erlangers,
>>
>> I've recently tracked down a bug in some code that seems to be caused by
a problem with the supervisor module. What seems to be happening is that
when a temporary child has been added to a one_for_all supervisor using
start_child, if a permanent child of that supervisor exits then if that
permanent child exits, the supervisor attempts to restart the temporary
child (which seems counter to the documentation).
>
> That one is fine. The 'one_for_all' is about each process supervised. In
that case, the permanent child means 'when I crash, get the one_for_all
strategy'. That strategy is applied no matter what the other kinds of
children are. On the other hand, in that same supervisor, if the temporary
child dies, then the permanent one won't be restarted because that restart
strategy means 'in my case, I'm not that vital and you can ignore me dying'.
As far as I know, this is expected behaviour.
>
>
>> In addition the MFA for the child has been replaced with undefined and
start attempt crashes, causing the supervisor to continue attempting
restarts until it reaches it's restart intensity and shuts down.
>>
>> This does not happen if the temporary child is added to the supervisor as
part of the supervisor's initial spec. In this case the temporary child is
restart but it's MFA is present so it is able to restart it successfully,
although it still seems contrary to the documentation that the temporary
child is restarted at all, although it depends on which section of the
documentation you regard as more important (from the supervisor
documentation):
>
> Ah, that one gets trickier. If I take a look at the source, the issue
seems to be in the supervisor's 'save_child' function. This one has this
comment:
>
> %% Note we do not want to save the parameter list for temporary processes
as
> %% they will not be restarted, and hence we do not need this information.
> %% Especially for dynamic children to simple_one_for_one supervisors
> %% it could become very costly as it is not uncommon to spawn
> %% very many such processes.
>
> So the arguments that you pass to 'proc_lib:start_link' get dropped. This
check DOES NOT happen when the Child Spec is part of the initial supervisor
specification.
>
> One issue here is that we have two different ways of starting temporary
children. The real issue however, seems to be that we assume that temporary
processes will NOT be restarted, while the supervisor does attempt to do so.
>
>>
>>
>> one_for_all - if one child process terminates and should be restarted,
all other child processes are terminated and then all child processes are
restarted.
>>
>> This is quite clear: "all child processes are restarted". But later:
>>
>>
>> Restart defines when a terminated child process should be restarted. A
permanent child process should always be restarted, a temporary child
process should never be restarted and a transient child process should be
restarted only if it terminates abnormally, i.e. with another exit reason
than normal.
>>
>>
>>
>> Again clear but conflicting: "a temporary child process should never be
restarted".
>
> These rules are about the processes themselves as far as I know. It's well
possible to expect a transient process to be restarted if it depends on a
permanent one that crashed. I'm not sure I get why this would not be the
case with a temporary child.
>>
>> I found a similar issue mentioned in a bug report for R14B02, here:
>>
>> http://erlang.org/pipermail/erlang-bugs/2011-March/002273.html
>
> This is an interesting entry because it does suggest a fix that's still
there and helps with the issue, but only when the process that crashed was
the one we were looking for. The function now looks like this:
>
> del_child(Name, [Ch|Chs]) when Ch#child.name =:= Name,
Ch#child.restart_type =:= temporary ->
> Chs;
> del_child(Name, [Ch|Chs]) when Ch#child.name =:= Name ->
> [Ch#child{pid = undefined} | Chs];
> del_child(Pid, [Ch|Chs]) when Ch#child.pid =:= Pid, Ch#child.restart_type
=:= temporary ->
> Chs;
> del_child(Pid, [Ch|Chs]) when Ch#child.pid =:= Pid ->
> [Ch#child{pid = undefined} | Chs];
> del_child(Name, [Ch|Chs]) ->
> [Ch|del_child(Name, Chs)];
> del_child(_, []) ->
> [].
>
> The issue there is that we're still giving temporary children in the list
of processes to be restarted if they were not the ones to crash (Name/Pid).
Then the sequent calls to 'terminate_children' and 'do_terminate_children'
as part of the restart procedure also never clear these up, and the
'start_children' function doesn't do it either.
>
> In normal cases, it seems the supervisor will use state_del_child as a
function to do it, which doesn't happen here (we're using del_child).
> ---
>
> If anyone from the OTP team is looking, it'd be nice to have a
confirm/deny on that explanation. I've tried a small patch, but haven't
tested it in the large or recompiled anything:
>
> 737a738,740
>> terminate_children([Child = #child{restart_type=temporary} | Children],
SupName, Res) ->
>> do_terminate(Child, SupName),
>> terminate_children(Children, SupName, Res);
>
> Just adding these lines to R14B03 seems to resolve the issue by avoiding
to restart the child in the first place.
> I can make a standard patch and bug submission if required/appreciated.
>
> --
> Fred Hébert
> http://www.erlang-solutions.com
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20110803/59087dac/attachment.htm>
More information about the erlang-questions
mailing list