[erlang-questions] Incorrect supervisor behaviour and crash when mixing temporary and permanent children

Frédéric Trottier-Hébert <>
Wed Aug 3 18:42:25 CEST 2011


This depends on the relationship they have. There's no one-size-fits-all approach there, but I would say that it is likely that you want short-lived permanent task usually higher up in the tree hierarchy than short-lived tasks. However, there is nothing keeping you from putting them under the same spot for some reason.

In any case, temporary processes should never be restarted (especially given the supervisor module doesn't track the Args anymore). I've submitted a patch for the behaviour here: http://erlang.org/pipermail/erlang-patches/2011-August/002248.html

--
Fred Hébert
http://www.erlang-solutions.com



On 2011-08-03, at 12:38 PM, OvermindDL1 wrote:

> From my understanding, permanent and temporary processes should be under different supervisors, generally the temp supervisor under the perm supervisor.
> 
> On Aug 3, 2011 6:21 AM, "Frédéric Trottier-Hébert" <> wrote:
> > 
> > 
> > On 2011-08-02, at 21:13 PM, Sam Bobroff wrote:
> > 
> >> Hello Erlangers,
> >> 
> >> I've recently tracked down a bug in some code that seems to be caused by a problem with the supervisor module. What seems to be happening is that when a temporary child has been added to a one_for_all supervisor using start_child, if a permanent child of that supervisor exits then if that permanent child exits, the supervisor attempts to restart the temporary child (which seems counter to the documentation).
> > 
> > That one is fine. The 'one_for_all' is about each process supervised. In that case, the permanent child means 'when I crash, get the one_for_all strategy'. That strategy is applied no matter what the other kinds of children are. On the other hand, in that same supervisor, if the temporary child dies, then the permanent one won't be restarted because that restart strategy means 'in my case, I'm not that vital and you can ignore me dying'. As far as I know, this is expected behaviour.
> > 
> > 
> >> In addition the MFA for the child has been replaced with undefined and start attempt crashes, causing the supervisor to continue attempting restarts until it reaches it's restart intensity and shuts down.
> >> 
> >> This does not happen if the temporary child is added to the supervisor as part of the supervisor's initial spec. In this case the temporary child is restart but it's MFA is present so it is able to restart it successfully, although it still seems contrary to the documentation that the temporary child is restarted at all, although it depends on which section of the documentation you regard as more important (from the supervisor documentation):
> > 
> > Ah, that one gets trickier. If I take a look at the source, the issue seems to be in the supervisor's 'save_child' function. This one has this comment:
> > 
> > %% Note we do not want to save the parameter list for temporary processes as
> > %% they will not be restarted, and hence we do not need this information.
> > %% Especially for dynamic children to simple_one_for_one supervisors
> > %% it could become very costly as it is not uncommon to spawn
> > %% very many such processes.
> > 
> > So the arguments that you pass to 'proc_lib:start_link' get dropped. This check DOES NOT happen when the Child Spec is part of the initial supervisor specification.
> > 
> > One issue here is that we have two different ways of starting temporary children. The real issue however, seems to be that we assume that temporary processes will NOT be restarted, while the supervisor does attempt to do so.
> > 
> >> 
> >> 
> >> one_for_all - if one child process terminates and should be restarted, all other child processes are terminated and then all child processes are restarted.
> >> 
> >> This is quite clear: "all child processes are restarted". But later:
> >> 
> >> 
> >> Restart defines when a terminated child process should be restarted. A permanent child process should always be restarted, a temporary child process should never be restarted and a transient child process should be restarted only if it terminates abnormally, i.e. with another exit reason than normal.
> >> 
> >> 
> >> 
> >> Again clear but conflicting: "a temporary child process should never be restarted".
> > 
> > These rules are about the processes themselves as far as I know. It's well possible to expect a transient process to be restarted if it depends on a permanent one that crashed. I'm not sure I get why this would not be the case with a temporary child.
> >> 
> >> I found a similar issue mentioned in a bug report for R14B02, here:
> >> 
> >> http://erlang.org/pipermail/erlang-bugs/2011-March/002273.html
> > 
> > This is an interesting entry because it does suggest a fix that's still there and helps with the issue, but only when the process that crashed was the one we were looking for. The function now looks like this:
> > 
> > del_child(Name, [Ch|Chs]) when Ch#child.name =:= Name, Ch#child.restart_type =:= temporary ->
> > Chs;
> > del_child(Name, [Ch|Chs]) when Ch#child.name =:= Name ->
> > [Ch#child{pid = undefined} | Chs];
> > del_child(Pid, [Ch|Chs]) when Ch#child.pid =:= Pid, Ch#child.restart_type =:= temporary ->
> > Chs;
> > del_child(Pid, [Ch|Chs]) when Ch#child.pid =:= Pid ->
> > [Ch#child{pid = undefined} | Chs];
> > del_child(Name, [Ch|Chs]) ->
> > [Ch|del_child(Name, Chs)];
> > del_child(_, []) ->
> > [].
> > 
> > The issue there is that we're still giving temporary children in the list of processes to be restarted if they were not the ones to crash (Name/Pid). Then the sequent calls to 'terminate_children' and 'do_terminate_children' as part of the restart procedure also never clear these up, and the 'start_children' function doesn't do it either.
> > 
> > In normal cases, it seems the supervisor will use state_del_child as a function to do it, which doesn't happen here (we're using del_child). 
> > ---
> > 
> > If anyone from the OTP team is looking, it'd be nice to have a confirm/deny on that explanation. I've tried a small patch, but haven't tested it in the large or recompiled anything:
> > 
> > 737a738,740
> >> terminate_children([Child = #child{restart_type=temporary} | Children], SupName, Res) ->
> >> do_terminate(Child, SupName),
> >> terminate_children(Children, SupName, Res);
> > 
> > Just adding these lines to R14B03 seems to resolve the issue by avoiding to restart the child in the first place.
> > I can make a standard patch and bug submission if required/appreciated.
> > 
> > --
> > Fred Hébert
> > http://www.erlang-solutions.com
> > 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20110803/ddd53e2a/attachment.html>


More information about the erlang-questions mailing list