<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><br><div><br><div><div>On 2011-08-02, at 21:13 PM, Sam Bobroff wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite">Hello Erlangers,<div><br></div><div>I've recently tracked down a bug in some code that seems to be caused by a problem with the supervisor module. What seems to be happening is that when a temporary child has been added to a one_for_all supervisor using start_child, if a permanent child of that supervisor exits then if that permanent child exits, the supervisor attempts to restart the temporary child (which seems counter to the documentation).</div></blockquote><div><br></div><div>That one is fine. The 'one_for_all' is about each process supervised. In that case, the permanent child means 'when I crash, get the one_for_all strategy'. That strategy is applied no matter what the other kinds of children are. On the other hand, in that same supervisor, if the temporary child dies, then the permanent one won't be restarted because that restart strategy means 'in my case, I'm not that vital and you can ignore me dying'. As far as I know, this is expected behaviour.</div><div><br></div><br><blockquote type="cite"><div>In addition the MFA for the child has been replaced with undefined and start attempt crashes, causing the supervisor to continue attempting restarts until it reaches it's restart intensity and shuts down.</div><div><br></div><div>This does not happen if the temporary child is added to the supervisor as part of the supervisor's initial spec. In this case the temporary child is restart but it's MFA is present so it is able to restart it successfully, although it still seems contrary to the documentation that the temporary child is restarted at all, although it depends on which section of the documentation you regard as more important (from the supervisor documentation):</div></blockquote><div><br></div><div>Ah, that one gets trickier. If I take a look at the source, the issue seems to be in the supervisor's 'save_child' function. This one has this comment:</div><div><br></div><div><div>%% Note we do not want to save the parameter list for temporary processes as</div><div>%% they will not be restarted, and hence we do not need this information.</div><div>%% Especially for dynamic children to simple_one_for_one supervisors</div><div>%% it could become very costly as it is not uncommon to spawn</div><div>%% very many such processes.</div><div><br></div></div><div>So the arguments that you pass to 'proc_lib:start_link' get dropped. This check DOES NOT happen when the Child Spec is part of the initial supervisor specification.</div><div><br></div><div>One issue here is that we have two different ways of starting temporary children. The real issue however, seems to be that we assume that temporary processes will NOT be restarted, while the supervisor does attempt to do so.</div><br><blockquote type="cite"><div><br></div><div><br></div><div><li><p><span class="code">one_for_all</span> - if one child process terminates and should be restarted, all other child processes are terminated and then all child processes are restarted.</p><p>This is quite clear: "all child processes are restarted". But later:</p><div><br class="webkit-block-placeholder"></div></li><li><p><span class="code">Restart</span> defines when a terminated child process should be restarted. A <span class="code">permanent</span> child process should always be restarted, a <span class="code">temporary</span> child process should never be restarted and a <span class="code">transient</span> child process should be restarted only if it terminates abnormally, i.e. with another exit reason than <span class="code">normal</span>.</p><div><br></div></li></div></blockquote><blockquote type="cite"><div><div><br class="webkit-block-placeholder"></div></div><div>Again clear but conflicting: "a <span class="code">temporary</span> child process should never be restarted".</div></blockquote><div><br></div>These rules are about the processes themselves as far as I know. It's well possible to expect a transient process to be restarted if it depends on a permanent one that crashed. I'm not sure I get why this would not be the case with a temporary child.<br><blockquote type="cite"><div><br></div><div>I found a similar issue mentioned in a bug report for R14B02, here:</div><div><br></div><div><a href="http://erlang.org/pipermail/erlang-bugs/2011-March/002273.html">http://erlang.org/pipermail/erlang-bugs/2011-March/002273.html</a></div></blockquote><div><br></div>This is an interesting entry because it does suggest a fix that's still there and helps with the issue, but only when the process that crashed was the one we were looking for. The function now looks like this:</div><div><br></div><div><div>del_child(Name, [Ch|Chs]) when Ch#child.name =:= Name, Ch#child.restart_type =:= temporary -></div><div> Chs;</div><div>del_child(Name, [Ch|Chs]) when Ch#child.name =:= Name -></div><div> [Ch#child{pid = undefined} | Chs];</div><div>del_child(Pid, [Ch|Chs]) when Ch#child.pid =:= Pid, Ch#child.restart_type =:= temporary -></div><div> Chs;</div><div>del_child(Pid, [Ch|Chs]) when Ch#child.pid =:= Pid -></div><div> [Ch#child{pid = undefined} | Chs];</div><div>del_child(Name, [Ch|Chs]) -></div><div> [Ch|del_child(Name, Chs)];</div><div>del_child(_, []) -></div><div> [].</div><div><br></div><div>The issue there is that we're still giving temporary children in the list of processes to be restarted if they were not the ones to crash (Name/Pid). Then the sequent calls to 'terminate_children' and 'do_terminate_children' as part of the restart procedure also never clear these up, and the 'start_children' function doesn't do it either.</div><div><br></div><div>In normal cases, it seems the supervisor will use state_del_child as a function to do it, which doesn't happen here (we're using del_child). </div><div>---</div><div><br></div><div>If anyone from the OTP team is looking, it'd be nice to have a confirm/deny on that explanation. I've tried a small patch, but haven't tested it in the large or recompiled anything:</div><div><br></div><div><div>737a738,740</div><div>> terminate_children([Child = #child{restart_type=temporary} | Children], SupName, Res) -></div><div>> do_terminate(Child, SupName),</div><div>> terminate_children(Children, SupName, Res);</div><div><br></div><div>Just adding these lines to R14B03 seems to resolve the issue by avoiding to restart the child in the first place.</div><div>I can make a standard patch and bug submission if required/appreciated.</div></div></div><div><div><br class="Apple-interchange-newline">--</div><div>Fred Hébert</div><div><div><a href="http://www.erlang-solutions.com">http://www.erlang-solutions.com</a></div></div></div></div><div><br></div></body></html>