[erlang-questions] Incorrect supervisor behaviour and crash when mixing temporary and permanent children

Sam Bobroff sam@REDACTED
Wed Aug 3 03:13:03 CEST 2011


Hello Erlangers,

I've recently tracked down a bug in some code that seems to be caused by a
problem with the supervisor module. What seems to be happening is that when
a temporary child has been added to a one_for_all supervisor using
start_child, and a permanent child of that supervisor then exits, the
supervisor attempts to restart the temporary child (which seems counter to
the documentation). In addition, the argument list in the child's MFA has
been replaced with undefined, so the start attempt crashes, causing the
supervisor to keep attempting restarts until it reaches its restart
intensity and shuts down.

This does not happen if the temporary child is added to the supervisor as
part of the supervisor's initial spec. In that case the temporary child is
still restarted, but its MFA is intact so the restart succeeds. It still
seems contrary to the documentation that the temporary child is restarted
at all, although that depends on which section of the documentation you
regard as more important (from the supervisor documentation):

one_for_all - if one child process terminates and should be restarted, all
other child processes are terminated and then all child processes are
restarted.

This is quite clear: "all child processes are restarted". But later:

Restart defines when a terminated child process should be restarted. A
permanent child process should always be restarted, a temporary child
process should never be restarted and a transient child process should be
restarted only if it terminates abnormally, i.e. with another exit reason
than normal.

Again clear but conflicting: "a temporary child process should never be
restarted".

I found a similar issue mentioned in a bug report for R14B02, here:

http://erlang.org/pipermail/erlang-bugs/2011-March/002273.html

But in that case it seemed necessary to call restart_child() which makes it
much less of a problem.

I'm testing on R14B03 and I've produced a small piece of code to replicate
the problem:

--- begin ---

-module(bug).
-behaviour(supervisor).
-export([test_one/0, test_two/0, spec/2, init/1, main/1]).

%% bar is added dynamically with start_child/2, then foo is told to exit.
test_one() ->
    application:start(sasl),
    supervisor:start_link({local, sup}, ?MODULE, [spec(foo, permanent)]),
    supervisor:start_child(sup, spec(bar, temporary)),
    foo ! die.

%% bar is part of the initial child specs, then foo is told to exit.
test_two() ->
    application:start(sasl),
    supervisor:start_link({local, sup}, ?MODULE, [spec(foo, permanent),
                                                  spec(bar, temporary)]),
    foo ! die.

%% Child spec for a worker registered as Name with restart type Type.
spec(Name, Type) ->
    {Name,
     {proc_lib, start_link, [?MODULE, main, [Name]]},
     Type,
     3000,
     worker,
     [bug]
    }.

init(Children) ->
    {ok, {{one_for_all, 3, 10000}, Children}}.

%% Child body: register, acknowledge startup, exit normally on 'die'.
main(Name) ->
    register(Name, self()),
    proc_lib:init_ack({ok, self()}),
    receive
        die -> ok
    end.

--- end ---

Running test_one from the shell produces this output on my system:

$ erl
Erlang R14B03 (erts-5.8.4) [source] [smp:4:4] [rq:4] [async-threads:0]
[hipe] [kernel-poll:false]

Eshell V5.8.4  (abort with ^G)
1> bug:test_one().
** exception exit: shutdown
2>

[snip SASL startup]

=PROGRESS REPORT==== 3-Aug-2011::11:06:40 ===
          supervisor: {local,sup}
             started: [{pid,<0.44.0>},
                       {name,foo},
                       {mfargs,{proc_lib,start_link,[bug,main,[foo]]}},
                       {restart_type,permanent},
                       {shutdown,3000},
                       {child_type,worker}]

=PROGRESS REPORT==== 3-Aug-2011::11:06:40 ===
          supervisor: {local,sup}
             started: [{pid,<0.45.0>},
                       {name,bar},
                       {mfargs,{proc_lib,start_link,[bug,main,[bar]]}},
                       {restart_type,temporary},
                       {shutdown,3000},
                       {child_type,worker}]

=SUPERVISOR REPORT==== 3-Aug-2011::11:06:40 ===
     Supervisor: {local,sup}
     Context:    child_terminated
     Reason:     normal
     Offender:   [{pid,<0.44.0>},
                  {name,foo},
                  {mfargs,{proc_lib,start_link,[bug,main,[foo]]}},
                  {restart_type,permanent},
                  {shutdown,3000},
                  {child_type,worker}]

=PROGRESS REPORT==== 3-Aug-2011::11:06:40 ===
          supervisor: {local,sup}
             started: [{pid,<0.46.0>},
                       {name,foo},
                       {mfargs,{proc_lib,start_link,[bug,main,[foo]]}},
                       {restart_type,permanent},
                       {shutdown,3000},
                       {child_type,worker}]

=SUPERVISOR REPORT==== 3-Aug-2011::11:06:40 ===
     Supervisor: {local,sup}
     Context:    start_error
     Reason:     {'EXIT',
                     {badarg,
                         [{erlang,apply,[proc_lib,start_link,undefined]},
                          {supervisor,do_start_child,2},
                          {supervisor,start_children,3},
                          {supervisor,restart,3},
                          {supervisor,handle_info,2},
                          {gen_server,handle_msg,5},
                          {proc_lib,init_p_do_apply,3}]}}
     Offender:   [{pid,undefined},
                  {name,bar},
                  {mfargs,{proc_lib,start_link,undefined}},
                  {restart_type,temporary},
                  {shutdown,3000},
                  {child_type,worker}]

[supervisor loops until it shuts down]

Running test_two() shows it restarting the temporary child:

2> bug:test_two().
die

=PROGRESS REPORT==== 3-Aug-2011::11:09:23 ===
          supervisor: {local,sup}
             started: [{pid,<0.52.0>},
                       {name,foo},
                       {mfargs,{proc_lib,start_link,[bug,main,[foo]]}},
                       {restart_type,permanent},
                       {shutdown,3000},
                       {child_type,worker}]
3>
=PROGRESS REPORT==== 3-Aug-2011::11:09:23 ===
          supervisor: {local,sup}
             started: [{pid,<0.53.0>},
                       {name,bar},
                       {mfargs,{proc_lib,start_link,[bug,main,[bar]]}},
                       {restart_type,temporary},
                       {shutdown,3000},
                       {child_type,worker}]

=SUPERVISOR REPORT==== 3-Aug-2011::11:09:23 ===
     Supervisor: {local,sup}
     Context:    child_terminated
     Reason:     normal
     Offender:   [{pid,<0.52.0>},
                  {name,foo},
                  {mfargs,{proc_lib,start_link,[bug,main,[foo]]}},
                  {restart_type,permanent},
                  {shutdown,3000},
                  {child_type,worker}]


=PROGRESS REPORT==== 3-Aug-2011::11:09:23 ===
          supervisor: {local,sup}
             started: [{pid,<0.54.0>},
                       {name,foo},
                       {mfargs,{proc_lib,start_link,[bug,main,[foo]]}},
                       {restart_type,permanent},
                       {shutdown,3000},
                       {child_type,worker}]

=PROGRESS REPORT==== 3-Aug-2011::11:09:23 ===
          supervisor: {local,sup}
             started: [{pid,<0.55.0>},
                       {name,bar},
                       {mfargs,{proc_lib,start_link,[bug,main,[bar]]}},
                       {restart_type,temporary},
                       {shutdown,3000},
                       {child_type,worker}]


Am I doing something wrong or is this actually a bug (or bugs)?

Peace,
Sam.