[erlang-questions] supervisor process not responding to messages ('EXIT', which_children, etc)

Thu Apr 29 05:54:37 CEST 2010

Thanks a lot for your help working through this, Scott.

First off, I need to move any work that could take more than a few ms
out of the init/1 function.  This makes sense for the initial startup as
well as the restart case.

Would the best way be to have init/1 send a message to itself and do the
long-running work in the handler: handle_cast via gen_server:cast, or
handle_event via gen_fsm:send_all_state_event, depending on worker type?
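
A minimal sketch of the gen_server variant, to make the question concrete.
The module name slow_worker, the is_ready/1 helper, and slow_setup/0 are
hypothetical stand-ins, not code from my application.  init/1 casts to
self() and returns immediately; the slow work then runs in handle_cast/2.
Because the process is not registered here, nothing else can know its pid
before init/1 returns, so the cast is the first message it handles.

```erlang
-module(slow_worker).
-behaviour(gen_server).

-export([start_link/0, is_ready/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

-record(state, {ready = false :: boolean()}).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Hypothetical helper: reports whether the deferred work has finished.
is_ready(Pid) ->
    gen_server:call(Pid, is_ready).

init([]) ->
    %% Queue the slow work as a message to ourselves and return at once,
    %% so the supervisor is not blocked while it runs.
    gen_server:cast(self(), finish_init),
    {ok, #state{}}.

handle_call(is_ready, _From, State) ->
    {reply, State#state.ready, State}.

handle_cast(finish_init, State) ->
    ok = slow_setup(),
    {noreply, State#state{ready = true}}.

handle_info(_Msg, State) ->
    {noreply, State}.

terminate(_Reason, _State) ->
    ok.

code_change(_OldVsn, State, _Extra) ->
    {ok, State}.

%% Stand-in for the real long-running work (assumed to take ~200 ms).
slow_setup() ->
    timer:sleep(200),
    ok.
```

The gen_fsm case should look the same in shape: call
gen_fsm:send_all_state_event(self(), finish_init) from init/1 and do the
work in handle_event/3.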

Second, the initial start order and synchronous start make sense.  I am
relying on this behavior myself.

Third, I'm not sure I understand the restart case.  If I were using the
rest_for_one restart strategy, I would expect all children after the
failed child to be killed and restarted synchronously and in order.
However, I am using the one_for_one restart strategy, so I expect the
supervisor to restart only the failed child without restarting any other
children (but still synchronously).

An example of my structure, for the sake of discussion, which seems like a
common pattern:

                                          app_supervisor

          child_sup_1          child_sup_2               child_sup_3

    worker1   worker2       w1   w2   w3              w1   w2   w3
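
In code, the top two levels of that tree might look roughly like the
sketch below.  This is illustrative only: one module plays both the
app_supervisor and child supervisor roles so the example is
self-contained, the restart limit of 3 restarts in 10 seconds is an
assumed value, and the worker child specs are left out.

```erlang
-module(app_supervisor).
-behaviour(supervisor).

-export([start_link/0, start_child_sup/1]).
-export([init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, top).

%% Hypothetical entry point for the second-level supervisors.
start_child_sup(Name) ->
    supervisor:start_link({local, Name}, ?MODULE, {child, Name}).

init(top) ->
    %% one_for_one: only the child that died is restarted.
    %% {3, 10} (max 3 restarts in 10 seconds) is an assumed limit.
    Names = [child_sup_1, child_sup_2, child_sup_3],
    {ok, {{one_for_one, 3, 10}, [child_sup_spec(N) || N <- Names]}};
init({child, _Name}) ->
    %% Worker child specs omitted in this sketch.
    {ok, {{one_for_one, 3, 10}, []}}.

%% Children that are supervisors use an 'infinity' shutdown value.
child_sup_spec(Name) ->
    {Name, {?MODULE, start_child_sup, [Name]},
     permanent, infinity, supervisor, [?MODULE]}.
```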

Say enough workers under child_sup_2 die in a short time to exceed the
restart limit.  child_sup_2 then exits as expected.  app_supervisor then
restarts child_sup_2 as expected.  child_sup_2 takes too long to restart,
so app_supervisor kills it during init, also terminating any workers that
had started.

At this point, what should happen?  Without digging into documentation
right now, I would expect either app_supervisor to immediately exit, or
to continue trying to restart child_sup_2 until it succeeds or reaches the
max restart count and exits.

What I have observed is that app_supervisor is deadlocked in
proc_lib:sync_wait/2.  It no longer responds to any messages: 'EXIT'
signals from other children, which_children messages from
supervisor:which_children, etc.  I am pretty sure that this is not
intended behavior...
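
For anyone wanting to check for the same state, here is a small sketch of
the kind of probe I mean.  The module name sup_probe and the function
probe/1 are made up for illustration; the only real call is
erlang:process_info/2, which works on local processes only.  For a
supervisor stuck as described above, I would expect current_function to
be {proc_lib, sync_wait, 2} and message_queue_len to keep growing.

```erlang
-module(sup_probe).
-export([probe/1]).

%% Hypothetical helper: report where a local process is currently
%% executing and how many messages are waiting in its mailbox.
probe(Name) when is_atom(Name) ->
    case whereis(Name) of
        undefined -> {error, not_registered};
        Pid -> probe(Pid)
    end;
probe(Pid) when is_pid(Pid) ->
    case erlang:process_info(Pid, [current_function, message_queue_len]) of
        undefined -> {error, dead};
        Info -> {ok, Info}
    end.
```

Calling sup_probe:probe(app_supervisor) from the shell a few times in a
row should show whether the queue length is climbing.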

To summarize, I can fix the problem by moving any long-running tasks out
of init/1, and should do this regardless, but I would expect OTP to do something
other than hang.  Thoughts?

-Garret Smith

On Wed, Apr 28, 2010 at 11:12 AM, Scott Lystig Fritchie
<fritchie@REDACTED> wrote:
> gs> The results of 'erlang:process_info(Pid, backtrace)' below as you
> gs> suggested.  It seems that the supervisor was trying to restart a
> gs> child, the child took too long to start so it was killed, but then
> gs> the supervisor hung.  At this point, I can have the child start
> gs> faster, but why is the supervisor hung?
>
> The supervisor's behavior must be deterministic, so it starts children
> synchronously.  (More on that in a little bit.)
>
> From the supervisor:start_link() manual:
>
>    The created supervisor process calls Module:init/1 to find out about
>    restart strategy, maximum restart frequency and child processes. To
>    ensure a synchronized start-up procedure, start_link/2,3 does not
>    return until Module:init/1 has returned and all child processes have
>    been started.
>
> You have to read between the lines to see that the above paragraph
> applies to you.  A child's init func is handled synchronously.  During
> the supervisor's start, Module:init/1 won't return until all the
> children are started.  All restart strategies require that children be
> started in the order that they are specified.
>
> App developers rely on this child start order to preserve
> inter-process/service dependencies.  If child processes were started in
> random order, application dependencies could be broken, and the app can
> run incorrectly or, perhaps worse yet, even fail to start at all.
>
> In your case, when a single worker has died and requires restarting, the
> supervisor is using the same synchronous method of restarting the
> child.  If a child can't start in a predictable (and hopefully very
> short) amount of time, then the variable-time work needs to be done
> after the child's init function returns.  The strategies mentioned a few
> days ago in the "Subject: testing asynchronous code" thread can be very
> useful.
>
> -Scott
>