[erlang-questions] supervisor process not responding to messages ('EXIT', which_children, etc)
Thu Apr 29 05:54:37 CEST 2010
Thanks a lot for your help working through this Scott.
First off, I need to move any work that could take more than a few ms
out of the init/1 function. This makes sense for the initial startup
case as well as the restart case.
Would the best way be to move any long-running tasks to handle_cast
with a gen_server:cast, or handle_event with a gen_fsm:send_all_state_event
in init/1 depending on worker type?
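For the gen_server case, a minimal sketch of that pattern (my illustration, not code from this thread; the module and message names are made up):

```erlang
%% Sketch: defer slow work out of init/1 by casting a message to self,
%% so the supervisor that calls start_link is not blocked.
-module(slow_worker).
-behaviour(gen_server).
-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

init([]) ->
    %% Return immediately; the long-running setup happens in
    %% handle_cast/2, after the supervisor has been unblocked.
    gen_server:cast(self(), finish_init),
    {ok, not_ready}.

handle_cast(finish_init, not_ready) ->
    %% Do the expensive initialization work here.
    {noreply, ready};
handle_cast(_Msg, State) ->
    {noreply, State}.

handle_call(_Req, _From, State) -> {reply, State, State}.
handle_info(_Info, State) -> {noreply, State}.
terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.
```

Another common idiom for the same effect is returning {ok, State, 0} from init/1 and doing the work in handle_info(timeout, State).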
Second, the initial start order and synchronous start makes sense. I am
relying on this behavior myself.
Third, I'm not sure I understand the restart case. If I were using the
rest_for_one restart strategy, I would expect all children after the
failed child to be killed and
restarted synchronously and in order. However, I am using the one_for_one
restart strategy, so I expect the supervisor to restart only the
failed child without
restarting any other children (but still synchronously).
An example of my structure for conversation, which seems like a common pattern:

                   app_supervisor
               /         |         \
    child_sup_1    child_sup_2    child_sup_3
     /      \       /   |   \      /   |   \
 worker1 worker2   w1  w2   w3    w1  w2   w3
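For conversation's sake, the top level of that tree could be specified roughly like this (a sketch with made-up module names and illustrative restart limits, not code from my application):

```erlang
%% Sketch: app_supervisor with one_for_one, so a crashed child
%% supervisor is restarted on its own, without touching its siblings.
-module(app_supervisor).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    MaxRestarts = 3,   %% illustrative restart intensity
    MaxSeconds  = 10,  %% ...within this many seconds
    Children =
        [{child_sup_1, {child_sup_1, start_link, []},
          permanent, infinity, supervisor, [child_sup_1]},
         {child_sup_2, {child_sup_2, start_link, []},
          permanent, infinity, supervisor, [child_sup_2]},
         {child_sup_3, {child_sup_3, start_link, []},
          permanent, infinity, supervisor, [child_sup_3]}],
    {ok, {{one_for_one, MaxRestarts, MaxSeconds}, Children}}.
```

The shutdown value infinity is the conventional choice for children that are themselves supervisors, so they get unlimited time to shut down their own subtrees.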
Say enough workers under child_sup_2 die in a short time to exceed the
restart limit. child_sup_2 then exits as expected. app_supervisor then
restarts child_sup_2 as expected. child_sup_2 takes too long to restart,
so app_supervisor kills it during init, also terminating any workers that
child_sup_2 had already started.
At this point, what should happen? Without digging into documentation
right now, I would expect either app_supervisor to immediately exit, or
to continue trying to restart child_sup_2 until it succeeds or reaches the
max restart count and exits.
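A toy reproduction of the slow-start part of this scenario (my sketch, not the actual code) shows how the whole supervision chain blocks on one child's init/1:

```erlang
%% Sketch: a worker whose init/1 sleeps. Started under a supervisor,
%% supervisor:start_link/3 will not return until this init/1 returns,
%% because children are started synchronously and in order.
-module(slow_init_demo).
-behaviour(gen_server).
-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

init([]) ->
    timer:sleep(5000),   %% the supervisor (and its caller) wait here
    {ok, ready}.

handle_call(_Req, _From, State) -> {reply, State, State}.
handle_cast(_Msg, State) -> {noreply, State}.
```

With this child in a supervisor's child spec, the supervisor's own start_link stalls for the full five seconds, and the same synchronous mechanics apply when the child is being restarted after a crash.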
What I have observed is that app_supervisor is deadlocked.
It no longer responds to any messages: 'EXIT' signals from other children,
which_children messages from supervisor:which_children, etc. I am pretty sure
that this is not intended behavior...
To summarize, I can fix the problem by moving any long-running tasks out
of init/1, and should do this regardless, but I would expect OTP to do something
other than hang. Thoughts?
On Wed, Apr 28, 2010 at 11:12 AM, Scott Lystig Fritchie wrote:
> gs> The results of 'erlang:process_info(Pid, backtrace)' below as you
> gs> suggested. It seems that the supervisor was trying to restart a
> gs> child, the child took too long to start so it was killed, but then
> gs> the supervisor hung. At this point, I can have the child start
> gs> faster, but why is the supervisor hung?
> The supervisor's behavior must be deterministic, so it starts children
> synchronously. (More on that in a little bit.)
> From the supervisor:start_link() manual:
> The created supervisor process calls Module:init/1 to find out about
> restart strategy, maximum restart frequency and child processes. To
> ensure a synchronized start-up procedure, start_link/2,3 does not
> return until Module:init/1 has returned and all child processes have
> been started.
> You have to read between the lines to see that the above paragraph
> applies to you. A child's init func is handled synchronously. During
> the supervisor's start, Module:init/1 won't return until all the
> children are started. All restart strategies require that children be
> started in the order that they are specified.
> App developers rely on this child start order to preserve
> inter-process/service dependencies. If child processes were started in
> random order, application dependencies could be broken, and the app could
> run incorrectly or, perhaps worse yet, fail to start at all.
> In your case, when a single worker has died and requires restarting, the
> supervisor is using the same synchronous method of restarting the
> child. If a child can't start in a predictable (and hopefully very
> short) amount of time, then the variable-time work needs to be done
> after the child's init function returns. The strategies mentioned a few
> days ago in the "Subject: testing asynchronous code" thread can be very
> useful here.