[erlang-questions] Supervisor child stuck in 'restarting' state

Tue Oct 7 22:52:42 CEST 2014

Hello!

We have an application (well, it's some patched old ejabberd fork) running
on OTP R16B (no digits after "B").

On one of our clusters a strange problem with a supervisor sometimes
appears. A child is not restarted after a crash (one_for_one strategy):

Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:16:16] [async-threads:10]
[hipe] [kernel-poll:false]
(ejabberd@REDACTED)1> f(State), f(Children), State = hd([S || {data, Data} <-
lists:nth(5, element(4, sys:get_status(ejabberd_listeners))), {"State", S}
<- Data]), Children = element(4, State), lists:keyfind(5223, 3, Children).
{child,{restarting,<0.17921.2571>},
       5223,
       {ejabberd_listener,start,
                          [5223,ejabberd_c2s,
                           [{access,c2s},
                            {max_stanza_size,262144},
                            {sasl_mechs,[]},
                            {non_sasl_meths,[]},
                            zlib,tls, ....]]},
       permanent,brutal_kill,worker,
       [ejabberd_listener]}
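
A shorter way to spot such children might be the documented
supervisor:which_children/1, which reports the atom 'restarting' in place
of the pid for a child that is waiting to be restarted. This is only a
sketch: the module name sup_debug and the function name are hypothetical,
and SupRef would be ejabberd_listeners in our case.

```erlang
-module(sup_debug).
-export([restarting_children/1]).

%% Return the ids of all children that the supervisor currently
%% reports as 'restarting' (i.e. a restart attempt is pending).
%% SupRef is any supervisor reference, e.g. a registered name.
restarting_children(SupRef) ->
    [Id || {Id, restarting, _Type, _Modules}
               <- supervisor:which_children(SupRef)].
```

Calling sup_debug:restarting_children(ejabberd_listeners) on the affected
node should then include 5223 while the child is stuck.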

Inspecting the supervisor code revealed a way to actually restart the child:
supervisor:try_again_restart/2 works well when called from the REPL.
According to the supervisor:restart/2 code, try_again_restart is scheduled
using timer:apply_after after a failed start attempt (exactly where the
'restarting' tag appears). So it looks like an event lost in the timer server.
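
For reference, the manual workaround and a couple of checks on the timer
server could look like the shell expressions below. This is a sketch, not
a verified diagnosis: it assumes the timer server is registered as
timer_server and keeps its pending timers in the named ETS table
timer_tab, as the timer module in this release appears to do. If that
table does not exist, ets:tab2list/1 fails with badarg.

```erlang
%% Is the timer server registered and alive? Returns a pid or 'undefined'.
whereis(timer_server).

%% Pending timers (ASSUMPTION: timer.erl stores them in the named,
%% readable ETS table 'timer_tab'). A scheduled try_again_restart
%% should show up here while the child is in the 'restarting' state.
ets:tab2list(timer_tab).

%% Manual workaround: ask the supervisor to retry the restart of the
%% child with id 5223. This is an asynchronous cast and returns 'ok'.
supervisor:try_again_restart(ejabberd_listeners, 5223).
```

An empty or missing entry for the stuck child in the second expression
would be consistent with the lost-timer theory, but it would not prove it.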

Git shows no changes in this part of the supervisor code between R16B and
the current maint branch.
There are no significant changes in the timer module either.

It is also quite strange to me that other clusters (with other mods loaded
but with a similar config for this supervisor) do not suffer from this problem.

How can I avoid this problem, or how can I get more information about it?

-- 
Danil Zagoskin | z@REDACTED