[erlang-questions] Supervisor child stuck in 'restarting' state

Mon Oct 13 22:26:24 CEST 2014

Bump?

The combination of supervisor and timer, even with fairly recent code, fails
to provide fault tolerance under circumstances that remain unclear.

Was there a bug in erts that caused message loss or kept a timer callback
from being called?
Can user code that does not kill the timer server affect a running
supervisor in this way?
Which part of the system should I watch more carefully to investigate the
problem?
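
For reference, here is a minimal sketch of the checks I could run on the
affected node. The module name restart_probe and its function names are
made up for illustration. timer_server is the name the stdlib timer module
registers, and supervisor:try_again_restart/2 is the undocumented export
mentioned in the quoted message. Whether any of this explains the lost
event is only an assumption.

```erlang
-module(restart_probe).
-export([timer_server_info/0, stuck_children/1, retry/2]).

%% Report whether the timer server is registered and how busy it looks.
%% Returns not_running, undefined (process died in between) or a proplist.
timer_server_info() ->
    case whereis(timer_server) of
        undefined ->
            not_running;
        Pid ->
            erlang:process_info(Pid, [message_queue_len,
                                      current_function,
                                      status])
    end.

%% List the ids of children the supervisor reports as 'restarting'.
stuck_children(Sup) ->
    [Id || {Id, restarting, _Type, _Mods} <- supervisor:which_children(Sup)].

%% Manual workaround: ask the supervisor to retry the restart itself,
%% bypassing the timer. Uses an undocumented supervisor export.
retry(Sup, Id) ->
    supervisor:try_again_restart(Sup, Id).
```

On our node this would be called as
restart_probe:stuck_children(ejabberd_listeners) followed by
restart_probe:retry(ejabberd_listeners, 5223).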

On Wed, Oct 8, 2014 at 12:52 AM, Danil Zagoskin <z@REDACTED> wrote:

> Hello!
>
> We have an application (well, it's some patched old ejabberd fork) running
> on OTP R16B (no digits after "B").
>
> On one of our clusters, a strange supervisor problem sometimes appears:
> a child is not restarted after a crash (one_for_one strategy).
>
> Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:16:16] [async-threads:10]
> [hipe] [kernel-poll:false]
> (ejabberd@REDACTED)1> f(State), f(Children), State = hd([S || {data, Data}
> <- lists:nth(5, element(4, sys:get_status(ejabberd_listeners))), {"State",
> S} <- Data]), Children = element(4, State), lists:keyfind(5223, 3,
> Children).
> {child,{restarting,<0.17921.2571>},
>        5223,
>        {ejabberd_listener,start,
>                           [5223,ejabberd_c2s,
>                            [{access,c2s},
>                             {max_stanza_size,262144},
>                             {sasl_mechs,[]},
>                             {non_sasl_meths,[]},
>                             zlib,tls, ....]]},
>        permanent,brutal_kill,worker,
>        [ejabberd_listener]}
>
> Inspecting the supervisor code revealed a way to actually restart the child:
> supervisor:try_again_restart/2 works well when called from the REPL.
> As the supervisor:restart/2 code shows, try_again_restart is scheduled
> using timer:apply_after after a failed start attempt (exactly where the
> 'restarting' tag appears). So it looks like a lost event in the timer server.
>
> Git shows no changes in this part of the supervisor code between R16B and
> the current maint branch.
> There are no significant changes in the timer module either.
>
> It is also quite strange to me that other clusters (with other mods loaded
> but with a similar config for this supervisor) do not suffer from this problem.
>
>
> How can I avoid this problem, or how can I get more information about it?
>
> --
> Danil Zagoskin | z@REDACTED
>

-- 
Danil Zagoskin | z@REDACTED