[erlang-questions] Supervisor child stuck in 'restarting' state

Danil Zagoskin z@REDACTED
Mon Oct 13 22:26:24 CEST 2014


The combination of supervisor and timer, both running fairly recent code,
fails to provide a fault-tolerant system under circumstances that are still unclear.

Was there a bug in erts causing message loss or the timer callback failing to be
called? Can user code that does not kill the timer server affect a running
supervisor in this way?
Which part of the system should I watch more carefully to investigate the problem?
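
For anyone who wants to look at the same things on a live node, here is a
minimal sketch of the checks I have in mind. The module name sup_diag and both
function names are made up for illustration; it only relies on
supervisor:which_children/1, whereis/1 and erlang:process_info/2, and it assumes
that which_children/1 reports a child in this state with the atom 'restarting'
and that the timer server is registered as timer_server.

```erlang
-module(sup_diag).
-export([restarting_children/1, timer_server_info/0]).

%% Hypothetical helper: ids of the children that the supervisor
%% currently reports as 'restarting' instead of a pid or 'undefined'.
restarting_children(Sup) ->
    [Id || {Id, restarting, _Type, _Mods} <- supervisor:which_children(Sup)].

%% Hypothetical helper: basic health data for the timer server.
%% Returns 'undefined' if the server is not registered or has just died.
timer_server_info() ->
    case whereis(timer_server) of
        undefined ->
            undefined;
        Pid ->
            erlang:process_info(Pid, [message_queue_len,
                                      current_function,
                                      status])
    end.
```

Calling sup_diag:restarting_children(ejabberd_listeners) should list the stuck
listener ports, and sup_diag:timer_server_info() shows whether the timer server
is alive and whether its message queue is growing.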

On Wed, Oct 8, 2014 at 12:52 AM, Danil Zagoskin <z@REDACTED> wrote:

> Hello!
> We have an application (well, it's a patched fork of an old ejabberd) running
> on OTP R16B (no digits after "B").
> On one of our clusters a strange supervisor problem sometimes appears:
> a child is not restarted after a crash (one_for_one strategy):
> Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:16:16] [async-threads:10]
> [hipe] [kernel-poll:false]
> (ejabberd@REDACTED)1> f(State), f(Children), State = hd([S || {data, Data}
> <- lists:nth(5, element(4, sys:get_status(ejabberd_listeners))), {"State",
> S} <- Data]), Children = element(4, State), lists:keyfind(5223, 3,
> Children).
> {child,{restarting,<0.17921.2571>},
>        5223,
>        {ejabberd_listener,start,
>                           [5223,ejabberd_c2s,
>                            [{access,c2s},
>                             {max_stanza_size,262144},
>                             {sasl_mechs,[]},
>                             {non_sasl_meths,[]},
>                             zlib,tls, ....]]},
>        permanent,brutal_kill,worker,
>        [ejabberd_listener]}
> Inspecting the supervisor code revealed a way to actually restart the child:
> supervisor:try_again_restart/2 works well when called from the REPL.
> According to the supervisor:restart/2 code, try_again_restart is scheduled
> using timer:apply_after after a failed start attempt (exactly where the
> 'restarting' tag appears). So it looks like a lost event in the timer server.
> Git shows no changes in this part of the supervisor code between R16B and the
> current maint branch.
> There are no significant changes in the timer module either.
> It is also quite strange to me that other clusters (with other modules loaded
> but with a similar config for this supervisor) do not suffer from this problem.
> How do I avoid this problem, or how can I get more information about it?
> --
> Danil Zagoskin | z@REDACTED

Danil Zagoskin | z@REDACTED