[erlang-questions] Supervisor child stuck in 'restarting' state

Danil Zagoskin z@REDACTED
Thu Oct 30 12:49:17 CET 2014


We upgraded to OTP 17.3 and still see the same issue:

12> lists:keyfind(5223, 3, element(4, sys:get_state(ejabberd_listeners))).
#child{pid = {restarting,<0.2096.2241>},
       name = 5223,
       mfargs = {ejabberd_listener,start,
                                   [5223,ejabberd_c2s,
                                    [...]]},
       restart_type = permanent,shutdown = brutal_kill,
       child_type = worker,
       modules = [ejabberd_listener]}
13> supervisor:try_again_restart(ejabberd_listeners, 5223).
ok
14> lists:keyfind(5223, 3, element(4, sys:get_state(ejabberd_listeners))).
#child{pid = <0.17553.2059>,name = 5223,
       mfargs = {ejabberd_listener,start,
                                   [5223,ejabberd_c2s,
                                    [...]]},
       restart_type = permanent,shutdown = brutal_kill,
       child_type = worker,
       modules = [ejabberd_listener]}

Any ideas on what to do when a supervisor does not supervise?
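
For now the only workaround I have is to kick stuck children myself. Below is
a rough sketch of a periodic check; the module and function names are just for
illustration, and it relies on the supervisor's internal state layout seen
above (children in element 4 of the state, pid and name in elements 2 and 3 of
the child tuple), which is OTP-internal and may change between releases:

-module(sup_kick).
-export([restart_stuck_children/1]).

%% Scan the supervisor's internal child list and ask it to retry every
%% child whose pid field is {restarting, _}.
restart_stuck_children(SupRef) ->
    Children = element(4, sys:get_state(SupRef)),
    [supervisor:try_again_restart(SupRef, element(3, Child))
     || Child <- Children,
        is_tuple(element(2, Child)),
        element(1, element(2, Child)) =:= restarting].

Running sup_kick:restart_stuck_children(ejabberd_listeners) from the shell (or
periodically) brings the listener back, but it obviously only papers over the
real problem.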

On Tue, Oct 14, 2014 at 12:26 AM, Danil Zagoskin <z@REDACTED> wrote:

> Bump?
>
> The combination of supervisor + timer, with fairly fresh code, fails to give
> us a fault-tolerant system, and the circumstances remain unclear.
>
> Was there a bug in erts that could cause message loss, or a timer callback
> failing to be called?
> Can user code that does not kill the timer server still affect a running
> supervisor in this way?
> Which part of the system should I watch more closely to investigate this
> problem?
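>
> If the timer server is the suspect, next time this happens I will also check
> its health from the shell. A small sketch (timer_server is the registered
> name the timer module uses and timer_tab its ETS table of pending one-shot
> timers; both are internals of that module, so treat the names as an
> assumption):
>
> %% Is the timer server alive and keeping up with its message queue?
> erlang:process_info(whereis(timer_server),
>                     [message_queue_len, current_function]).
> %% How many one-shot timers are still pending?
> ets:info(timer_tab, size).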
>
>
> On Wed, Oct 8, 2014 at 12:52 AM, Danil Zagoskin <z@REDACTED> wrote:
>
>> Hello!
>>
>> We have an application (well, it's some patched old ejabberd fork)
>> running on OTP R16B (no digits after "B").
>>
>> On one of our clusters a strange supervisor problem sometimes appears:
>> a child does not restart after a crash (one_for_one strategy):
>>
>> Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:16:16]
>> [async-threads:10] [hipe] [kernel-poll:false]
>> (ejabberd@REDACTED)1> f(State), f(Children), State = hd([S || {data, Data}
>> <- lists:nth(5, element(4, sys:get_status(ejabberd_listeners))), {"State",
>> S} <- Data]), Children = element(4, State), lists:keyfind(5223, 3,
>> Children).
>> {child,{restarting,<0.17921.2571>},
>>        5223,
>>        {ejabberd_listener,start,
>>                           [5223,ejabberd_c2s,
>>                            [{access,c2s},
>>                             {max_stanza_size,262144},
>>                             {sasl_mechs,[]},
>>                             {non_sasl_meths,[]},
>>                             zlib,tls, ....]]},
>>        permanent,brutal_kill,worker,
>>        [ejabberd_listener]}
>>
>> Inspecting the supervisor code gave us a way to actually restart the child:
>> supervisor:try_again_restart/2 works well when called from the shell.
>> As the supervisor:restart/2 code shows, try_again_restart is scheduled via
>> timer:apply_after after a failed start attempt (exactly where the
>> 'restarting' tag appears). So it looks like a lost event in the timer server.
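>>
>> A quick way to test the "lost event" theory from the shell is to check that
>> apply_after callbacks fire at all on the node, e.g.:
>>
>> %% Ask the timer server to send us a message in 1 second and wait for it.
>> timer:apply_after(1000, erlang, send, [self(), timer_ok]),
>> receive timer_ok -> timer_works after 5000 -> timer_lost end.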
>>
>> Git shows no changes in this part of the supervisor code from R16B to the
>> current maint branch.
>> There are no significant changes in the timer module either.
>>
>> It also seems quite strange to me that other clusters (with other modules
>> loaded but with a similar config for this supervisor) do not suffer from
>> this problem.
>>
>>
>> How can I avoid such a problem, or how can I get more information about it?
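>>
>> One thing I may try for more information is tracing the retry path with dbg.
>> Just a sketch: [c, m] on the supervisor shows its calls (including
>> timer:apply_after) and its incoming messages, so the {try_again_restart, _}
>> cast should show up if the timer ever fires.
>>
>> dbg:tracer(),
>> dbg:p(whereis(ejabberd_listeners), [c, m]),
>> dbg:tpl(timer, apply_after, x).
>> %% dbg:stop_clear() when done.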
>>
>> --
>> Danil Zagoskin | z@REDACTED
>>
>
>
>
> --
> Danil Zagoskin | z@REDACTED
>



-- 
Danil Zagoskin | z@REDACTED

