[erlang-questions] Node not responding to init:stop()

Roger Lipscombe roger@REDACTED
Tue Jul 3 13:48:16 CEST 2018


Note that I can reproduce this consistently, so if there's any
experimentation (tracing, patches, etc.) you want me to try, I should
be able to do that.
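
For example, I could attach a remote shell just before the shutdown and
trace the logger shutdown path with dbg, roughly along these lines (the
exact trace points are only a guess on my part, so tell me if something
else would be more useful):

1> dbg:tracer().
2> dbg:p(logger_sup, [c, m]).
3> dbg:p(logger, [c, m]).
4> dbg:tpl(logger_server, call_h, x).
5> dbg:tpl(error_logger, stop, x).
6> init:stop().

That would trace calls and messages for the 'logger_sup' and 'logger'
processes, plus the handler-removal calls, and then trigger the hang.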

On 3 July 2018 at 12:45, Roger Lipscombe <roger@REDACTED> wrote:
> Before I go looking for permission to release the crash dump (it'll
> take at least a couple of hours; I'll ask), let's see what we can
> figure out from here.
>
> There are 44 processes alive. Most of them look fairly innocuous, but,
> in terms of logger, I see:
>
> - A process <0.2130.0> sitting in erlang:apply, issuing
> error_logger:stop, which is a gen_server:call, timeout infinity, to
> <0.927.0>. Stack is logger_server:'-call_h_async/4-fun-0-'/2 + 152 ->
> logger_server:call_h/4 + 96 -> error_logger:removing_handler/1 + 112
> -> error_logger:stop/0 + 72
> - <0.927.0> is 'logger_sup'; it appears to be trying to call
> <0.890.0>, which is 'logger'.
> - <0.890.0> is, apparently, sitting in proc_lib:init_p_do_apply/3
>
> I suspect that logger_sup has hit an error in supervisor:do_terminate and
> is attempting to report it via logger, which is problematic, I guess,
> given that logger itself is being shut down at that point. I see the
> following on the logger_sup stack:
>
> #{level => error,
>   meta =>
>       #{domain => [otp,sasl],
>         error_logger => #{tag => error_report,type => supervisor_report},
>         file => "supervisor.erl",
>         gl => <0.894.0>,
>         line => 838,
>         logger_formatter => #{title => "SUPERVISOR REPORT"},
>         mfa => {supervisor,do_terminate,2},
>         pid => <0.927.0>,
>         report_cb => fun logger:format_otp_report/1,
>         time => 1530616282326266},
>   msg =>
>       {report,#{label => {supervisor,shutdown_error},
>                 report =>
>                     [{supervisor,{local,logger_sup}},
>                      {errorContext,shutdown_error},
>                      {reason,killed},
>                      {offender,[{pid,<0.929.0>},
>                                 {id,default},
>                                 {mfargs,{logger_std_h,start_link,undefined}},
>                                 {restart_type,temporary},
>                                 {shutdown,2000},
>                                 {child_type,worker}]}]}}}
>
> At this point, logger_sup has zero links, is monitoring 'logger' and
> is monitored by kernel_sup and the process that's calling
> error_logger:stop.
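>
> (For what it's worth, if it would help, on a live hung node I could
> presumably confirm that chain with something like the following; the pids
> are just the ones from this particular dump and will differ between runs:
>
> 1> [erlang:process_info(P, [registered_name, current_stacktrace])
>        || P <- [pid(0,2130,0), pid(0,927,0), pid(0,890,0)]].
>
> I haven't done that yet; so far I've only looked at the dump.)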
>
> For a *running* node, I get:
>
> 1> supervisor:which_children(logger_sup).
> [{error_logger,<0.1023.0>,worker,dynamic},
>  {default,<0.929.0>,worker,[logger_std_h]}]
>
> Assuming that the low process IDs are relatively stable, the <0.929.0>
> for the default logger handler correlates with the offender in the error
> report above.
>
> I *believe* that lager's running independently. That is: I've done
> nothing deliberate to connect the two. I'm using lager 3.6.3.
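>
> (To double-check that, I suppose I could run something like this on a
> running node and look for anything lager-related among the handler ids;
> I haven't actually done that yet:
>
> 1> logger:get_handler_ids().
>
> If lager had installed itself as a logger handler, I'd expect it to show
> up there.)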
>
> Thanks,
> Roger.
>
> On 3 July 2018 at 11:32, Siri Hansen <erlangsiri@REDACTED> wrote:
>> Hi Roger,
>>
>> there have been a few changes in the shutdown procedure (init.erl) with
>> the introduction of Logger in OTP-21. There are also some changes in the
>> supervision tree of the Kernel application related to Logger. I don't
>> know if this has anything to do with your problem, but it's certainly
>> possible.
>>
>> As you say, Kernel is the last application to be taken down (it's the last
>> element in the #state.running of application_controller), and the state you
>> see is probably stale in the sense that all applications listed before
>> kernel in the 'running' list should already have been terminated. From the
>> crash dump - can you see how many processes are still alive? Are they all
>> part of kernel, or are there more applications that did not completely
>> terminate?
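>>
>> (If it's easier than reading the dump by hand, crashdump_viewer from the
>> observer application can list the processes that were alive, roughly:
>>
>> 1> crashdump_viewer:start().
>>
>> and then point it at the dump file, or use the cdv script that ships with
>> OTP. Whichever is easier for you.)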
>>
>> And if we are to look closer at the logger track: do you know which
>> logger handlers were active? Are there any children under the logger_sup
>> supervisor, and if so, what is their state? You mention lager above; do
>> you know how lager and logger are connected (is lager a handler under
>> logger, or does it run independently)?
>>
>> If you see anything suspicious related to logger, please feel free to send
>> me the crash dump off-list, and I'll see if I can figure out anything more.
>>
>> Regards
>> /siri
>>
>>
>>
>> On Mon, 2 Jul 2018 at 18:24, Roger Lipscombe
>> <roger@REDACTED> wrote:
>>>
>>> Right, so I've done some digging. I got a crash dump and by looking at
>>> the stack trace for application_controller, I can see that it's
>>> hanging in application_controller:terminate, presumably while waiting
>>> for one of the processes to stop. The pid given is <0.893.0>. Looking
>>> a bit further up the stack trace gets me the process state (possibly
>>> stale?), which contains a bunch of applications and pids. The pid it's
>>> hanging on belongs to 'kernel', which I'd expect to be the last (?)
>>> application to be shut down. The process state (assuming it's not
>>> stale) has a load of other applications in there with pids. There are
>>> a bunch of ours, but also others, such as runtime_tools, cowboy,
>>> amqp_client, hackney, lager, etc.
>>>
>>> Is any of this useful in finding out why it's hanging?
>>>
>>> On 2 July 2018 at 16:54, Roger Lipscombe <roger@REDACTED> wrote:
>>> > This sounds similar to
>>> > http://erlang.org/pipermail/erlang-questions/2012-December/071223.html,
>>> > but it's not quite the same, as far as I can tell.
>>> >
>>> > I've noticed that, at some point since upgrading from OTP-20.3 to
>>> > OTP-21.0 (along with the necessary dependency updates), my Erlang
>>> > nodes are no longer stopping at the end of our system test run.
>>> >
>>> > The nodes are orchestrated by having 'erlexec' run a bash script which
>>> > uses (effectively) 'foo/bin/foo foreground &'. I'm relying on erlexec
>>> > killing the bash script and that killing the nodes. This works fine
>>> > when the nodes are using OTP-20.3, but not with OTP-21.0.
>>> >
>>> > If I connect to the node, I can issue 'init:stop()', and it returns
>>> > ok, but nothing happens. If I use 'application:which_applications()',
>>> > I get a timeout.
>>> >
>>> > Unlike the linked discussion, I _can_ repeatedly connect a remote
>>> > shell (using erl -remsh), but I have to resort to erlang:halt() to
>>> > stop the node. Since my system test environment relies on orderly
>>> > process group teardown to stop the nodes, that's not useful.
>>> >
>>> > As I understand it, process group teardown results in SIGTERM, which
>>> > results in a call to init:stop, which should stop the node. It doesn't.
>>> >
>>> > How do I figure out why init:stop() isn't working?
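>>> >
>>> > My first thought is to attach a remote shell once it's hung and poke at
>>> > the shutdown machinery directly, along these lines (just a sketch of
>>> > where I'd start looking, not something I've tried yet):
>>> >
>>> > 1> erlang:process_info(whereis(init), current_function).
>>> > 2> erlang:process_info(whereis(application_controller),
>>> >        [current_stacktrace, message_queue_len]).
>>> >
>>> > Is there anything better to look at than that?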


