[erlang-questions] Node not responding to init:stop()

Roger Lipscombe roger@REDACTED
Tue Jul 3 13:45:30 CEST 2018


Before I go looking for permission to release the crash dump (it'll
take at least a couple of hours; I'll ask), let's see what we can
figure out from here.

There are 44 processes alive. Most of them look fairly innocuous, but,
in terms of logger, I see:

- A process <0.2130.0> sitting in erlang:apply, issuing
error_logger:stop, which is a gen_server:call, timeout infinity, to
<0.927.0>. Stack is logger_server:'-call_h_async/4-fun-0-'/2 + 152 ->
logger_server:call_h/4 + 96 -> error_logger:removing_handler/1 + 112
-> error_logger:stop/0 + 72
- <0.927.0> is 'logger_sup'; it appears to be trying to call
<0.890.0>, which is 'logger'.
- <0.890.0> is, apparently, sitting in proc_lib:init_p_do_apply/3

I suspect that logger_sup has failed in supervisor:do_terminate, and
is attempting to report an error, via logger (which is problematic, I
guess). I see the following on the logger_sup stack:

#{level => error,
  meta =>
      #{domain => [otp,sasl],
        error_logger => #{tag => error_report,type => supervisor_report},
        file => "supervisor.erl",
        gl => <0.894.0>,
        line => 838,
        logger_formatter => #{title => "SUPERVISOR REPORT"},
        mfa => {supervisor,do_terminate,2},
        pid => <0.927.0>,
        report_cb => fun logger:format_otp_report/1,time => 1530616282326266},
  msg =>
      {report,#{label => {supervisor,shutdown_error},
                report =>
                    [{supervisor,{local,logger_sup}},
                     {errorContext,shutdown_error},
                     {reason,killed},
                     {offender,[{pid,<0.929.0>},
                                {id,default},
                                {mfargs,{logger_std_h,start_link,undefined}},
                                {restart_type,temporary},
                                {shutdown,2000},
                                {child_type,worker}]}]}}}

At this point, logger_sup has zero links, is monitoring 'logger' and
is monitored by kernel_sup and the process that's calling
error_logger:stop.

For a *running* node, I get:

1> supervisor:which_children(logger_sup).
[{error_logger,<0.1023.0>,worker,dynamic},
 {default,<0.929.0>,worker,[logger_std_h]}]

Assuming that the low process IDs are relatively stable, that
<0.929.0> for the default logger correlates with the offender in the
?reportError above.

I *believe* that lager's running independently. That is: I've done
nothing deliberate to connect the two. I'm using lager 3.6.3.

Thanks,
Roger.

On 3 July 2018 at 11:32, Siri Hansen <erlangsiri@REDACTED> wrote:
> Hi Roger,
>
> there's been a few changes in the shutdown procedure (init.erl) with the
> introduction of Logger in OTP-21. There are also some changes in the
> supervision tree of the Kernel application related to Logger. I don't know
> if this has anything to do with your problem, but the chances exist, of
> course.
>
> As you say, Kernel is the last application to be taken down (it's the last
> element in the #state.running of application_controller), and the state you
> see is probably stale in the sense that all applications listed before
> kernel in the 'running' list should already have been terminated. From the
> crash dump - can you see how many processes are still alive? Are they all
> part of kernel, or are there more applications that did not completely
> terminate?
>
> And if we are to look closer at the logger track - do you know which logger
> handlers were active? Are there any children under the logger_sup
> supervisor, and if so, what are their state? You mention lager above, do you
> know how lager and logger are connected (is lager a handler under logger, or
> does it run independently)?
>
> If you see anything suspicious related to logger, please feel free to send
> me the crash dump off-list, and I'll see if I can figure out anything more.
>
> Regards
> /siri
>
>
>
> Den man. 2. jul. 2018 kl. 18:24 skrev Roger Lipscombe
> <roger@REDACTED>:
>>
>> Right, so I've done some digging. I got a crash dump and by looking at
>> the stack trace for application_controller, I can see that it's
>> hanging in application_controller:terminate, presumably while waiting
>> for one of the processes to stop. The pid given is <0.893.0>. Looking
>> a bit further up the stack trace gets me the process state (possibly
>> stale?), which contains a bunch of applications and pids. The pid it's
>> hanging on is 'kernel', which I'd expect to be the last (?)
>> application to be shut down. The process state (assuming it's not
>> stale) has a load of other applications in there with pids. There are
>> a bunch of ours, but also others, such as runtime_tools, cowboy,
>> amqp_client, hackney, lager, etc.
>>
>> Is any of this useful in finding out why it's hanging?
>>
>> On 2 July 2018 at 16:54, Roger Lipscombe <roger@REDACTED> wrote:
>> > This sounds similar to
>> > http://erlang.org/pipermail/erlang-questions/2012-December/071223.html,
>> > but it's not quite the same, as far as I can tell.
>> >
>> > I've noticed that, at some point since upgrading from OTP-20.3 to
>> > OTP-21.0 (along with the necessary dependency updates), my Erlang
>> > nodes are no longer stopping at the end of our system test run.
>> >
>> > The nodes are orchestrated by having 'erlexec' run a bash script which
>> > uses (effectively) 'foo/bin/foo foreground &'. I'm relying on erlexec
>> > killing the bash script and that killing the nodes. This works fine
>> > when the nodes are using OTP-20.3, but not with OTP-21.0.
>> >
>> > If I connect to the node, I can issue 'init:stop()', and it returns
>> > ok, but nothing happens. If I use 'application:which_applications()',
>> > I get a timeout.
>> >
>> > Unlike the linked discussion, I _can_ repeatedly connect a remote
>> > shell (using erl -remsh), but I have to resort to erlang:halt() to
>> > stop the node. Since my system test environment relies on orderly
>> > process group teardown to stop the nodes, that's not useful.
>> >
>> > As I understand it, process group teardown results in SIGTERM, which
>> > results in a call to init:stop, which should stop the node. It isn't.
>> >
>> > How do I figure out why init:stop() isn't working?
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions



More information about the erlang-questions mailing list