[erlang-questions] Node not responding to init:stop()

Siri Hansen erlangsiri@REDACTED
Fri Jul 13 14:22:50 CEST 2018


For information: after some off-list troubleshooting, we found the cause of
this issue to be a deadlock between error_logger and the logger_sup
supervisor. A correction was merged to maint and master today.
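
For those curious, the deadlock had roughly this shape: each process was
blocked in a gen_server:call with infinity timeout towards the other, so
neither could ever serve the other's request. A minimal sketch (hypothetical
module and process names, not the actual OTP code):

    %% Minimal illustration of a mutual gen_server:call deadlock.
    -module(deadlock_sketch).
    -behaviour(gen_server).
    -export([start/0, init/1, handle_call/3, handle_cast/2]).

    start() ->
        {ok, A} = gen_server:start({local, a}, ?MODULE, [], []),
        {ok, B} = gen_server:start({local, b}, ?MODULE, [], []),
        gen_server:cast(A, {call_peer, b}),
        gen_server:cast(B, {call_peer, a}).

    init([]) -> {ok, []}.

    handle_cast({call_peer, Peer}, State) ->
        timer:sleep(100),  %% make sure both casts have been delivered
        gen_server:call(Peer, ping, infinity),  %% both block here, forever
        {noreply, State}.

    handle_call(ping, _From, State) ->
        {reply, pong, State}.  %% never reached once both are blocked
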
Regards
/siri


On Tue, 3 Jul 2018 at 13:48, Roger Lipscombe <roger@REDACTED> wrote:

> Note that I can reproduce this consistently, so if there's any
> experimentation (tracing, patches, etc.) you want me to try, I should
> be able to do that.
>
> On 3 July 2018 at 12:45, Roger Lipscombe <roger@REDACTED> wrote:
> > Before I go looking for permission to release the crash dump (it'll
> > take at least a couple of hours; I'll ask), let's see what we can
> > figure out from here.
> >
> > There are 44 processes alive. Most of them look fairly innocuous, but,
> > in terms of logger, I see:
> >
> > - A process <0.2130.0> sitting in erlang:apply, issuing
> > error_logger:stop, which is a gen_server:call, timeout infinity, to
> > <0.927.0>. Stack is logger_server:'-call_h_async/4-fun-0-'/2 + 152 ->
> > logger_server:call_h/4 + 96 -> error_logger:removing_handler/1 + 112
> > -> error_logger:stop/0 + 72
> > - <0.927.0> is 'logger_sup'; it appears to be trying to call
> > <0.890.0>, which is 'logger'.
> > - <0.890.0> is, apparently, sitting in proc_lib:init_p_do_apply/3
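> >
> > (For reference, on a live node roughly the same details can be pulled
> > with erlang:process_info/2, e.g. for logger_sup:
> >
> >     rp(erlang:process_info(whereis(logger_sup),
> >                            [current_stacktrace, links, monitors,
> >                             monitored_by, status, message_queue_len])).
> >
> > Here I'm reading them out of the crash dump instead.)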
> >
> > I suspect that logger_sup has failed in supervisor:do_terminate, and
> > is attempting to report an error, via logger (which is problematic, I
> > guess). I see the following on the logger_sup stack:
> >
> > #{level => error,
> >   meta =>
> >       #{domain => [otp,sasl],
> >         error_logger => #{tag => error_report,type => supervisor_report},
> >         file => "supervisor.erl",
> >         gl => <0.894.0>,
> >         line => 838,
> >         logger_formatter => #{title => "SUPERVISOR REPORT"},
> >         mfa => {supervisor,do_terminate,2},
> >         pid => <0.927.0>,
> >         report_cb => fun logger:format_otp_report/1,
> >         time => 1530616282326266},
> >   msg =>
> >       {report,#{label => {supervisor,shutdown_error},
> >                 report =>
> >                     [{supervisor,{local,logger_sup}},
> >                      {errorContext,shutdown_error},
> >                      {reason,killed},
> >                      {offender,[{pid,<0.929.0>},
> >                                 {id,default},
> >                                 {mfargs,{logger_std_h,start_link,undefined}},
> >                                 {restart_type,temporary},
> >                                 {shutdown,2000},
> >                                 {child_type,worker}]}]}}}
> >
> > At this point, logger_sup has zero links, is monitoring 'logger' and
> > is monitored by kernel_sup and the process that's calling
> > error_logger:stop.
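> >
> > For anyone following along: the observer application's crash dump
> > viewer will show these per-process details (stack, links, monitors)
> > graphically:
> >
> >     crashdump_viewer:start().  %% then open the dump file in the GUI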
> >
> > For a *running* node, I get:
> >
> > 1> supervisor:which_children(logger_sup).
> > [{error_logger,<0.1023.0>,worker,dynamic},
> >  {default,<0.929.0>,worker,[logger_std_h]}]
> >
> > Assuming that the low process IDs are relatively stable, the
> > <0.929.0> for the default logger handler correlates with the offender
> > in the error report above.
> >
> > I *believe* that lager's running independently. That is: I've done
> > nothing deliberate to connect the two. I'm using lager 3.6.3.
> >
> > Thanks,
> > Roger.
> >
> > On 3 July 2018 at 11:32, Siri Hansen <erlangsiri@REDACTED> wrote:
> >> Hi Roger,
> >>
> >> there have been a few changes in the shutdown procedure (init.erl) with
> >> the introduction of Logger in OTP-21. There are also some changes in the
> >> supervision tree of the Kernel application related to Logger. I don't
> >> know if this has anything to do with your problem, but it's possible, of
> >> course.
> >>
> >> As you say, Kernel is the last application to be taken down (it's the
> >> last element in the #state.running of application_controller), and the
> >> state you see is probably stale in the sense that all applications
> >> listed before kernel in the 'running' list should already have been
> >> terminated. From the crash dump - can you see how many processes are
> >> still alive? Are they all part of kernel, or are there more
> >> applications that did not completely terminate?
> >>
> >> And if we are to look closer at the logger track - do you know which
> >> logger handlers were active? Are there any children under the
> >> logger_sup supervisor, and if so, what is their state? You mention
> >> lager above - do you know how lager and logger are connected (is lager
> >> a handler under logger, or does it run independently)?
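> >>
> >> (On a running node, the active handler configurations can be listed
> >> with logger:get_handler_config(); the output is a list of maps of
> >> roughly this shape:
> >>
> >>     1> logger:get_handler_config().
> >>     [#{id => default, module => logger_std_h, ...},
> >>      #{id => error_logger, module => error_logger, ...}]
> >>
> >> That would also tell us whether lager registered itself as a handler.)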
> >>
> >> If you see anything suspicious related to logger, please feel free to
> >> send me the crash dump off-list, and I'll see if I can figure out
> >> anything more.
> >>
> >> Regards
> >> /siri
> >>
> >>
> >>
> >> On Mon, 2 Jul 2018 at 18:24, Roger Lipscombe <roger@REDACTED> wrote:
> >>>
> >>> Right, so I've done some digging. I got a crash dump and by looking at
> >>> the stack trace for application_controller, I can see that it's
> >>> hanging in application_controller:terminate, presumably while waiting
> >>> for one of the processes to stop. The pid given is <0.893.0>. Looking
> >>> a bit further up the stack trace gets me the process state (possibly
> >>> stale?), which contains a bunch of applications and their pids. The
> >>> pid it's hanging on belongs to 'kernel', which I'd expect to be the
> >>> last (?) application to be shut down. The process state (assuming it's
> >>> not stale) still lists a load of other applications with pids. There are
> >>> a bunch of ours, but also others, such as runtime_tools, cowboy,
> >>> amqp_client, hackney, lager, etc.
> >>>
> >>> Is any of this useful in finding out why it's hanging?
> >>>
> >>> On 2 July 2018 at 16:54, Roger Lipscombe <roger@REDACTED> wrote:
> >>> > This sounds similar to
> >>> > http://erlang.org/pipermail/erlang-questions/2012-December/071223.html,
> >>> > but it's not quite the same, as far as I can tell.
> >>> >
> >>> > I've noticed that, at some point since upgrading from OTP-20.3 to
> >>> > OTP-21.0 (along with the necessary dependency updates), my Erlang
> >>> > nodes are no longer stopping at the end of our system test run.
> >>> >
> >>> > The nodes are orchestrated by having 'erlexec' run a bash script
> >>> > which uses (effectively) 'foo/bin/foo foreground &'. I'm relying on
> >>> > erlexec killing the bash script and that killing the nodes. This
> >>> > works fine when the nodes are using OTP-20.3, but not with OTP-21.0.
> >>> >
> >>> > If I connect to the node, I can issue 'init:stop()', and it returns
> >>> > ok, but nothing happens. If I use 'application:which_applications()',
> >>> > I get a timeout.
> >>> >
> >>> > Unlike the linked discussion, I _can_ repeatedly connect a remote
> >>> > shell (using erl -remsh), but I have to resort to erlang:halt() to
> >>> > stop the node. Since my system test environment relies on orderly
> >>> > process group teardown to stop the nodes, that's not useful.
> >>> >
> >>> > As I understand it, process group teardown results in SIGTERM,
> >>> > which results in a call to init:stop, which should stop the node,
> >>> > but it isn't.
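> >>> >
> >>> > (For what it's worth, my understanding is that the default
> >>> > erl_signal_handler is what maps SIGTERM to init:stop/0, and delivery
> >>> > is controlled via the os module, with 'handle' being the default for
> >>> > SIGTERM:
> >>> >
> >>> >     %% deliver SIGTERM to erl_signal_server, whose default
> >>> >     %% handler calls init:stop()
> >>> >     os:set_signal(sigterm, handle).
> >>> >
> >>> > Since init:stop() also hangs when called by hand, the signal path
> >>> > itself doesn't seem to be the problem.)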
> >>> >
> >>> > How do I figure out why init:stop() isn't working?
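> >>> >
> >>> > (I'm guessing the place to start is a remote shell, inspecting what
> >>> > init and application_controller are blocked on, something like:
> >>> >
> >>> >     rp(erlang:process_info(whereis(init), current_stacktrace)).
> >>> >     rp(erlang:process_info(whereis(application_controller),
> >>> >                            current_stacktrace)).
> >>> >
> >>> > but pointers welcome.)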