<div dir="ltr">For information: After some off-list trouble shooting we found the reason for this issue to be a dead-lock in error_logger and the logger_sup supervisor. A correction is merged to maint and master today.<div>Regards</div><div>/siri<br><div><br></div><div><br><div class="gmail_quote"><div dir="ltr">Den tir. 3. jul. 2018 kl. 13:48 skrev Roger Lipscombe <<a href="mailto:roger@differentpla.net" target="_blank">roger@differentpla.net</a>>:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Note that I can reproduce this consistently, so if there's any<br>
experimentation (tracing, patches, etc.) you want me to try, I should
be able to do that.
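
In case it helps, here's the kind of thing I'd run -- a minimal dbg
sketch (standard OTP tracing, nothing project-specific) to watch the
shutdown path reach the two functions that show up on the stuck
stacks below:

    %% Run on the node before triggering shutdown. The 'x' shorthand
    %% is the built-in exception-trace match spec, so we see returns
    %% and exceptions as well as calls.
    dbg:tracer(),
    dbg:p(all, [c]),
    dbg:tpl(error_logger, stop, 0, x),
    dbg:tpl(logger_server, call_h, 4, x),
    init:stop().
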
On 3 July 2018 at 12:45, Roger Lipscombe <roger@differentpla.net> wrote:
> Before I go looking for permission to release the crash dump (it'll
> take at least a couple of hours; I'll ask), let's see what we can
> figure out from here.
>
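> In the meantime, the dump itself is browsable locally -- a minimal
> sketch using the stock crashdump viewer that ships with OTP (the
> file name here is made up):
>
>     %% From any Erlang shell with the observer application available.
>     %% The Processes tab lists every process alive at dump time,
>     %% with stacks, links, monitors and message queues.
>     crashdump_viewer:start("erl_crash.dump").
>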
> There are 44 processes alive. Most of them look fairly innocuous, but,
> in terms of logger, I see:
>
> - A process <0.2130.0> sitting in erlang:apply, issuing
>   error_logger:stop, which is a gen_server:call, timeout infinity, to
>   <0.927.0>. Stack is logger_server:'-call_h_async/4-fun-0-'/2 + 152 ->
>   logger_server:call_h/4 + 96 -> error_logger:removing_handler/1 + 112
>   -> error_logger:stop/0 + 72
> - <0.927.0> is 'logger_sup'; it appears to be trying to call
>   <0.890.0>, which is 'logger'.
> - <0.890.0> is, apparently, sitting in proc_lib:init_p_do_apply/3
>
> I suspect that logger_sup has failed in supervisor:do_terminate, and
> is attempting to report an error, via logger (which is problematic, I
> guess). I see the following on the logger_sup stack:
>
> #{level => error,
>   meta =>
>       #{domain => [otp,sasl],
>         error_logger => #{tag => error_report,type => supervisor_report},
>         file => "supervisor.erl",
>         gl => <0.894.0>,
>         line => 838,
>         logger_formatter => #{title => "SUPERVISOR REPORT"},
>         mfa => {supervisor,do_terminate,2},
>         pid => <0.927.0>,
>         report_cb => fun logger:format_otp_report/1,
>         time => 1530616282326266},
>   msg =>
>       {report,#{label => {supervisor,shutdown_error},
>                 report =>
>                     [{supervisor,{local,logger_sup}},
>                      {errorContext,shutdown_error},
>                      {reason,killed},
>                      {offender,[{pid,<0.929.0>},
>                                 {id,default},
>                                 {mfargs,{logger_std_h,start_link,undefined}},
>                                 {restart_type,temporary},
>                                 {shutdown,2000},
>                                 {child_type,worker}]}]}}}
>
> At this point, logger_sup has zero links, is monitoring 'logger' and
> is monitored by kernel_sup and the process that's calling
> error_logger:stop.
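>
> So the shape of the hang looks like a call cycle. A minimal
> standalone sketch of that class of deadlock (hypothetical module
> deadlock_demo, nothing to do with the real logger code):
>
>     -module(deadlock_demo).
>     -behaviour(gen_server).
>     -export([start/0, init/1, handle_call/3, handle_cast/2]).
>
>     start() ->
>         {ok, _} = gen_server:start({local, a}, ?MODULE, b, []),
>         {ok, _} = gen_server:start({local, b}, ?MODULE, a, []),
>         gen_server:call(a, ping, infinity).   %% never returns
>
>     init(Peer) -> {ok, Peer}.
>
>     %% While handling its own call, each server calls the other with
>     %% an infinity timeout, so neither can ever reply.
>     handle_call(ping, _From, Peer) ->
>         Reply = gen_server:call(Peer, ping, infinity),
>         {reply, Reply, Peer}.
>
>     handle_cast(_Msg, Peer) -> {noreply, Peer}.
>
> Here the cycle seems to be error_logger:stop -> logger_sup ->
> logger, which is presumably waiting on the async handler removal
> that started it all, so nothing ever unwinds.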
>
> For a *running* node, I get:
>
> 1> supervisor:which_children(logger_sup).
> [{error_logger,<0.1023.0>,worker,dynamic},
>  {default,<0.929.0>,worker,[logger_std_h]}]
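>
> Possibly also useful for comparison -- the handler configuration can
> be dumped on a healthy node with the stock logger API in OTP-21,
> nothing custom:
>
>     %% Lists every installed handler with its module and config;
>     %% 'default' here should be the logger_std_h child above.
>     logger:get_handler_config().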
>
> Assuming that the low process IDs are relatively stable, that
> <0.929.0> for the default logger correlates with the offender in the
> supervisor report above.
>
> I *believe* that lager's running independently. That is: I've done
> nothing deliberate to connect the two. I'm using lager 3.6.3.
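>
> If it helps to verify that, something like this should show whether
> lager has hooked itself into logger (a sketch; lager_event is the
> gen_event manager lager registers under its default configuration):
>
>     %% Handlers installed under logger -- lager would show up here
>     %% if the two were connected:
>     logger:get_handler_ids().
>     %% Handlers lager manages on its own gen_event manager:
>     gen_event:which_handlers(lager_event).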
>
> Thanks,
> Roger.
>
> On 3 July 2018 at 11:32, Siri Hansen <erlangsiri@gmail.com> wrote:
>> Hi Roger,
>>
>> There have been a few changes in the shutdown procedure (init.erl) with
>> the introduction of Logger in OTP-21. There are also some changes in the
>> supervision tree of the Kernel application related to Logger. I don't
>> know whether this has anything to do with your problem, but it's
>> certainly possible.
>>
>> As you say, Kernel is the last application to be taken down (it's the last
>> element in the #state.running of application_controller), and the state you
>> see is probably stale in the sense that all applications listed before
>> kernel in the 'running' list should already have been terminated. From the
>> crash dump - can you see how many processes are still alive? Are they all
>> part of kernel, or are there more applications that did not completely
>> terminate?
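>>
>> If the node still answers a remote shell, something along these lines
>> could answer that directly (a sketch; application:get_application/1
>> goes via application_controller, so it may hang or fail on a node
>> that is already stuck in shutdown):
>>
>>     %% For every live process: its registered name (if any) and the
>>     %% application it belongs to.
>>     [{P,
>>       erlang:process_info(P, registered_name),
>>       application:get_application(P)}
>>      || P <- erlang:processes()].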
>>
>> And if we are to look closer at the logger track - do you know which logger
>> handlers were active? Are there any children under the logger_sup
>> supervisor, and if so, what is their state? You mention lager above - do
>> you know how lager and logger are connected (is lager a handler under
>> logger, or does it run independently)?
>>
>> If you see anything suspicious related to logger, please feel free to send
>> me the crash dump off-list, and I'll see if I can figure out anything more.
>>
>> Regards
>> /siri
>>
>> On 2 July 2018 at 18:24, Roger Lipscombe <roger@differentpla.net> wrote:
>>>
>>> Right, so I've done some digging. I got a crash dump and by looking at
>>> the stack trace for application_controller, I can see that it's
>>> hanging in application_controller:terminate, presumably while waiting
>>> for one of the processes to stop. The pid given is <0.893.0>. Looking
>>> a bit further up the stack trace gets me the process state (possibly
>>> stale?), which contains a bunch of applications and pids. The pid it's
>>> hanging on is 'kernel', which I'd expect to be the last (?)
>>> application to be shut down. The process state (assuming it's not
>>> stale) has a load of other applications in there with pids. There are
>>> a bunch of ours, but also others, such as runtime_tools, cowboy,
>>> amqp_client, hackney, lager, etc.
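>>>
>>> Since I can still connect a remote shell, I can also poke at the
>>> live process rather than the dump -- a sketch using plain
>>> process_info, no special tooling:
>>>
>>>     %% Where is application_controller blocked right now?
>>>     erlang:process_info(whereis(application_controller),
>>>                         [current_function, current_stacktrace,
>>>                          message_queue_len]).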
>>>
>>> Is any of this useful in finding out why it's hanging?
>>>
>>> On 2 July 2018 at 16:54, Roger Lipscombe <roger@differentpla.net> wrote:
>>> > This sounds similar to
>>> > http://erlang.org/pipermail/erlang-questions/2012-December/071223.html,
>>> > but it's not quite the same, as far as I can tell.
>>> >
>>> > I've noticed that, at some point since upgrading from OTP-20.3 to
>>> > OTP-21.0 (along with the necessary dependency updates), my Erlang
>>> > nodes are no longer stopping at the end of our system test run.
>>> >
>>> > The nodes are orchestrated by having 'erlexec' run a bash script which
>>> > uses (effectively) 'foo/bin/foo foreground &'. I'm relying on erlexec
>>> > killing the bash script and that killing the nodes. This works fine
>>> > when the nodes are using OTP-20.3, but not with OTP-21.0.
>>> >
>>> > If I connect to the node, I can issue 'init:stop()', and it returns
>>> > ok, but nothing happens. If I use 'application:which_applications()',
>>> > I get a timeout.
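>>> >
>>> > (For what it's worth, init itself will say whether it thinks it is
>>> > shutting down -- a minimal check, nothing custom:)
>>> >
>>> >     %% Returns {InternalStatus, ProvidedStatus}; InternalStatus
>>> >     %% should be 'stopping' once init:stop() has been accepted.
>>> >     init:get_status().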
>>> >
>>> > Unlike the linked discussion, I _can_ repeatedly connect a remote
>>> > shell (using erl -remsh), but I have to resort to erlang:halt() to
>>> > stop the node. Since my system test environment relies on orderly
>>> > process group teardown to stop the nodes, that's not useful.
>>> >
>>> > As I understand it, process group teardown results in SIGTERM, which
>>> > results in a call to init:stop, which should stop the node. It isn't.
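>>> >
>>> > (If the SIGTERM leg needs verifying, I could swap in a probe
>>> > handler on erl_signal_server -- a sketch, the module name
>>> > sigterm_probe is made up, and it reimplements the default
>>> > handler's SIGTERM behaviour with a log line added:)
>>> >
>>> >     -module(sigterm_probe).
>>> >     -behaviour(gen_event).
>>> >     -export([init/1, handle_event/2, handle_call/2]).
>>> >
>>> >     init(_) -> {ok, #{}}.
>>> >
>>> >     %% Log that the signal arrived, then do what the default
>>> >     %% erl_signal_handler does for SIGTERM: call init:stop().
>>> >     handle_event(sigterm, State) ->
>>> >         io:format("SIGTERM received; calling init:stop()~n"),
>>> >         ok = init:stop(),
>>> >         {ok, State};
>>> >     handle_event(_Other, State) ->
>>> >         {ok, State}.
>>> >
>>> >     handle_call(_Request, State) ->
>>> >         {ok, ok, State}.
>>> >
>>> > installed in place of the default handler with:
>>> >
>>> >     gen_event:swap_sup_handler(erl_signal_server,
>>> >                                {erl_signal_handler, []},
>>> >                                {sigterm_probe, []}).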
>>> >
>>> > How do I figure out why init:stop() isn't working?