[erlang-bugs] Strange application shutdown deadlock

Dan Gudmundsson dgud@REDACTED
Thu Feb 20 12:37:22 CET 2014


I have merged a fix now; it will be available whenever master is pushed to
github, and in 17.0-RC2.

/Dan


On Tue, Jan 28, 2014 at 1:32 PM, Tim Watson <watson.timothy@REDACTED> wrote:

> Hi Siri.
>
> Ok thanks for confirming that. Do you reckon it'll make a bugfix/patch
> release or are we looking at R17 now?
>
> Thanks,
> Tim
>
> On 28 Jan 2014, at 10:00, Siri Hansen <erlangsiri@REDACTED> wrote:
>
> Hi Tim, I'm sorry, but this has not made it into the latest release. The
> reason is that there have been too many other tasks with higher priority.
> Regards
> /siri
>
>
> 2014/1/27 Tim Watson <watson.timothy@REDACTED>
>
>> Hi Siri,
>>
>> Did the fix for this bug make it into the latest release? I didn't see
>> anything in the release notes.
>>
>> Cheers,
>> Tim
>>
>>
>>
>> On 19 June 2013 10:47, Siri Hansen <erlangsiri@REDACTED> wrote:
>>
>>> Hi Tim - I'm so sorry for the long delay here. I agree that this is a
>>> bug in the application master and I have written a ticket for it so it will
>>> be prioritized into our backlog.
>>> Thanks for the report!
>>> Regards
>>> /siri
>>>
>>>
>>> 2013/5/29 Tim Watson <watson.timothy@REDACTED>
>>>
>>>> Any word from the OTP folks on this one?
>>>>
>>>> On 24 May 2013, at 18:41, Tim Watson <watson.timothy@REDACTED> wrote:
>>>>
>>>> > Gah, sorry folks - this has nothing to do with release handling, that
>>>> > was a red herring. Someone just pointed out that the call to get_child
>>>> > originates in a status check in our code.
>>>> >
>>>> > This still looks like a bug to me though, since if you're going to
>>>> > handle "other" messages in terminate_loop you ought to ensure they can't
>>>> > deadlock the VM's shutdown sequence.
>>>> >
>>>> > Cheers,
>>>> > Tim
>>>> >
>>>> > On 24 May 2013, at 15:45, Fred Hebert <mononcqc@REDACTED> wrote:
>>>> >
>>>> >> Quick question: are you running a release?
>>>> >>
>>>> >> If so, the last time I saw deadlocks like that, the problem was solved
>>>> >> by making sure *all* my applications depended on stdlib and kernel in
>>>> >> their .app file. When I skipped them, sometimes I'd find that things
>>>> >> would lock up.
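>>>> >>
>>>> >> (Just to illustrate what I mean -- the application name and module
>>>> >> names below are made up, the only part that matters is the
>>>> >> applications list:)
>>>> >>
>>>> >>     %% myapp.app -- illustration only; note kernel and stdlib
>>>> >>     {application, myapp,
>>>> >>      [{description, "example"},
>>>> >>       {vsn, "1.0.0"},
>>>> >>       {modules, [myapp_app, myapp_sup]},
>>>> >>       {registered, [myapp_sup]},
>>>> >>       {applications, [kernel, stdlib]},   %% <-- don't leave these out
>>>> >>       {mod, {myapp_app, []}}]}.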
>>>> >>
>>>> >> My guess was that dependencies from stdlib or kernel got unloaded
>>>> >> before my app and broke something, but I'm not sure -- in my case, I
>>>> >> wasn't able to inspect the node as it appeared to be 100% blocked.
>>>> >>
>>>> >> Adding the apps ended up fixing the problem on the next shutdown. I'm
>>>> >> not sure if it might be a good fix for you, but it's a stab in the dark.
>>>> >>
>>>> >> Regards,
>>>> >> Fred.
>>>> >>
>>>> >> On 05/24, Tim Watson wrote:
>>>> >>> We came across this at a customer's site, where one of the nodes was
>>>> >>> apparently in the process of stopping and had been in that state for
>>>> >>> at least 24 hours. The short version is that an application_master
>>>> >>> appears to be stuck waiting for a child pid (is that the X process,
>>>> >>> or the root supervisor?) which is *not* linked to it...
>>>> >>>
>>>> >>> The application controller is in the process of stopping an
>>>> >>> application, during which a `get_child' message appears to have come
>>>> >>> in to that application's application_master from somewhere - we are
>>>> >>> *not* running appmon, so I'm really confused how this can happen, as
>>>> >>> the only other place where I see (indirect) calls is via the sasl
>>>> >>> release_handler!? At the bottom of this email is a dump for the
>>>> >>> application_controller and the application_master for the app it is
>>>> >>> trying to shut down. I can verify that the pid which the
>>>> >>> application_master is waiting on is definitely not linked to it -
>>>> >>> i.e., process_info(AppMasterPid, links) doesn't contain the process
>>>> >>> <0.256.0> that the master appears to be waiting on.
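>>>> >>>
>>>> >>> (For reference, a check along these lines from a remote shell on the
>>>> >>> node is what I mean -- pid/3 is the shell helper, and the pids are
>>>> >>> the ones that appear in the dump below:)
>>>> >>>
>>>> >>>     AppMaster = pid(0,255,0), Child = pid(0,256,0),
>>>> >>>     {links, Links} = erlang:process_info(AppMaster, links),
>>>> >>>     lists:member(Child, Links).   %% false - the child is not linked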
>>>> >>>
>>>> >>> My reading of the code is that the application_master cannot end up
>>>> >>> in get_child_i unless a get_child request arrives whilst it is in its
>>>> >>> terminate loop. As I said, we're not using appmon, therefore I assume
>>>> >>> this originated in the sasl application's release_handler_1, though
>>>> >>> I'm not sure quite which route would take us there. The relevant bit
>>>> >>> of code in application_master appears to be:
>>>> >>>
>>>> >>> get_child_i(Child) ->
>>>> >>>     Child ! {self(), get_child},
>>>> >>>     receive
>>>> >>>         {Child, GrandChild, Mod} -> {GrandChild, Mod}
>>>> >>>     end.
>>>> >>>
>>>> >>> This in turn originates, I'd guess, in the third receive clause of
>>>> >>> terminate_loop/2. Anyway, should that code not be dealing with a
>>>> >>> potentially dead pid for Child, either by handling links effectively -
>>>> >>> perhaps there is an EXIT signal in the mailbox already which is being
>>>> >>> ignored here in get_child_i/1 - or by some other means?
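>>>> >>>
>>>> >>> Just to sketch the kind of thing I mean (this is not a patch, and the
>>>> >>> {undefined, undefined} fallback is only a placeholder): monitoring
>>>> >>> the child would stop a dead pid from leaving us stuck in the receive:
>>>> >>>
>>>> >>> get_child_i(Child) ->
>>>> >>>     %% a monitor fires even if Child is already dead, unlike the
>>>> >>>     %% bare receive above
>>>> >>>     Ref = erlang:monitor(process, Child),
>>>> >>>     Child ! {self(), get_child},
>>>> >>>     receive
>>>> >>>         {Child, GrandChild, Mod} ->
>>>> >>>             erlang:demonitor(Ref, [flush]),
>>>> >>>             {GrandChild, Mod};
>>>> >>>         {'DOWN', Ref, process, Child, _Reason} ->
>>>> >>>             {undefined, undefined}
>>>> >>>     end.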
>>>> >>>
>>>> >>> What follows below is the trace/dump output. Feel free to poke me
>>>> >>> for more info as needed.
>>>> >>>
>>>> >>> Cheers,
>>>> >>> Tim
>>>> >>>
>>>> >>> [TRACE/DUMP]
>>>> >>>
>>>> >>> pid: <6676.7.0>
>>>> >>> registered name: application_controller
>>>> >>> stacktrace: [{application_master,call,2,
>>>> >>>                  [{file,"application_master.erl"},{line,75}]},
>>>> >>>              {application_controller,stop_appl,3,
>>>> >>>                  [{file,"application_controller.erl"},{line,1393}]},
>>>> >>>              {application_controller,handle_call,3,
>>>> >>>                  [{file,"application_controller.erl"},{line,810}]},
>>>> >>>              {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,588}]}]
>>>> >>> -------------------------
>>>> >>> Program counter: 0x00007f9bf9a53720 (application_master:call/2 + 288)
>>>> >>> CP: 0x0000000000000000 (invalid)
>>>> >>> arity = 0
>>>> >>>
>>>> >>> 0x00007f9bd7948360 Return addr 0x00007f9bfb97de40 (application_controller:stop_appl/3 + 176)
>>>> >>> y(0)     #Ref<0.0.20562.258360>
>>>> >>> y(1)     #Ref<0.0.20562.258361>
>>>> >>> y(2)     []
>>>> >>>
>>>> >>> 0x00007f9bd7948380 Return addr 0x00007f9bfb973c68 (application_controller:handle_call/3 + 1392)
>>>> >>> y(0)     temporary
>>>> >>> y(1)     rabbitmq_web_dispatch
>>>> >>>
>>>> >>> 0x00007f9bd7948398 Return addr 0x00007f9bf9a600c8 (gen_server:handle_msg/5 + 272)
>>>> >>> y(0)
>>>> {state,[],[],[],[{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}],[],[{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}],[],[{rabbit,[{ssl_listeners,[5671]},{ssl_options,[{cacertfile,"/etc/rabbitmq/server.cacrt"},{certfile,"/etc/rabbitmq/server.crt"},{keyfile,"/etc/rabbitmq/server.key"},{verify,verify_none},{fail_if_no_peer_cert,false}]},{default_user,<<2
>>>> bytes>>},{default_pass,<<8
>>>> bytes>>},{vm_memory_high_watermark,5.000000e-01}]},{rabbitmq_management,[{listener,[{port,15672},{ssl,true}]}]}]}
>>>> >>> y(1)     rabbitmq_web_dispatch
>>>> >>> y(2)
>>>> [{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}]
>>>> >>> y(3)
>>>> [{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}]
>>>> >>>
>>>> >>> 0x00007f9bd79483c0 Return addr 0x00000000008827d8 (<terminate process normally>)
>>>> >>> y(0)     application_controller
>>>> >>> y(1)
>>>> {state,[],[],[],[{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}],[],[{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}],[],[{rabbit,[{ssl_listeners,[5671]},{ssl_options,[{cacertfile,"/etc/rabbitmq/server.cacrt"},{certfile,"/etc/rabbitmq/server.crt"},{keyfile,"/etc/rabbitmq/server.key"},{verify,verify_none},{fail_if_no_peer_cert,false}]},{default_user,<<2
>>>> bytes>>},{default_pass,<<8
>>>> bytes>>},{vm_memory_high_watermark,5.000000e-01}]},{rabbitmq_management,[{listener,[{port,15672},{ssl,true}]}]}]}
>>>> >>> y(2)     application_controller
>>>> >>> y(3)     <0.2.0>
>>>> >>> y(4)     {stop_application,rabbitmq_web_dispatch}
>>>> >>> y(5)     {<0.5864.275>,#Ref<0.0.20562.258345>}
>>>> >>> y(6)     Catch 0x00007f9bf9a600c8 (gen_server:handle_msg/5 + 272)
>>>> >>> -------------------------
>>>> >>>
>>>> >>> pid: <6676.255.0>
>>>> >>> registered name: none
>>>> >>> stacktrace: [{application_master,get_child_i,1,
>>>> >>>                  [{file,"application_master.erl"},{line,392}]},
>>>> >>>              {application_master,handle_msg,2,
>>>> >>>                  [{file,"application_master.erl"},{line,216}]},
>>>> >>>              {application_master,terminate_loop,2,
>>>> >>>                  [{file,"application_master.erl"},{line,206}]},
>>>> >>>              {application_master,terminate,2,
>>>> >>>                  [{file,"application_master.erl"},{line,227}]},
>>>> >>>              {application_master,handle_msg,2,
>>>> >>>                  [{file,"application_master.erl"},{line,219}]},
>>>> >>>              {application_master,main_loop,2,
>>>> >>>                  [{file,"application_master.erl"},{line,194}]},
>>>> >>>              {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]
>>>> >>> -------------------------
>>>> >>> Program counter: 0x00007f9bf9a570e0 (application_master:get_child_i/1 + 120)
>>>> >>> CP: 0x0000000000000000 (invalid)
>>>> >>> arity = 0
>>>> >>>
>>>> >>> 0x00007f9c1adc3dc8 Return addr 0x00007f9bf9a54eb0 (application_master:handle_msg/2 + 280)
>>>> >>> y(0)     <0.256.0>
>>>> >>>
>>>> >>> 0x00007f9c1adc3dd8 Return addr 0x00007f9bf9a54d20 (application_master:terminate_loop/2 + 520)
>>>> >>> y(0)     #Ref<0.0.20562.258362>
>>>> >>> y(1)     <0.9596.275>
>>>> >>> y(2)
>>>> {state,<0.256.0>,{appl_data,rabbitmq_web_dispatch,[],undefined,{rabbit_web_dispatch_app,[]},[rabbit_web_dispatch,rabbit_web_dispatch_app,rabbit_web_dispatch_registry,rabbit_web_dispatch_sup,rabbit_web_dispatch_util,rabbit_webmachine],[],infinity,infinity},[],0,<0.29.0>}
>>>> >>>
>>>> >>> 0x00007f9c1adc3df8 Return addr 0x00007f9bf9a55108 (application_master:terminate/2 + 192)
>>>> >>> y(0)     <0.256.0>
>>>> >>>
>>>> >>> 0x00007f9c1adc3e08 Return addr 0x00007f9bf9a54f70 (application_master:handle_msg/2 + 472)
>>>> >>> y(0)     []
>>>> >>> y(1)     normal
>>>> >>>
>>>> >>> 0x00007f9c1adc3e20 Return addr 0x00007f9bf9a54a60 (application_master:main_loop/2 + 1600)
>>>> >>> y(0)     <0.7.0>
>>>> >>> y(1)     #Ref<0.0.20562.258360>
>>>> >>> y(2)     Catch 0x00007f9bf9a54f70 (application_master:handle_msg/2 + 472)
>>>> >>>
>>>> >>> 0x00007f9c1adc3e40 Return addr 0x00007f9bfb969420 (proc_lib:init_p_do_apply/3 + 56)
>>>> >>> y(0)     <0.7.0>
>>>> >>>
>>>> >>> 0x00007f9c1adc3e50 Return addr 0x00000000008827d8 (<terminate process normally>)
>>>> >>> y(0)     Catch 0x00007f9bfb969440 (proc_lib:init_p_do_apply/3 + 88)
>>>> >>> -------------------------
>>>> >>>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>>
>>>
>>>
>>
>