[erlang-bugs] Strange application shutdown deadlock
Siri Hansen
erlangsiri@REDACTED
Tue Jan 28 11:00:47 CET 2014
Hi Tim, I'm sorry, but this has not made it into the latest release. The
reason is that there have been too many other tasks with higher priority.
Regards
/siri
2014/1/27 Tim Watson <watson.timothy@REDACTED>
> Hi Siri,
>
> Did this bug make it into the latest release? I didn't see anything in the
> release notes.
>
> Cheers,
> Tim
>
>
>
> On 19 June 2013 10:47, Siri Hansen <erlangsiri@REDACTED> wrote:
>
>> Hi Tim - I'm so sorry for the long delay here. I agree that this is a bug
>> in the application master and I have written a ticket for it, so it will be
>> prioritized into our backlog.
>> Thanks for the report!
>> Regards
>> /siri
>>
>>
>> 2013/5/29 Tim Watson <watson.timothy@REDACTED>
>>
>>> Any word from the OTP folks on this one?
>>>
>>> On 24 May 2013, at 18:41, Tim Watson <watson.timothy@REDACTED> wrote:
>>>
>>> > Gah, sorry folks - this has nothing to do with release handling, that
>>> was a red herring. Someone just pointed out that the call to get_child
>>> originates in a status check in our code.
>>> >
>>> > This still looks like a bug to me though, since if you're going to
>>> > handle "other" messages in terminate_loop you ought to ensure they can't
>>> > deadlock the VM's shutdown sequence.
>>> >
>>> > Cheers,
>>> > Tim
>>> >
>>> > On 24 May 2013, at 15:45, Fred Hebert <mononcqc@REDACTED> wrote:
>>> >
>>> >> Quick question: are you running a release?
>>> >>
>>> >> If so, the last time I saw deadlocks like that, the problem was solved
>>> >> by making sure *all* my applications depended on stdlib and kernel in
>>> >> their .app file. When I skipped them, I'd sometimes find that things
>>> >> would lock up.
>>> >>
>>> >> My guess was that dependencies from stdlib or kernel got unloaded
>>> >> before my app and broke something, but I'm not sure -- in my case, I
>>> >> wasn't able to inspect the node as it appeared to be 100% blocked.
>>> >>
>>> >> Adding the apps ended up fixing the problem on the next shutdown. I'm
>>> >> not sure whether it would be the right fix for you, but it's a stab in
>>> >> the dark.
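>>> >>
>>> >> For reference, a minimal sketch of what that looks like in an .app file
>>> >> (the application name and module names here are just placeholders):
>>> >>
>>> >>     {application, my_app,
>>> >>      [{description, "example"},
>>> >>       {vsn, "1.0.0"},
>>> >>       {modules, [my_app_app, my_app_sup]},
>>> >>       {registered, []},
>>> >>       %% list kernel and stdlib explicitly so the application controller
>>> >>       %% knows about the dependency when starting/stopping applications
>>> >>       {applications, [kernel, stdlib]},
>>> >>       {mod, {my_app_app, []}}]}.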
>>> >>
>>> >> Regards,
>>> >> Fred.
>>> >>
>>> >> On 05/24, Tim Watson wrote:
>>> >>> We came across this at a customer's site, where one of the nodes was
>>> >>> apparently in the process of stopping and had been in that state for at
>>> >>> least 24 hours. The short version is that an application_master appears
>>> >>> to be stuck waiting for a child pid (is that the X process, or the root
>>> >>> supervisor?) which is *not* linked to it...
>>> >>>
>>> >>> The application controller is in the process of stopping an
>>> >>> application, during which a `get_child' message appears to have come in
>>> >>> to that application's application_master from somewhere - we are *not*
>>> >>> running appmon, so I'm really confused about how this can happen, as
>>> >>> the only other place where I see (indirect) calls is via the sasl
>>> >>> release_handler!? At the bottom of this email is a dump for the
>>> >>> application_controller and the application_master for the app it is
>>> >>> trying to shut down. I can verify that the pid which the
>>> >>> application_master is waiting on is definitely not linked to it - i.e.,
>>> >>> process_info(AppMasterPid, links) doesn't contain the process <0.256.0>
>>> >>> that the master appears to be waiting on.
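>>> >>>
>>> >>> (For example, roughly:
>>> >>>
>>> >>>     {links, Links} = erlang:process_info(AppMasterPid, links),
>>> >>>     false = lists:member(ChildPid, Links)
>>> >>>
>>> >>> where AppMasterPid and ChildPid stand for the two pids in the dump
>>> >>> below.)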
>>> >>>
>>> >>> My reading of the code is that the application_master cannot end up in
>>> >>> get_child_i unless a get_child request was made which arrives whilst it
>>> >>> is in its terminate loop. As I said, we're not using appmon, therefore
>>> >>> I assume this originated in the sasl application's release_handler_1,
>>> >>> though I'm not sure quite which route would take us there. The relevant
>>> >>> bit of code in application_master appears to be:
>>> >>>
>>> >>>     get_child_i(Child) ->
>>> >>>         Child ! {self(), get_child},
>>> >>>         receive
>>> >>>             {Child, GrandChild, Mod} -> {GrandChild, Mod}
>>> >>>         end.
>>> >>>
>>> >>> This in turn originates, I'd guess, in the third receive clause of
>>> >>> terminate_loop/2. Anyway, should that code not be dealing with a
>>> >>> potentially dead pid for Child, either by handling links effectively -
>>> >>> perhaps there is an EXIT signal in the mailbox already which is being
>>> >>> ignored here in get_child_i/1 - or by some other means?
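>>> >>>
>>> >>> As a rough sketch of the kind of defensive handling I mean (not a
>>> >>> tested patch, just to illustrate not blocking forever on a pid that
>>> >>> may already be dead), something along these lines:
>>> >>>
>>> >>>     get_child_i(Child) ->
>>> >>>         %% monitor the child so a dead or dying pid cannot leave us
>>> >>>         %% stuck in this receive during shutdown
>>> >>>         Ref = erlang:monitor(process, Child),
>>> >>>         Child ! {self(), get_child},
>>> >>>         receive
>>> >>>             {Child, GrandChild, Mod} ->
>>> >>>                 erlang:demonitor(Ref, [flush]),
>>> >>>                 {GrandChild, Mod};
>>> >>>             {'DOWN', Ref, process, Child, _Reason} ->
>>> >>>                 %% the child is gone - report no grandchild rather
>>> >>>                 %% than deadlock the terminate loop
>>> >>>                 {undefined, undefined}
>>> >>>         end.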
>>> >>>
>>> >>> What follows below is the trace/dump output. Feel free to poke me for
>>> >>> more info as needed.
>>> >>>
>>> >>> Cheers,
>>> >>> Tim
>>> >>>
>>> >>> [TRACE/DUMP]
>>> >>>
>>> >>> pid: <6676.7.0>
>>> >>> registered name: application_controller
>>> >>> stacktrace: [{application_master,call,2,
>>> >>>                  [{file,"application_master.erl"},{line,75}]},
>>> >>>              {application_controller,stop_appl,3,
>>> >>>                  [{file,"application_controller.erl"},{line,1393}]},
>>> >>>              {application_controller,handle_call,3,
>>> >>>                  [{file,"application_controller.erl"},{line,810}]},
>>> >>>              {gen_server,handle_msg,5,
>>> >>>                  [{file,"gen_server.erl"},{line,588}]}]
>>> >>> -------------------------
>>> >>> Program counter: 0x00007f9bf9a53720 (application_master:call/2 + 288)
>>> >>> CP: 0x0000000000000000 (invalid)
>>> >>> arity = 0
>>> >>>
>>> >>> 0x00007f9bd7948360 Return addr 0x00007f9bfb97de40 (application_controller:stop_appl/3 + 176)
>>> >>> y(0) #Ref<0.0.20562.258360>
>>> >>> y(1) #Ref<0.0.20562.258361>
>>> >>> y(2) []
>>> >>>
>>> >>> 0x00007f9bd7948380 Return addr 0x00007f9bfb973c68 (application_controller:handle_call/3 + 1392)
>>> >>> y(0) temporary
>>> >>> y(1) rabbitmq_web_dispatch
>>> >>>
>>> >>> 0x00007f9bd7948398 Return addr 0x00007f9bf9a600c8 (gen_server:handle_msg/5 + 272)
>>> >>> y(0)
>>> {state,[],[],[],[{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}],[],[{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}],[],[{rabbit,[{ssl_listeners,[5671]},{ssl_options,[{cacertfile,"/etc/rabbitmq/server.cacrt"},{certfile,"/etc/rabbitmq/server.crt"},{keyfile,"/etc/rabbitmq/server.key"},{verify,verify_none},{fail_if_no_peer_cert,false}]},{default_user,<<2
>>> bytes>>},{default_pass,<<8
>>> bytes>>},{vm_memory_high_watermark,5.000000e-01}]},{rabbitmq_management,[{listener,[{port,15672},{ssl,true}]}]}]}
>>> >>> y(1) rabbitmq_web_dispatch
>>> >>> y(2)
>>> [{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}]
>>> >>> y(3)
>>> [{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}]
>>> >>>
>>> >>> 0x00007f9bd79483c0 Return addr 0x00000000008827d8 (<terminate process normally>)
>>> >>> y(0) application_controller
>>> >>> y(1)
>>> {state,[],[],[],[{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}],[],[{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}],[],[{rabbit,[{ssl_listeners,[5671]},{ssl_options,[{cacertfile,"/etc/rabbitmq/server.cacrt"},{certfile,"/etc/rabbitmq/server.crt"},{keyfile,"/etc/rabbitmq/server.key"},{verify,verify_none},{fail_if_no_peer_cert,false}]},{default_user,<<2
>>> bytes>>},{default_pass,<<8
>>> bytes>>},{vm_memory_high_watermark,5.000000e-01}]},{rabbitmq_management,[{listener,[{port,15672},{ssl,true}]}]}]}
>>> >>> y(2) application_controller
>>> >>> y(3) <0.2.0>
>>> >>> y(4) {stop_application,rabbitmq_web_dispatch}
>>> >>> y(5) {<0.5864.275>,#Ref<0.0.20562.258345>}
>>> >>> y(6) Catch 0x00007f9bf9a600c8 (gen_server:handle_msg/5 + 272)
>>> >>> -------------------------
>>> >>>
>>> >>> pid: <6676.255.0>
>>> >>> registered name: none
>>> >>> stacktrace: [{application_master,get_child_i,1,
>>> >>>                  [{file,"application_master.erl"},{line,392}]},
>>> >>>              {application_master,handle_msg,2,
>>> >>>                  [{file,"application_master.erl"},{line,216}]},
>>> >>>              {application_master,terminate_loop,2,
>>> >>>                  [{file,"application_master.erl"},{line,206}]},
>>> >>>              {application_master,terminate,2,
>>> >>>                  [{file,"application_master.erl"},{line,227}]},
>>> >>>              {application_master,handle_msg,2,
>>> >>>                  [{file,"application_master.erl"},{line,219}]},
>>> >>>              {application_master,main_loop,2,
>>> >>>                  [{file,"application_master.erl"},{line,194}]},
>>> >>>              {proc_lib,init_p_do_apply,3,
>>> >>>                  [{file,"proc_lib.erl"},{line,227}]}]
>>> >>> -------------------------
>>> >>> Program counter: 0x00007f9bf9a570e0 (application_master:get_child_i/1 + 120)
>>> >>> CP: 0x0000000000000000 (invalid)
>>> >>> arity = 0
>>> >>>
>>> >>> 0x00007f9c1adc3dc8 Return addr 0x00007f9bf9a54eb0 (application_master:handle_msg/2 + 280)
>>> >>> y(0) <0.256.0>
>>> >>>
>>> >>> 0x00007f9c1adc3dd8 Return addr 0x00007f9bf9a54d20 (application_master:terminate_loop/2 + 520)
>>> >>> y(0) #Ref<0.0.20562.258362>
>>> >>> y(1) <0.9596.275>
>>> >>> y(2) {state,<0.256.0>,{appl_data,rabbitmq_web_dispatch,[],undefined,{rabbit_web_dispatch_app,[]},[rabbit_web_dispatch,rabbit_web_dispatch_app,rabbit_web_dispatch_registry,rabbit_web_dispatch_sup,rabbit_web_dispatch_util,rabbit_webmachine],[],infinity,infinity},[],0,<0.29.0>}
>>> >>>
>>> >>> 0x00007f9c1adc3df8 Return addr 0x00007f9bf9a55108 (application_master:terminate/2 + 192)
>>> >>> y(0) <0.256.0>
>>> >>>
>>> >>> 0x00007f9c1adc3e08 Return addr 0x00007f9bf9a54f70 (application_master:handle_msg/2 + 472)
>>> >>> y(0) []
>>> >>> y(1) normal
>>> >>>
>>> >>> 0x00007f9c1adc3e20 Return addr 0x00007f9bf9a54a60 (application_master:main_loop/2 + 1600)
>>> >>> y(0) <0.7.0>
>>> >>> y(1) #Ref<0.0.20562.258360>
>>> >>> y(2) Catch 0x00007f9bf9a54f70 (application_master:handle_msg/2 + 472)
>>> >>>
>>> >>> 0x00007f9c1adc3e40 Return addr 0x00007f9bfb969420 (proc_lib:init_p_do_apply/3 + 56)
>>> >>> y(0) <0.7.0>
>>> >>>
>>> >>> 0x00007f9c1adc3e50 Return addr 0x00000000008827d8 (<terminate process normally>)
>>> >>> y(0) Catch 0x00007f9bfb969440 (proc_lib:init_p_do_apply/3 + 88)
>>> >>> -------------------------
>>> >>>
>>> >>
>>> >>
>>> >>
>>>
>>
>>
>