[erlang-bugs] Strange application shutdown deadlock
Fred Hebert
mononcqc@REDACTED
Fri May 24 16:45:47 CEST 2013
Quick question: are you running a release?
If so, last time I've seen deadlocks like that was solved by making sure
*all* my applications did depend on stdlib and kernel in their app file.
When I skipped them, sometimes I'd find that things would lock up.
My guess was that dependencies from stdlib or kernel got unloaded before
my app and broke something, but I'm not sure -- In my case, I wasn't
able to inspect the node as it appeared to be 100% blocked.
Adding the apps ended up fixing the problem on the next shutdown. I'm
not sure if it might be a good fix for you, but it's a stab in the dark,
Regards,
Fred.
On 05/24, Tim Watson wrote:
> We came across this at a customer's site, where one of the nodes was apparently in the process of stopping and had been in that state for at least 24 hours. The short version is that an application_master appears to be stuck waiting for a child pid (is that the X process, or the root supervisor?) which is *not* linked to it...
>
> The application controller is in the process of stopping an application, during which process a `get_child' message appears to have come in to that application's application_master from somewhere - we are *not* running appmon, so I'm really confused how this can happen, as the only other place where I see (indirect) calls are via the sasl release_handler!? At the bottom of this email is a dump for the application_controller and the application_master for the app it is trying to shut down. I can verify that the pid which the application_master is waiting on is definitely not linked to it - i.e., process_info(links, AppMasterPid) doesn't contain the process <0.256.0> that the master appears to be waiting on.
>
> My reading of the code is that the application_master cannot end up in get_child_i unless a get_child request was made which arrives whilst it is in its terminate loop. As I said, we're not using appmon, therefore I assume this originated in the sasl application's release_handler_1, though I'm not sure quite which route would take us there. The relevant bit of code in application_master appears to be:
>
> get_child_i(Child) ->
> Child ! {self(), get_child},
> receive
> {Child, GrandChild, Mod} -> {GrandChild, Mod}
> end.
>
> This in turn originates, I'd guess, in the third receive clause of terminate_loop/2. Anyway, should that code not be dealing with a potentially dead pid for Child, either by handling links effectively - perhaps there is an EXIT signal in the mailbox already which is being ignored here in get_child_i/1 - or by some other means?
>
> What follows below is the trace/dump output. Feel free to poke me for more info as needed.
>
> Cheers,
> Tim
>
> [TRACE/DUMP]
>
> pid: <6676.7.0>
> registered name: application_controller
> stacktrace: [{application_master,call,2,
> [{file,"application_master.erl"},{line,75}]},
> {application_controller,stop_appl,3,
> [{file,"application_controller.erl"},
> {line,1393}]},
> {application_controller,handle_call,3,
> [{file,"application_controller.erl"},
> {line,810}]},
> {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,588}]}]
> -------------------------
> Program counter: 0x00007f9bf9a53720 (application_master:call/2 + 288)
> CP: 0x0000000000000000 (invalid)
> arity = 0
>
> 0x00007f9bd7948360 Return addr 0x00007f9bfb97de40 (application_controller:stop_appl/3 + 176)
> y(0) #Ref<0.0.20562.258360>
> y(1) #Ref<0.0.20562.258361>
> y(2) []
>
> 0x00007f9bd7948380 Return addr 0x00007f9bfb973c68 (application_controller:handle_call/3 + 1392)
> y(0) temporary
> y(1) rabbitmq_web_dispatch
>
> 0x00007f9bd7948398 Return addr 0x00007f9bf9a600c8 (gen_server:handle_msg/5 + 272)
> y(0) {state,[],[],[],[{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}],[],[{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}],[],[{rabbit,[{ssl_listeners,[5671]},{ssl_options,[{cacertfile,"/etc/rabbitmq/server.cacrt"},{certfile,"/etc/rabbitmq/server.crt"},{keyfile,"/etc/rabbitmq/server.key"},{verify,verify_none},{fail_if_no_peer_cert,false}]},{default_user,<<2 bytes>>},{default_pass,<<8 bytes>>},{vm_memory_high_watermark,5.000000e-01}]},{rabbitmq_management,[{listener,[{port,15672},{ssl,true}]}]}]}
> y(1) rabbitmq_web_dispatch
> y(2) [{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}]
> y(3) [{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}]
>
> 0x00007f9bd79483c0 Return addr 0x00000000008827d8 (<terminate process normally>)
> y(0) application_controller
> y(1) {state,[],[],[],[{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}],[],[{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}],[],[{rabbit,[{ssl_listeners,[5671]},{ssl_options,[{cacertfile,"/etc/rabbitmq/server.cacrt"},{certfile,"/etc/rabbitmq/server.crt"},{keyfile,"/etc/rabbitmq/server.key"},{verify,verify_none},{fail_if_no_peer_cert,false}]},{default_user,<<2 bytes>>},{default_pass,<<8 bytes>>},{vm_memory_high_watermark,5.000000e-01}]},{rabbitmq_management,[{listener,[{port,15672},{ssl,true}]}]}]}
> y(2) application_controller
> y(3) <0.2.0>
> y(4) {stop_application,rabbitmq_web_dispatch}
> y(5) {<0.5864.275>,#Ref<0.0.20562.258345>}
> y(6) Catch 0x00007f9bf9a600c8 (gen_server:handle_msg/5 + 272)
> -------------------------
>
> pid: <6676.255.0>
> registered name: none
> stacktrace: [{application_master,get_child_i,1,
> [{file,"application_master.erl"},{line,392}]},
> {application_master,handle_msg,2,
> [{file,"application_master.erl"},{line,216}]},
> {application_master,terminate_loop,2,
> [{file,"application_master.erl"},{line,206}]},
> {application_master,terminate,2,
> [{file,"application_master.erl"},{line,227}]},
> {application_master,handle_msg,2,
> [{file,"application_master.erl"},{line,219}]},
> {application_master,main_loop,2,
> [{file,"application_master.erl"},{line,194}]},
> {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]
> -------------------------
> Program counter: 0x00007f9bf9a570e0 (application_master:get_child_i/1 + 120)
> CP: 0x0000000000000000 (invalid)
> arity = 0
>
> 0x00007f9c1adc3dc8 Return addr 0x00007f9bf9a54eb0 (application_master:handle_msg/2 + 280)
> y(0) <0.256.0>
>
> 0x00007f9c1adc3dd8 Return addr 0x00007f9bf9a54d20 (application_master:terminate_loop/2 + 520)
> y(0) #Ref<0.0.20562.258362>
> y(1) <0.9596.275>
> y(2) {state,<0.256.0>,{appl_data,rabbitmq_web_dispatch,[],undefined,{rabbit_web_dispatch_app,[]},[rabbit_web_dispatch,rabbit_web_dispatch_app,rabbit_web_dispatch_registry,rabbit_web_dispatch_sup,rabbit_web_dispatch_util,rabbit_webmachine],[],infinity,infinity},[],0,<0.29.0>}
>
> 0x00007f9c1adc3df8 Return addr 0x00007f9bf9a55108 (application_master:terminate/2 + 192)
> y(0) <0.256.0>
>
> 0x00007f9c1adc3e08 Return addr 0x00007f9bf9a54f70 (application_master:handle_msg/2 + 472)
> y(0) []
> y(1) normal
>
> 0x00007f9c1adc3e20 Return addr 0x00007f9bf9a54a60 (application_master:main_loop/2 + 1600)
> y(0) <0.7.0>
> y(1) #Ref<0.0.20562.258360>
> y(2) Catch 0x00007f9bf9a54f70 (application_master:handle_msg/2 + 472)
>
> 0x00007f9c1adc3e40 Return addr 0x00007f9bfb969420 (proc_lib:init_p_do_apply/3 + 56)
> y(0) <0.7.0>
>
> 0x00007f9c1adc3e50 Return addr 0x00000000008827d8 (<terminate process normally>)
> y(0) Catch 0x00007f9bfb969440 (proc_lib:init_p_do_apply/3 + 88)
> -------------------------
>
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://erlang.org/mailman/listinfo/erlang-bugs
More information about the erlang-bugs
mailing list