[erlang-bugs] dist_ac deadlock
Decker, Nils
n.decker@REDACTED
Tue Nov 27 13:37:55 CET 2007
Hello everyone,
I sent this message last week, but it did not appear in the mailing list archive. Please excuse possible duplicates.
I have found a deadlock in the dist_ac process during a failover of a distributed application on 4 nodes.
I am using Release R11B2 (debian stable).
R11B5 (11.b.5dfsg-8 debian testing) shows the same behaviour.
The situation:
- 2 distributed applications defined to run on 4 nodes
with the same priority.
{distributed, [
{cmm_adm, [{'cmm1@REDACTED', ... 'cmm4@REDACTED'}]},
{cmm_db, [{'cmm1@REDACTED', ... 'cmm4@REDACTED'}]}
]}
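For illustration, the full kernel configuration might look like the following sketch. The host part of the node names is redacted above, so 'cmmN@host' is a placeholder; a tuple of nodes gives them all the same priority for the application:

```erlang
%% sys.config (sketch; the host name is a hypothetical placeholder).
%% A tuple of nodes means equal priority: any of them may be chosen
%% to run the application at failover.
[{kernel,
  [{distributed,
    [{cmm_adm, [{'cmm1@host', 'cmm2@host', 'cmm3@host', 'cmm4@host'}]},
     {cmm_db,  [{'cmm1@host', 'cmm2@host', 'cmm3@host', 'cmm4@host'}]}]},
   {sync_nodes_optional,
    ['cmm1@host', 'cmm2@host', 'cmm3@host', 'cmm4@host']},
   {sync_nodes_timeout, 5000}]}].
```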
- takeover to node4 using application:takeover(cmm_db, permanent),
same for cmm_adm
- kill this node using ^C followed by a (abort)
Observations:
- The application is started on node1
- application:info() shows the cmm_db application as running
on node1
- application:info() on the other nodes shows the applications
running on the killed node
- dist_ac:info() times out
I generated crash dumps on nodes 1-3.
dist_ac (<0.17.0>) on node1:
- Program counter: gen_server:loop
- Msg Queue Length: 0
- state of the application (from stack): local
dist_ac (<0.17.0>) on node2:
- Program counter: dist_ac:collect_answers/4
- Msg Queue Length: 5
{'EXIT',<0.183.0>,normal}
{internal_restart_appl,cmm_adm}
{'EXIT',<0.184.0>,normal}
{dist_ac_weight,cmm_adm,10,'cmm1@REDACTED'}
{nodedown,'cmm3@REDACTED'}
- state of the application (from stack): {failover,'cmm4@REDACTED'}
dist_ac (<0.17.0>) on node3:
- Program counter: dist_ac:collect_answers/4
- Msg Queue Length: 6
{'EXIT',<0.157.0>,normal}
{internal_restart_appl,cmm_adm}
{'EXIT',<0.158.0>,normal}
{dist_ac_weight,cmm_db,10,'cmm2@REDACTED'}
{dist_ac_weight,cmm_db,10,'cmm2@REDACTED'}
{dist_ac_weight,cmm_adm,10,'cmm1@REDACTED'}
- state of the application (from stack): {failover,'cmm4@REDACTED'}
The comment before the function dist_ac:collect_answers/4 states that dist_ac must always be prepared to handle dist_ac_weight messages, yet collect_answers does not handle this message.
This appears to be a problem when more than two nodes have the same priority for an application.
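To make the failure mode concrete, here is a minimal sketch (not the actual dist_ac source; only the message names follow the crash dumps above) of a selective receive like the one in collect_answers/4. Because there is no clause matching dist_ac_weight, those messages simply accumulate in the mailbox, and if every equal-priority node is blocked in this receive at the same time, no answer is ever sent and all of them hang:

```erlang
%% deadlock_sketch.erl -- illustration only, NOT the real OTP code.
-module(deadlock_sketch).
-export([collect_answers/3]).

%% Wait for one {dist_ac_answer, Node, Answer} per expected node.
collect_answers([Node | Nodes], Name, Acc) ->
    receive
        {dist_ac_answer, Node, Answer} ->
            collect_answers(Nodes, Name, [{Node, Answer} | Acc])
        %% No clause for {dist_ac_weight, Name, Weight, From}:
        %% such messages stay queued (as seen in the crash dumps),
        %% and if every peer is blocked here too, nobody ever
        %% produces a dist_ac_answer -- a classic deadlock.
    end;
collect_answers([], _Name, Acc) ->
    lists:reverse(Acc).
```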
As a workaround, I have changed my application to use different priorities on the nodes (a list of nodes instead of a tuple of nodes). This works, but has the drawback that a takeover is performed when the failed node comes back online, forcing the re-initialisation of the communication with a number of external systems.
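The workaround corresponds to replacing the tuple with an ordered list, which assigns strictly decreasing priority (again with a placeholder host name):

```erlang
%% Ordered list instead of tuple: cmm1 now has the highest priority,
%% so OTP performs a takeover back to cmm1 when that node returns --
%% the drawback described above.
{distributed,
 [{cmm_adm, ['cmm1@host', 'cmm2@host', 'cmm3@host', 'cmm4@host']},
  {cmm_db,  ['cmm1@host', 'cmm2@host', 'cmm3@host', 'cmm4@host']}]}
```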
I could provide crash dumps or try to reproduce the error with a minimal application if needed.
Regards
Nils Decker
--
- MCI, everything from a single source -
_____________________________
Nils Decker
Project planning
Studio Hamburg Media Consult International (MCI) GmbH Jenfelder Allee 80
22039 Hamburg
phone: +49 (0)40 66 88 34 37
fax: +49 (0)40 66 88 52 22
E-mail: n.decker@REDACTED
Web: www.mci-broadcast.com
Management: Ralf Schimmel
Authorized signatories: Jörn Denneborg, Jörg Pankow
Amtsgericht Hamburg, HRB 70454