[erlang-bugs] dist_ac deadlock

Tue Nov 27 13:37:55 CET 2007

Hello everyone,

I sent this message last week, but it does not appear in the mailing list archive. Please excuse possible duplicates.

I have found a deadlock in the dist_ac process during a failover of a distributed application on 4 nodes.

I am using release R11B2 (Debian stable).
R11B5 (11.b.5dfsg-8, Debian testing) shows the same behaviour.

The situation:
  - 2 distributed applications defined to run on 4 nodes
    with the same priority.
    {distributed, [
      {cmm_adm, [{'cmm1@REDACTED', ... 'cmm4@REDACTED'}]},
      {cmm_db,  [{'cmm1@REDACTED', ... 'cmm4@REDACTED'}]}
     ]}
  - takeover to node4 using application:takeover(cmm_db, permanent), 
    same for cmm_adm
  - kill this node using ^C a

Observations:

- The application is started on node1
  - application:info() shows the cmm_db application as running
    on node1
  - application:info() on the other nodes shows the applications
    running on the killed node
  - dist_ac:info() times out

I generated crash dumps on nodes 1 to 3.

dist_ac (0.17.0) on node1:
  - Program counter: gen_server:loop
  - Msg Queue Length: 0
  - state of the application ( from stack ): local

dist_ac (0.17.0) on node2:
  - Program counter: dist_ac:collect_answers/4
  - Msg Queue Length: 5
    {'EXIT',<0.183.0>,normal}
    {internal_restart_appl,cmm_adm}
    {'EXIT',<0.184.0>,normal}
    {dist_ac_weight,cmm_adm,10,'cmm1@REDACTED'}
    {nodedown,'cmm3@REDACTED'}
  - state of the application ( from stack ):
    {failover,'cmm4@REDACTED'},

dist_ac (0.17.0) on node3:
  - Program counter: dist_ac:collect_answers/4
  - Msg Queue Length: 6
    {'EXIT',<0.157.0>,normal}
    {internal_restart_appl,cmm_adm}
    {'EXIT',<0.158.0>,normal}
    {dist_ac_weight,cmm_db,10,'cmm2@REDACTED'}
    {dist_ac_weight,cmm_db,10,'cmm2@REDACTED'}
    {dist_ac_weight,cmm_adm,10,'cmm1@REDACTED'}
  - state of the application ( from stack ):
    {failover,'cmm4@REDACTED'},

The comment before the function dist_ac:collect_answers/4 states that dist_ac must always be prepared to handle dist_ac_weight messages. Yet collect_answers does not handle this message.
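
The effect can be illustrated with a small sketch. This is not the dist_ac source: the module name, the {answer, ...} message and demo/0 are invented for illustration, and only the shape of the dist_ac_weight tuple is taken from the message queues above. A process that blocks in a receive matching a single message shape leaves everything else in its mailbox:

```erlang
-module(selective_receive_sketch).
-export([demo/0]).

%% Hypothetical demo: spawn a process that waits only for
%% {answer, From, Value}, send it a dist_ac_weight-shaped message,
%% and report how many messages are left unhandled in its queue.
demo() ->
    Pid = spawn(fun wait_for_answer/0),
    Pid ! {dist_ac_weight, some_app, 10, node()},
    %% Give the process time to scan its mailbox and block again.
    timer:sleep(100),
    {message_queue_len, Len} = erlang:process_info(Pid, message_queue_len),
    exit(Pid, kill),
    Len.

%% Matches one message shape only; anything else stays queued.
wait_for_answer() ->
    receive
        {answer, _From, _Value} -> ok
    end.
```

Calling selective_receive_sketch:demo() should return 1: the dist_ac_weight message is still queued, much like the entries in the crash dumps for node2 and node3.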

This seems to be a problem whenever more than two nodes have the same priority for an application.

I have changed my application to use different priorities on the nodes (a list of nodes instead of a tuple of nodes). This works, but has the drawback that a takeover is performed when the failed node comes back online, forcing the re-initialisation of the communication to a number of external systems.
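
For reference, the difference between the two forms of the kernel `distributed` parameter looks roughly like this. It is only a sketch: the node names 'n1@host' to 'n4@host' are hypothetical placeholders, not my real node names.

```erlang
%% sys.config sketch; node names are hypothetical placeholders.
[{kernel,
  [{distributed,
    [
     %% Tuple inside the list: all four nodes share the same priority.
     %% This is the form that showed the deadlock:
     %%   {cmm_db, [{'n1@host', 'n2@host', 'n3@host', 'n4@host'}]}
     %%
     %% Plain list: strictly descending priority, 'n1@host' first.
     %% A higher-priority node takes the application over again
     %% when it comes back online.
     {cmm_db, ['n1@host', 'n2@host', 'n3@host', 'n4@host']}
    ]}
  ]}].
```

With the plain list the failover works, at the cost of the automatic takeover described above.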

I could provide crashdumps or try to reproduce the error with a minimal application if needed.

Regards
  Nils Decker
