[erlang-patches] Fix race in mnesia_monitor - lost/ignored node-up event
    Jonas Falkevik 
    jonas.falkevik@REDACTED
       
    Thu Aug 29 11:00:14 CEST 2013
    
    
  
Hi,
If the network goes down for a short period (approx. 40 s), a node_down event is generated, followed by a node_up event that is not handled properly.
The node_down and node_up events can be received before the EXIT message for the remote process linked by mnesia_monitor has been generated.
Since mnesia_monitor handles the node_up event only after it has received the EXIT message and run some logic to mark the mnesia node as down, we have a race.
Hence a network partition is not detected in all cases.
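To observe the ordering described above, something like the following watcher could be run on one of the nodes. It is only a diagnostic sketch and not how mnesia_monitor is implemented; the module name event_order and the functions watch/1 and tag/1 are made up for illustration, and RemotePid is assumed to be a pid living on the other node.

```erlang
-module(event_order).
-export([watch/1, tag/1]).

%% Spawn a watcher that links to RemotePid (a process on the other node)
%% and prints node and EXIT messages in the order they arrive.
watch(RemotePid) ->
    spawn(fun() ->
        process_flag(trap_exit, true),        % turn exit signals into messages
        ok = net_kernel:monitor_nodes(true),  % subscribe to nodeup/nodedown
        link(RemotePid),
        loop()
    end).

loop() ->
    receive
        Msg ->
            io:format("~p: ~p~n", [tag(Msg), Msg]),
            loop()
    end.

%% Classify a received message (hypothetical helper, used for printing only).
tag({nodedown, _Node}) -> nodedown;
tag({nodeup, _Node}) -> nodeup;
tag({'EXIT', _Pid, _Reason}) -> exit;
tag(_Other) -> other.
```

If the race is hit, the printout should show nodedown and nodeup before the exit line for the linked process.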
To reproduce the problem, I used two virtual machines and unplugged the network cable for approx. 40 s while running net_adm:ping/1 between the nodes.
I haven't been able to write an automated test case... yet.
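For the manual reproduction, the repeated ping could be scripted roughly as below. This is a sketch under assumptions: the module name ping_loop and the function ping_n/3 are invented here, and the interval and count are arbitrary. Only net_adm:ping/1 comes from the setup described above.

```erlang
-module(ping_loop).
-export([ping_n/3]).

%% Ping Node N times, sleeping IntervalMs milliseconds between attempts.
%% Returns the results oldest first; each result is pong or pang.
ping_n(_Node, 0, _IntervalMs) ->
    [];
ping_n(Node, N, IntervalMs) when N > 0 ->
    Result = net_adm:ping(Node),
    io:format("~p ~p -> ~p~n", [erlang:localtime(), Node, Result]),
    timer:sleep(IntervalMs),
    [Result | ping_n(Node, N - 1, IntervalMs)].
```

For example, ping_loop:ping_n('b@vm2', 120, 1000) pings a hypothetical node b@vm2 about once per second, which leaves time to unplug the cable and plug it back in.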
Please have a look at the following patch, which fixes the problem. There is most certainly a better way of fixing the race if you dig deeper into, or know, the mnesia internals.
The patch is currently based on the master branch; I can rebase it on the maint branch if needed.
git fetch git://github.com/falkevik/otp.git mnesia_monitor_nodedown_race_fix
https://github.com/falkevik/otp/compare/master...mnesia_monitor_nodedown_race_fix
https://github.com/falkevik/otp/compare/master...mnesia_monitor_nodedown_race_fix.patch
BRs,
Jonas
    
    