[erlang-patches] Fix race in mnesia_monitor - lost/ignored node-up event

Thu Aug 29 11:00:14 CEST 2013

Hi,
If the network goes down for a short period (approx. 40 s), a node_down event is generated, followed by a node_up event that is not handled properly.
The node_down and node_up events can be received before the EXIT message from the remote process linked by mnesia_monitor arrives.
Since mnesia_monitor handles the node_up event properly only after it has received the EXIT message and run some logic to set the mnesia node as down, we have a race.
Hence network partitioning is not detected in all cases.

To reproduce the problem I used two virtual machines and unplugged the cable for approx. 40 s while running net_adm:ping/1 between the nodes.
I haven't been able to write an automated test case yet.
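
A minimal sketch of the kind of ping loop that can be run on one node during the manual reproduction. The module name ping_watch and its functions are hypothetical and not part of the patch; the sketch only assumes the standard net_adm:ping/1 and net_kernel:monitor_nodes/1 calls. It prints each ping result together with any nodeup/nodedown messages the calling process has received, so the order of events around the cable pull can be seen.

```erlang
-module(ping_watch).
-export([run/3, ping_loop/3]).

%% Subscribe to nodeup/nodedown messages, ping Node Count times with
%% IntervalMs milliseconds between pings, then unsubscribe.
%% Returns the list of ping results (pong | pang), oldest first.
run(Node, Count, IntervalMs) ->
    ok = net_kernel:monitor_nodes(true),
    Results = ping_loop(Node, Count, IntervalMs),
    drain_node_events(),
    ok = net_kernel:monitor_nodes(false),
    Results.

ping_loop(_Node, 0, _IntervalMs) ->
    [];
ping_loop(Node, Count, IntervalMs) when Count > 0 ->
    Result = net_adm:ping(Node),
    io:format("~p ping ~p -> ~p~n", [erlang:localtime(), Node, Result]),
    drain_node_events(),
    timer:sleep(IntervalMs),
    [Result | ping_loop(Node, Count - 1, IntervalMs)].

%% Print every nodeup/nodedown message already in the mailbox
%% without blocking.
drain_node_events() ->
    receive
        {nodeup, N} ->
            io:format("~p nodeup ~p~n", [erlang:localtime(), N]),
            drain_node_events();
        {nodedown, N} ->
            io:format("~p nodedown ~p~n", [erlang:localtime(), N]),
            drain_node_events()
    after 0 ->
        ok
    end.
```

For example, ping_watch:run('b@host2', 120, 1000) pings a (hypothetical) node b@host2 once per second for two minutes. After the cable is plugged back in, the output can be compared by hand with mnesia:system_info(running_db_nodes) on both nodes.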

Please have a look at the following patch, which fixes the problem. There is most certainly a better way of fixing the race if you dig deeper into, or already know, the mnesia internals.
The patch is currently based on the master branch; I can rebase it on the maint branch if needed.

git fetch git://github.com/falkevik/otp.git mnesia_monitor_nodedown_race_fix
https://github.com/falkevik/otp/compare/master...mnesia_monitor_nodedown_race_fix
https://github.com/falkevik/otp/compare/master...mnesia_monitor_nodedown_race_fix.patch

BRs,
Jonas