[erlang-patches] Fix race in mnesia_monitor - lost/ignored node-up event

Thu Aug 29 11:43:18 CEST 2013

On 08/29/2013 11:00 AM, Jonas Falkevik wrote:
> Hi,
> if the network goes down for a short period (approx. 40 s), a node_down event is generated, followed by a node_up event that is not handled properly.
> The node_down & node_up events can be received before the remote process linked by mnesia_monitor has generated an EXIT message.
> Since mnesia_monitor handles the node_up event only after the EXIT message, and after some logic has marked the mnesia node as down, we have a race.
> Hence a network partition is not detected in all cases.
>
> To reproduce the problem I used two virtual machines and unplugged the cable for approx. 40 s
> while running net_adm:ping/1 between the nodes.
> I haven't been able to write an automated test case... yet.
>
> Please have a look at the following patch to fix the problem. There is most certainly a better way of fixing the race if you dig deeper into, or know, the mnesia internals.
> I can rebase the patch onto the maint branch if needed; currently it is based on the master branch.
>
> git fetch git://github.com/falkevik/otp.git mnesia_monitor_nodedown_race_fix
> https://github.com/falkevik/otp/compare/master...mnesia_monitor_nodedown_race_fix
> https://github.com/falkevik/otp/compare/master...mnesia_monitor_nodedown_race_fix.patch
>
>
> BRs,
> Jonas
Hello Jonas,
I've fetched your patch and assigned it to be reviewed by the
responsible developer.
Thanks,

-- 

BR Fredrik Gustafsson
Erlang OTP Team