[erlang-questions] Mnesia - Partitioned network problem

Ulf Wiger ulf.wiger@REDACTED
Mon Sep 5 12:57:48 CEST 2011


On 5 Sep 2011, at 11:48, Erik Seres wrote:

> Due to an intermittent network failure, replication had stopped. In the logs for x@REDACTED was the following message:
> 
> =ERROR REPORT==== 23-Aug-2011::23:56:37 ===
> ** Node 'x@REDACTED' not responding **
> ** Removing (timedout) connection **

Sometimes this message appears on both sides, and sometimes only on one, depending on the cause. If the other side doesn't time out and issue the above message itself, it will still detect the socket being closed, but that doesn't cause a message to be logged.

> This message went unnoticed until x@REDACTED was restarted for an unrelated reason and x@REDACTED logged the following error at startup:
> 
> =ERROR REPORT==== 24-Aug-2011::14:22:09 ===
> Mnesia('x@REDACTED'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'x@REDACTED'}

This event is raised when the nodes try to reconnect. The node that restarted will hail the one that is already running (x@REDACTED), which finds out that both nodes considered the other 'down'. Unfortunately, this event is not issued on both nodes.
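
To avoid having the event go unnoticed in the logs, a process can subscribe to Mnesia's system events and react to it. The following is only a minimal sketch: the module name partition_watch and the helper classify/1 are made up for illustration, and mnesia must already be running on the node, otherwise mnesia:subscribe/1 returns an error and the match below crashes the process.

```erlang
-module(partition_watch).
-export([start/0, classify/1]).

%% Spawn a process that listens for Mnesia system events.
%% Assumes mnesia is already started on this node.
start() ->
    spawn(fun() ->
        {ok, _Node} = mnesia:subscribe(system),
        loop()
    end).

loop() ->
    receive
        {mnesia_system_event, Event} ->
            case classify(Event) of
                {partitioned, Context, Node} ->
                    %% Placeholder reaction: a real system would start
                    %% its own recovery procedure here.
                    error_logger:error_msg(
                      "Mnesia partition (~p) detected against ~p~n",
                      [Context, Node]);
                ignore ->
                    ok
            end,
            loop();
        _Other ->
            %% Drop unrelated messages so the mailbox does not grow.
            loop()
    end.

%% Pure helper (hypothetical): pick out the event quoted in the log above.
classify({inconsistent_database, Context, Node}) ->
    {partitioned, Context, Node};
classify(_Event) ->
    ignore.
```

Since the event is only issued on one of the nodes, such a watcher would have to run on every node that holds replicas.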

The systems didn't reconnect until x@REDACTED restarted because you happen to have no processes that actively try to send messages to the other node. By default, Erlang nodes will connect automatically on demand, but this can be turned off (using -kernel dist_auto_connect X, where X :: once | never).
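
As an illustration, the same kernel parameter can be set in a sys.config file instead of on the command line. This is a sketch of the configuration only; whether 'once' or 'never' is the right value depends on the system.

```erlang
%% sys.config fragment, equivalent to: erl -kernel dist_auto_connect once
%% once  : connect automatically, but only once per node; after a node
%%         goes down it must be connected explicitly.
%% never : never connect automatically.
[{kernel, [{dist_auto_connect, once}]}].
```

With either setting, a lost connection stays down until something calls net_kernel:connect_node/1 for the other node, which leaves the application in control of when (and whether) the two sides rejoin.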

Sometimes, people will have a process that simply sends a UDP message to corresponding processes on the other nodes. If such a message is received from a node that you thought was down (not in the nodes() list), then you have a partitioned network, and need to take steps to resolve it. Mnesia doesn't do this for you; but it's really not that hard to supply the logic - once you realise that you have to do it yourself. ;-)
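
A rough sketch of such a UDP probe is shown below. Everything in it is an assumption made for illustration: the module name partition_probe, the 5 second interval, the payload (the sender's node name as a binary) and the Peers list of {Node, Host} pairs for the other nodes. It only detects and logs the condition; resolving it is left to the application. Note that a peer that has simply never been connected also shows up as "not in nodes()", so a real implementation would need to tell those cases apart.

```erlang
-module(partition_probe).
-export([start/2, suspect/3]).

-define(INTERVAL, 5000).  %% heartbeat period in ms (arbitrary choice)

%% Peers :: [{node(), Host}], the other nodes and where to reach them.
%% The same UDP port is assumed on every host.
start(Port, Peers) ->
    spawn(fun() ->
        {ok, Sock} = gen_udp:open(Port, [binary, {active, true}]),
        self() ! beat,
        loop(Sock, Port, Peers)
    end).

loop(Sock, Port, Peers) ->
    receive
        beat ->
            %% Announce ourselves to every peer outside Erlang distribution.
            Me = atom_to_binary(node(), utf8),
            lists:foreach(
              fun({_Node, Host}) -> gen_udp:send(Sock, Host, Port, Me) end,
              Peers),
            erlang:send_after(?INTERVAL, self(), beat),
            loop(Sock, Port, Peers);
        {udp, Sock, _Ip, _InPort, Packet} ->
            Known = [N || {N, _Host} <- Peers],
            case suspect(Packet, Known, nodes()) of
                {partitioned, Node} ->
                    %% Placeholder: hook in the resolution logic here.
                    error_logger:warning_msg(
                      "~p is alive but not connected: partitioned network~n",
                      [Node]);
                ok ->
                    ok
            end,
            loop(Sock, Port, Peers)
    end.

%% Pure check: a heartbeat from a known peer that is missing from the
%% connected list means both sides are up but not talking.
suspect(Packet, Known, Connected) ->
    case [N || N <- Known, atom_to_binary(N, utf8) =:= Packet] of
        [Node] ->
            case lists:member(Node, Connected) of
                true  -> ok;
                false -> {partitioned, Node}
            end;
        _ ->
            ok  %% unknown sender, ignore
    end.
```

Comparing the packet against the binary form of known node names avoids creating atoms from untrusted network input.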

BR,
Ulf W

Ulf Wiger, CTO, Erlang Solutions, Ltd.
http://erlang-solutions.com


