Low pain massively multiplayer systems: Peer review requested
Mon Sep 26 14:56:34 CEST 2005
Joel Reymont wrote:
>> How do you handle partitioned networks?
> I'm not. What are the conditions under which Mnesia partitions?
I suggest that you search the mailing list for "mnesia partition"
keywords. This issue has been mentioned several times. For example see
Network partitioning happens when two nodes lose connectivity and drop
the timed-out connection, and then, once the network is healed, detect
connectivity again (say, when some process tries to send a distributed
message) and reconnect. Network partitioning is a problem not just for
Mnesia, but also for global and dist_ac (perhaps other apps as well?).
The suggested approach is to set the 'dist_auto_connect' kernel option
to 'once' and to run a custom UDP ping protocol between the partitioned
nodes. When the ping detects that connectivity is back, you determine a
master node and restart the others, so that they get their state
replicated from the master at startup.
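
As a rough illustration of the kernel setting mentioned above, a minimal
sys.config fragment might look like the one below. The file name and the
way it is loaded (erl -config sys) are assumptions made for this sketch.
The UDP ping protocol and the master election are not shown.

```erlang
%% sys.config (hypothetical file name), loaded with: erl -config sys
%% With 'once', a connection to a given node is set up automatically
%% only the first time. After that node goes down, it has to be
%% reconnected explicitly (for example with net_kernel:connect_node/1),
%% so a healed partition is not silently rejoined.
[
 {kernel, [
   {dist_auto_connect, once}
 ]}
].
```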
Because of this limitation, when you have master and replica Mnesia
tiers, a network partition forces you to restart the replica nodes,
which makes them unavailable to serve client requests until they come
back up and resynchronize their Mnesia data.
You would probably need some randomization so that not all replica
nodes get restarted simultaneously (to avoid a complete interruption
of service).
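
The randomization could be sketched roughly as follows. The module name
staggered_restart, both function names, and the MaxMs parameter are
hypothetical and only serve this example. It assumes each replica calls
schedule_restart/1 once it has decided that it must restart.

```erlang
-module(staggered_restart).
-export([restart_delay_ms/1, schedule_restart/1]).

%% Pick a random delay in the range 1..MaxMs milliseconds.
restart_delay_ms(MaxMs) when is_integer(MaxMs), MaxMs > 0 ->
    rand:uniform(MaxMs).

%% Restart this node after a random delay, so that replicas that
%% detect the same partition are unlikely to go down at the same
%% moment. Returns the chosen delay in milliseconds.
schedule_restart(MaxMs) ->
    Delay = restart_delay_ms(MaxMs),
    {ok, _TRef} = timer:apply_after(Delay, init, restart, []),
    Delay.
```

Random delays only make simultaneous restarts unlikely, they do not rule
them out. If at least one replica must always stay up, the restarts
would need explicit coordination instead.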
I found it very convenient to test network partitioning by having two
servers with dual interfaces connected to two switches like this:
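  +----------------------+
  | Controlling terminal |
  +----------+-----------+
             |
        +----+----+                 +---------+
        | Switch1 |                 | Switch2 |
        +-+-----+-+                 +-+-----+-+
          |     |                     |     |
          |     |     +-------+       |     |
          |     +-----+ Host1 +-------+     |
          |           +-------+             |
          |           +-------+             |
          +-----------+ Host2 +-------------+
                      +-------+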
Run Erlang distribution on Host1 and Host2, and use the
'inet_dist_use_interface' kernel option to bind it only to the interface
connected to Switch2. You can then bring this interface up and down
without affecting the hosts' connectivity to the controlling terminal.
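
For illustration, binding the distribution to one interface might be
configured along these lines. The address 192.168.2.1 is a made-up
placeholder for Host1's interface on Switch2, and the sys.config file
name is likewise an assumption.

```erlang
%% sys.config (hypothetical file name) for Host1, loaded with:
%%   erl -config sys
%% The value is the IP address, written as an Erlang tuple, of the
%% local interface that Erlang distribution should bind to.
%% 192.168.2.1 is a placeholder for the interface facing Switch2.
[
 {kernel, [
   {inet_dist_use_interface, {192,168,2,1}}
 ]}
].
```

Host2 would use the address of its own interface on Switch2.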
Switches 1 and 2 could be replaced by hubs, though using Cisco 3750
routers instead would let you test other interesting features (like
HSRP) that increase the network's resilience to faults.
More information about the erlang-questions