[erlang-questions] Help with design of distributed fault-tolerant systems

Vance Shipley vances@REDACTED
Thu Oct 8 14:48:36 CEST 2015


Martin,

A challenge in sharing state between nodes in a cluster is that as the
amount of data, and number of nodes, increases the CPU and IO usage
rises, often to a point where it no longer makes sense.  One thing
we've done to mitigate this is to use multicasting where nodes update
other nodes in one message instead of one message per node as it is
with Erlang distribution.  The multicast solution has the advantage
that as the cluster size increases the traffic goes up linearly
instead of exponentially.  The downside is that it's not reliable
however often best effort is enough and late arrival data is useless
anyway in some cases.  I've often day dreamed about prototyping mnesia
replication using multicast ...

Something I haven't tried, but I see that others have, is using remote
direct memory access (RDMA) such as RoCE(*) which basically syncs a
chunk of memory on a network interface cards which you can read
directly (DMA).  Fun stuff I'm sure.  Probably getting out of hand for
your problem.  :)


(*) https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet


-- 
     -Vance



More information about the erlang-questions mailing list