Erlang and high availability

Wed Jun 7 04:59:35 CEST 2006

I'm relatively new to Erlang, but I've read up on this area 
recently, so I'll have a crack at this:

The standard Erlang distribution mechanism uses a single TCP connection 
between each pair of nodes, so out of the box you cannot use multiple 
networks to connect them.  However, according to the ERTS User's Guide, 
you can write your own distribution driver to do whatever you want.  You 
could use multiple TCP connections, or perhaps try SCTP (google for 
erlang sctp).  As for monitoring for partial link failure, I would be 
surprised if a driver could not be made to supply such information.

The Erlang distribution mechanism already takes care of heartbeats 
between connected nodes, so I don't see why anything would need to be 
done on top of that.  See the BIF monitor_node/2, which sends the calling 
process a {nodedown, Node} message when a monitored node goes down.
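
As a rough illustration of the monitor_node approach, here is a minimal 
sketch.  The module name node_watch, the function names and the node name 
'b@host2' are made up for this example; only erlang:monitor_node/2 and the 
{nodedown, Node} message come from the standard library.  It assumes the 
local node was started as a distributed node (for instance with erl -sname a).

```erlang
-module(node_watch).
-export([watch/2, wait_down/2]).

%% Hypothetical helper: monitor Node and wait up to TimeoutMs milliseconds
%% for it to go down. Returns {down, Node} or still_up.
watch(Node, TimeoutMs) ->
    %% Ask the runtime to send us {nodedown, Node} if the node is lost.
    true = erlang:monitor_node(Node, true),
    case wait_down(Node, TimeoutMs) of
        {down, Node} = Down ->
            Down;
        still_up ->
            %% Nothing happened within the timeout: cancel the monitor.
            erlang:monitor_node(Node, false),
            still_up
    end.

%% Wait for the nodedown message that monitor_node/2 delivers.
wait_down(Node, TimeoutMs) ->
    receive
        {nodedown, Node} ->
            {down, Node}
    after TimeoutMs ->
        still_up
    end.
```

From the shell of node a, something like node_watch:watch('b@host2', 60000) 
would then report whether b was lost during the next minute.  Note that this 
only tells you that the connection to the node is gone; it does not say 
whether the node itself or the network between the nodes has failed.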

Cheers,
Dan.

Loic Domaigne wrote:
> Dear Erlangers,
>
> for several years now I have been dealing with scalability, concurrency, 
> high availability and fault tolerance problems using mainstream 
> languages like C/C++/Java etc. I think I have a reasonably good 
> understanding of what can be achieved with those languages, and what 
> their limitations are.
>
> Although it is difficult to teach an old dog new tricks, I really 
> like to see beyond my own nose. I am interested in solving these problems 
> by using truly different approaches. Erlang looks like one of the most 
> promising languages in that regard.
>
> For my first case study, I would like to consider the standard 
> heartbeat problem for a two-node cluster. The cluster is composed of 
> two physically distinct nodes A and B. The nodes may have different 
> hardware architectures/OSes (to emphasize portability aspects). 
> The two nodes are connected via two physically different networks N and N'.
>
> I'd like to implement a simple heartbeat mechanism that achieves the 
> following:
>
> (*) detect a network failure: no heartbeat received over network N
>     (resp. N') within a (pre-defined) period of time,  but heartbeat
>     received over N' (resp. N) within the same period.
>
> (*) detect a failure of node A or B: no heartbeat from the
>     corresponding node over both networks N and N' within a (pre-defined)
>     period of time.
>
> Ideally, the heartbeat mechanism should use a lightweight protocol 
> (like UDP).
>
>
> The first idea that comes to my mind would be to use gen_udp and 
> implement the protocol from the ground up. But that's something I'd like 
> to avoid, since I would eventually have to manage the 
> architecture differences between the nodes.
>
> Furthermore, I am wondering if there are perhaps neater solutions to 
> this problem. First, Erlang has a built-in mechanism for exchanging 
> messages between processes. Second, Erlang already performs heartbeats 
> between connected nodes.
>
>
> I would be thankful for any advice, links to documents or code that 
> would help me take the first step in the right direction.
>
> Thanks in advance,
> Loic.
>
> N.B. My apologies if you have answered a similar question already. 
> Unfortunately the search function for the erlang.org mailing list 
> archives doesn't work.
>
>