Network partition and OTP

Asko Husso etxhua@REDACTED
Wed Apr 30 09:49:19 CEST 2003


In the AXD301 product this problem is handled by setting
the kernel flag 'dist_auto_connect' to 'once'.
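
For reference, a minimal sketch of how this flag can be set. The use of
a sys.config file is an assumption here (the usual OTP convention), not
something specific to AXD301:

```erlang
%% sys.config (hypothetical file for illustration)
%% With 'once', a connection to a given node is set up automatically
%% only the first time; after it goes down it is not re-established
%% automatically.
[{kernel, [{dist_auto_connect, once}]}].
```

The same setting can be given on the command line as
"erl -kernel dist_auto_connect once".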

Why, you may ask?

Because the ugliest (perverted?) thing that can happen is when one
uses the automatic node reconnect feature and has flipping
communication (down->up->down and so on). This can really screw up the
distributed applications (global, dist_ac).

Before we changed 'dist_auto_connect' to 'once' we could see some
systems (at customer sites) whose dist_ac was totally screwed up.
We could only pray that this disastrous situation would escalate to
a node restart so that everything would clear up.

What happens then if a connection can only be set up once?
Well, we have implemented a simple resolve protocol that is activated
between the two nodes that lose connection. (Each Erlang node has a UDP
port that is always ready to receive messages.)
Both nodes involved decide which node is more important and select
the less important one. After some minor handshaking, that node (the
one with the lowest priority) is restarted.
When it comes up again it will reconnect.
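
The decision step could look roughly like the sketch below. It only
illustrates the idea and is not the AXD301 code: the module name, the
{Node, Priority} representation and the tie-break on node name are all
assumptions made here. The point is that both nodes evaluate the same
deterministic function, so they reach the same answer about which one
is to be restarted.

```erlang
-module(resolve_sketch).
-export([node_to_restart/2]).

%% Hypothetical representation: {NodeName, Priority}, where a higher
%% Priority means a more important node.
%% Returns the name of the node that should be restarted.
node_to_restart({NodeA, PrioA}, {_NodeB, PrioB}) when PrioA < PrioB ->
    NodeA;
node_to_restart({_NodeA, PrioA}, {NodeB, PrioB}) when PrioB < PrioA ->
    NodeB;
node_to_restart({NodeA, _}, {NodeB, _}) ->
    %% Equal priority: break the tie on the node name so that both
    %% sides still reach the same answer.
    erlang:min(NodeA, NodeB).
```

The argument order does not matter, which is what lets each side run
the function on its own.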

This solution has worked quite well and has been enhanced as we found
more ugly cases. We even try to discover which of the two nodes is
the "guilty" party. For example, if one node loses connection to more
than one node it "must" be guilty. Such a case can happen, for instance,
if some huge garbage collection takes up all execution time. In that
case only the "guilty" node is restarted and the other nodes involved
are unharmed.
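
The "guilty" rule can be sketched as a small predicate. Again this is
an illustration under assumptions, not our actual code: the module name
is made up, and the list of lost peers is assumed to be collected
elsewhere (for example from nodedown messages).

```erlang
-module(guilty_sketch).
-export([is_guilty/1]).

%% LostPeers: names of the nodes to which this node has lost its
%% connection during the current incident (hypothetical input).
%% A node that has lost more than one distinct peer is considered
%% guilty; duplicates in the list are counted once.
is_guilty(LostPeers) ->
    length(lists:usort(LostPeers)) > 1.
```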

I feel that the automatic node reconnect feature might be nice
for small systems with very few applications. But it will still
be a lot of work to handle the reconnect case correctly. I'm not sure,
but I think that very few have thought about handling this error case.
I haven't seen anything in the OTP documentation about this, but then
I seldom read all documentation that carefully.

Asko Husso                       	E-mail:etxhua@REDACTED
Ericsson AB		  	        Phone: +46 8 7192324
Varuvägen 9B                               
S-126 25 Stockholm-Älvsjö, Sweden




