Network partition and OTP

Reto Kramer kramer@REDACTED
Tue Apr 29 22:14:22 CEST 2003


I'm looking for information on how OTP behaves when the network between 
nodes fails and later reconnects (the nodes themselves stay up the whole 
time).

** Question 1 **
In particular the behavior of "global", the "distributed application 
controller" and Ulf's "locker" (contrib page) is what I'd like to 
understand better in network partition/reconnect scenarios.

I've found references to work of Thomas Arts et al [1,2] and Ulf Wiger 
[3] and snippets here and there, but it would be most helpful to me if 
an OTP wizard could illuminate this topic comprehensively.

For "global" one has to expect "name conflict" errors when the network 
comes back together. By extension I guess the same applies to the 
application controller (via its use of global).  Not sure about Ulf's 
locker.  Using Ulf's release handling tutorial example, I can generate 
a naming conflict and observe what happens:

  1. start n1, then n2 (owner)
  2. suspend the erl process that runs n2; dist fails over to n1
  3. resume the erl process that runs n2 and ping n1 -> naming conflict
  4. dist_server on n2 is killed
  5. supervisor restarts n2, which takes over from n1

The takeover handshake is not logged - does it happen?

=INFO REPORT==== 29-Apr-2003::12:59:39 ===
global: Name conflict terminating {dist_server,<1930.59.0>}
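
To line such a report up with the moment the partition heals, a small 
watcher process can log connectivity changes. This is only a sketch: the 
module name partition_watch is made up for illustration, and it assumes 
the node was started as a distributed node (for example with -sname). 
net_kernel:monitor_nodes/1 and global:whereis_name/1 are standard kernel 
calls.

```erlang
-module(partition_watch).
-export([start/0]).

%% Spawn a process that subscribes to nodeup/nodedown messages.
start() ->
    spawn(fun() ->
                  net_kernel:monitor_nodes(true),
                  watch()
          end).

watch() ->
    receive
        {nodedown, Node} ->
            error_logger:info_msg("lost contact with ~p~n", [Node]),
            watch();
        {nodeup, Node} ->
            %% Show who holds the global name right after reconnecting.
            error_logger:info_msg("regained contact with ~p, dist_server is ~p~n",
                                  [Node, global:whereis_name(dist_server)]),
            watch()
    end.
```

Running partition_watch:start() on both n1 and n2 before suspending n2 
should make the order of nodedown, nodeup and the name conflict report 
visible in each node's log.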

** Question 2 ** Is there any risk of losing messages that were 
buffered by the dist_server instance just before it got killed?  I'm 
worried that while the global:register_name etc. calls are atomic across 
nodes [docs and 2], a potential client (client of dist_server I mean 
here) is not part of the atomic conflict resolution/re-registering 
process.

I noticed the "relay" function in Ulf's release handling tutorial [3], 
but am not sure it kicks in when global detects the naming conflict 
upon reconnect - I guess not, correct?
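
For illustration, one way the losing instance could get a chance to hand 
over its buffered messages is to register with a different resolve 
function. The sketch below uses global:register_name/3 with 
global:random_notify_name/3, which keeps one pid registered and sends 
{global_name_conflict, Name} to the other instead of killing it. The 
module name dist_server_sketch and the HandOff callback are hypothetical 
and not part of Ulf's tutorial. Whether this closes the window for 
clients is exactly the open question.

```erlang
-module(dist_server_sketch).
-export([start/1]).

%% HandOff is a hypothetical callback, fun(PendingMsgs) -> any(), supplied
%% by the caller to decide what happens to messages buffered so far.
start(HandOff) ->
    Pid = spawn(fun() -> loop([], HandOff) end),
    %% The loser of a name conflict is notified rather than killed.
    Result = global:register_name(dist_server, Pid,
                                  fun global:random_notify_name/3),
    {Result, Pid}.   % Result is yes | no

loop(Pending, HandOff) ->
    receive
        {global_name_conflict, dist_server} ->
            %% This instance lost the conflict: hand over and stop.
            HandOff(lists:reverse(Pending));
        Msg ->
            %% Buffer everything else, newest first.
            loop([Msg | Pending], HandOff)
    end.
```

Messages a client sends to the losing pid after the notification has 
been handled are still lost in this sketch, so clients would need to 
look the name up again or retry.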

** Question 3 ** - somewhat related to the above:
Is there any library support for "majority voting" and/or "lease 
management" in OTP that I've not discovered yet?  In particular I'm 
interested in rejecting a global:register_name/2 call if the calling 
process is not on a node in the majority set.
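
A minimal sketch of the kind of guard meant here, written by hand in the 
absence of library support. The module name quorum_reg and both 
functions are hypothetical. ClusterNodes is assumed to be a fixed, 
configured list of all nodes, since a node cannot tell from nodes() 
alone how large the full cluster is.

```erlang
-module(quorum_reg).
-export([has_majority/2, register_if_majority/3]).

%% True if strictly more than half of ClusterNodes appear in VisibleNodes.
has_majority(ClusterNodes, VisibleNodes) ->
    Visible = [N || N <- ClusterNodes, lists:member(N, VisibleNodes)],
    2 * length(Visible) > length(ClusterNodes).

%% Register Name only when this node currently sees a majority.
%% Returns yes | no from global, or {error, no_majority}.
register_if_majority(Name, Pid, ClusterNodes) ->
    case has_majority(ClusterNodes, [node() | nodes()]) of
        true  -> global:register_name(Name, Pid);
        false -> {error, no_majority}
    end.
```

The check and the registration are not atomic: the network can split 
between the two calls, so this narrows the problem rather than solving 
it.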

Thanks,
- Reto

References:

Thomas Arts et al [1,2], Ulf Wiger [3]

[1] http://www.ericsson.com/cslab/~thomas/publ2.shtml
    (resource locker case study)
[2] http://www.erlang.org/ml-archive/erlang-questions/200107/msg00031.html
    (christian paper)
[3] OTP release handling tutorial by Ulf; it was on the newsgroup, but I
    cannot find the reference right now

______________________
There are two ways of constructing a software design. One way is to 
make it so simple that there are obviously no deficiencies. And the 
other way is to make it so complicated that there are no obvious 
deficiencies.

C.A.R. Hoare
1980 Turing Award Lecture