Bug in OTP distributed app failover/takeover?

steve <>
Thu Jan 14 20:12:56 CET 2010


Hi all,

Posted this one a few days ago and got no takers. I'm upping the ante and
suggesting that there is a bug in how OTP application takeover currently
behaves and giving some sample code (see attached).

As I mentioned before (can it truly be that nobody was listening to this
super duper important topic?): it appears that application takeover requires
all nodes to be up; otherwise the node attempting the takeover crashes. That
would imply that takeover doesn't tolerate network partitions.

Or to say it another way:

In a 3-node OTP distributed application: if n1 and n2 are down and n1 (the
high-priority node) comes back up, n1 crashes saying that one of its
supervised gen_servers is already started (which really shouldn't be true).
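
For reference, a 3-node distributed application like the one described above
is usually set up through kernel parameters along these lines. This is only a
sketch and not the contents of the attached dist.tar.gz: the file name, the
application name dist, the host name myhost, the node ordering after n1, and
both timeout values are assumptions.

```erlang
%% n1.config (hypothetical file name), loaded with: erl -sname n1 -config n1
[{kernel,
  [%% Run dist on the first reachable node in this list; n1 has top priority.
   %% 5000 is how many ms the other nodes wait for a node to restart
   %% before failing the application over.
   {distributed, [{dist, 5000, [n1@myhost, n2@myhost, n3@myhost]}]},
   %% At boot, wait for these nodes, but only up to sync_nodes_timeout ms.
   {sync_nodes_optional, [n2@myhost, n3@myhost]},
   {sync_nodes_timeout, 30000}]}].
```

Each node would get its own copy, with sync_nodes_optional listing the other
two nodes.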

Heart won't help us because n1 does something really bad to n3 when it
crashes: it stops the application on n3. Ouch.

But, if n3 and n2 are both up and n1 comes back online, takeover happens
perfectly.
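
For context, the application callback module is where a node finds out that
it is starting because of a takeover. A minimal sketch, assuming a callback
module named dist_app and a top-level supervisor named dist_sup (both names
are hypothetical and not taken from the attachment):

```erlang
-module(dist_app).
-behaviour(application).
-export([start/2, stop/1]).

%% Ordinary start. Also used after a failover unless the .app file
%% defines the start_phases key.
start(normal, _Args) ->
    dist_sup:start_link();
%% A higher-priority node is taking the application over from FromNode.
start({takeover, _FromNode}, _Args) ->
    dist_sup:start_link();
%% Failover start, only seen when start_phases is defined in the .app file.
start({failover, _FromNode}, _Args) ->
    dist_sup:start_link().

stop(_State) ->
    ok.
```

During a takeover the new instance is started before the old one is stopped,
so for a short while both are running.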

Wouldn't most agree that takeover is expected to happen without a crash even
if one of the nodes is still down or partitioned (it's kinda the whole
point)?

I've attached a sample app. To see the trouble in action, start three
shells and run the start script in each, like this: ./start.sh n1,
./start.sh n2, etc. Kill n1 and n2 and restart n1. You should see n1 crash
and n3 stop the app.

It's worth noting that heart would eventually restart the application on n1,
and n3 does eventually rejoin without having to be restarted, but in the
meantime the whole app appears to be down for some time. Bug? Feature? Bug
that was once a feature?

I'm still holding out hope that I've misunderstood something or
misconfigured something. Or is there something simple I can do to make this
work the way one would expect? Someone please enlighten me.

Your erlang comrade,
Steve
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dist.tar.gz
Type: application/x-gzip
Size: 56165 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20100114/195d2444/attachment.bin>

