Hi all,

Posted this one a few days ago and got no takers. I'm upping the ante and suggesting that there is a bug in how OTP application takeover currently behaves, and giving some sample code (see attached).

As I mentioned before (can it truly be that nobody was listening to this super duper important topic?): it appears that application takeover requires all nodes to be up for takeover to work without crashing the node that is attempting the takeover. That would imply that takeover doesn't tolerate network partitions.

Or to say it another way:

In a three-node OTP distributed application, if n1 and n2 are down and n1 (the high-priority node) comes back up, n1 crashes, saying that one of its supervised gen_servers is already started (which really shouldn't be true).

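For concreteness, the setup I mean is the standard kernel distributed-application config; mine looks roughly like this (myapp and the node names are stand-ins for my real ones, and every node gets the same distributed spec):

    %% sys.config -- n1 has highest priority; n2 and n3 are
    %% equal-priority fallbacks. 5000 is the failover timeout in ms.
    [{kernel,
      [{distributed,
        [{myapp, 5000, ['n1@localhost', {'n2@localhost', 'n3@localhost'}]}]},
       %% wait up to 5s for the other nodes at boot, then carry on
       {sync_nodes_optional, ['n2@localhost', 'n3@localhost']},
       {sync_nodes_timeout, 5000}]}].
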
Heart won't help us here, because n1 does something really bad to n3 when it crashes: it stops the application on n3. Ouch.

But if n2 and n3 are both up and n1 comes back online, takeover happens perfectly.

Wouldn't most agree that the expected behavior would be for takeover to happen without a crash even when one of the nodes is still down or partitioned (that's kinda the whole point)?

I've attached a sample app. To see the trouble in action, start three shells and run the start script in each, like this: ./start.sh n1, ./start.sh n2, etc. Kill n1 and n2, then restart n1. You should see n1 crash and n3 stop the app.

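For those who don't want to open the attachment: the worker is nothing exotic. Here's a minimal sketch (illustrative names, not the attached code verbatim) of a gen_server registered via {global, ...}, which is where I suspect the crash comes from: if global still resolves the name somewhere, start_link returns {error, {already_started, Pid}} and the supervisor gives up.

    %% Minimal sketch of the worker (illustrative, not the attachment
    %% verbatim). Registering under {global, Name} is what produces
    %% {error, {already_started, Pid}} when the name still resolves.
    -module(myapp_server).
    -behaviour(gen_server).
    -export([start_link/0]).
    -export([init/1, handle_call/3, handle_cast/2]).

    start_link() ->
        %% globally registered so there's one instance cluster-wide
        gen_server:start_link({global, ?MODULE}, ?MODULE, [], []).

    init([]) ->
        {ok, #{}}.

    handle_call(_Request, _From, State) ->
        {reply, ok, State}.

    handle_cast(_Msg, State) ->
        {noreply, State}.
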
It's worth noting that heart would eventually restart the application on n1, and n3 does eventually rejoin without having to be restarted, but it would appear that the whole app would be down for some time. Bug? Feature? Bug that was once a feature?

I'm still holding out hope that I've misunderstood something or misconfigured something. Or is there something simple I can do to make this work the way one would expect? Someone please enlighten me.

Your Erlang comrade,
Steve