[erlang-questions] Distributed OTP Apps: Failover and Takeover

Garret Smith
Mon Feb 25 17:58:51 CET 2013


Hi Andreas,

This is due to the failover mechanism you are using, i.e. marking the app as
'distributed' in sys.config.
This makes the kernel start the dist_ac controller to manage your app.
Unfortunately, the dist_ac controller watches for nodedown, not for the app
dying.
One of the pages your writeup links to (
http://learnyousomeerlang.com/distributed-otp-applications) says as much if
you read it carefully:

"The only thing they do is wait for the node of the running application to
die. This means that when the node that runs the app dies, another node
starts running it instead."
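
For reference, the 'distributed' marking is a kernel parameter. Here is a
minimal sketch of what such a sys.config entry might look like. The node names
a@host and b@host and the timeout values are placeholders picked for
illustration, not taken from your setup:

```erlang
%% sys.config for node a@host (sketch; node names and timeouts are hypothetical).
[
 {kernel,
  [
   %% Prefer running overlord on a@host. If that node goes down,
   %% wait 5000 ms and then start the app on b@host.
   {distributed, [{overlord, 5000, ['a@host', 'b@host']}]},
   %% At boot, wait up to 30000 ms for the peer node before
   %% starting distributed applications.
   {sync_nodes_mandatory, ['b@host']},
   {sync_nodes_timeout, 30000}
  ]}
].
```

Note that nothing in this configuration refers to the health of the app
itself; the only trigger it describes is a node going down.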

The dist_ac controller is also susceptible to net splits.  If the network
connection between your two nodes goes down, each side sees the other as
down, so the app will be running on both nodes.  Once the network heals
again, dist_ac doesn't know what to do about it.  I see in your example both
nodes are on the same machine, so you aren't susceptible to this limitation.
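
To see the kind of signal this sort of failover depends on, here is a small
sketch that subscribes to node status messages with
net_kernel:monitor_nodes/1. It is not dist_ac's actual implementation, and the
module name node_watch is made up for illustration. A process exiting on the
other node does not by itself produce any message here; a lost node or a lost
connection does, which is also why both sides of a net split see the other as
down.

```erlang
%% node_watch.erl (hypothetical module, for illustration only).
-module(node_watch).
-export([start/0]).

%% Spawn a process that prints every nodeup/nodedown event it receives.
start() ->
    spawn(fun() ->
                  %% Subscribe the calling process to node status changes.
                  ok = net_kernel:monitor_nodes(true),
                  loop()
          end).

loop() ->
    receive
        {nodeup, Node} ->
            io:format("nodeup: ~p~n", [Node]),
            loop();
        {nodedown, Node} ->
            io:format("nodedown: ~p~n", [Node]),
            loop()
    end.
```

Run node_watch:start() on one distributed node, then stop the other node or
cut the connection between them to see the nodedown message arrive.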

I got bit by these problems a while back.  I eventually had to roll my own
distributed app controller based on gen_leader:
https://github.com/garret-smith/gen_leader_revival

-Garret Smith


On Mon, Feb 11, 2013 at 4:31 AM, Andreas Pauley wrote:

> Hi everyone,
>
> I've made a demo app of mine distributed to test failover and
> takeover, after reading the "Distributed OTP Applications" chapter in
> Learn you some Erlang.
>
> The failover and takeover works great if I kill the running beam (eg.
> with kill -9).
>
> However, I tried sending kill signals to the Pids of both my
> application behavior and the top supervisor that gets started by the
> application.
> This crashes the VM, but failover does not happen.
>
> Is this unsupported, or should I do something to enable failover in
> this scenario?
>
> I've done a more complete writeup with code and output here:
>
> https://github.com/apauley/dark-overlord#when-processes-die-a-guide-to-the-afterlife
>
> But in a nutshell, I crash the VM with the commands below, and then
> automatic failover to my second node does not happen:
>
> $ ./rel/overlord/bin/overlord console
> Erlang R15B03 (erts-5.9.3.1) [source] [64-bit] [smp:8:8] [async-threads:0] [hipe] [kernel-poll:false] [dtrace]
>
> 14:03:20.581  [overlord_app] <0.56.0> || Starting app: normal
> 14:03:20.582  [hypnosponge_sup] <0.57.0> || Hello from the hypnosponge supervisor
> ()1> Sup = pid(0, 57, 0).
> <0.57.0>
> ()2> exit(Sup, kill).
>
> =ERROR REPORT==== 11-Feb-2013::14:04:46 ===
> ** Generic server minion_supersup terminating
> ** Last message in was {'EXIT',<0.57.0>,killed}
> ** When Server state == {state,
>                             {local,minion_supersup},
>                             simple_one_for_one,
>                             [{child,undefined,minion_makeshift_sup,
>                                  {minion_makeshift_sup,start_link,[]},
>                                  temporary,5000,worker,
>                                  [minion_makeshift_sup]}],
>                             undefined,1,3,[],minion_supersup,[]}
> ** Reason for termination ==
> ** killed
> true
> ()3>
> =INFO REPORT==== 11-Feb-2013::14:04:46 ===
>     application: overlord
>     exited: killed
>     type: permanent
>
> ()3> {"Kernel pid terminated",application_controller,"{application_terminated,overlord,killed}"}
>
> Crash dump was written to: erl_crash.dump
> Kernel pid terminated (application_controller) ({application_terminated,overlord,killed})
>
> --
> http://pauley.org.za/
> http://twitter.com/apauley
> http://www.meetup.com/lambda-luminaries/