[erlang-questions] Re: Distributed application takeover

Jeroen Koops
Thu Mar 31 14:36:33 CEST 2011

Hi Ulf,

Thanks for the explanation; the message flow is now pretty clear.

Actually, the reason I was looking into it was to see if I could come up
with a patch that makes it possible to configure an alternative takeover
behaviour, since I find it inconvenient that during a takeover the
application is running on two nodes at the same time for a while, which
leads to clashes with globally registered names, and problems with processes
that assume they have exclusive access to data. It would be nice to be able
to configure takeover behaviour in such a way that the application is only
started on the node that is taking over *after* it has been stopped on the
original node.

However, I've come to the conclusion that the whole application
distribution is way too complex to start poking around in, so instead I now
use a mechanism where, whenever an application is started with
{takeover, OtherNode} as StartType, it simply terminates all the children of
the application's top-level supervisor on OtherNode before proceeding with
the startup. This is a bit clunky, and I feel a bit sorry for the application
on OtherNode being left behind as an empty shell, with all its children
killed, waiting for the application controller to put it out of its misery,
but it gets the job done...
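
For illustration, a minimal sketch of that workaround could look roughly like
this. It is not the actual code: the module name takeover_sketch and the helper
stop_children_on/2 are hypothetical, the module doubles as its own (empty)
top-level supervisor only to keep the example self-contained, and it assumes
the top-level supervisor is registered locally under the same name on every
node.

```erlang
-module(takeover_sketch).
-behaviour(application).
-behaviour(supervisor).

-export([start/2, stop/1]).        % application callbacks
-export([init/1]).                 % supervisor callback
-export([stop_children_on/2]).     % hypothetical helper

%% On a takeover, empty the old instance's top-level supervisor on
%% OtherNode first, then start the local supervision tree as usual.
start({takeover, OtherNode}, _StartArgs) ->
    ok = stop_children_on(OtherNode, ?MODULE),
    start_sup();
start(_StartType, _StartArgs) ->
    start_sup().

stop(_State) ->
    ok.

%% Assumption: the supervisor is registered locally as ?MODULE on each node.
start_sup() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Placeholder supervisor with no children; a real one would list its
%% child specs here.
init([]) ->
    {ok, {{one_for_one, 5, 10}, []}}.

%% Terminate every child of the supervisor registered as SupName on Node.
%% The supervisor itself is left running, as an "empty shell".
stop_children_on(Node, SupName) ->
    SupRef = {SupName, Node},
    lists:foreach(
      fun({Id, _Child, _Type, _Modules}) ->
              _ = supervisor:terminate_child(SupRef, Id)
      end,
      supervisor:which_children(SupRef)),
    ok.
```

Note that supervisor:terminate_child/2 takes the child id here; for a
simple_one_for_one supervisor it would need the child's pid instead.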

On Wed, Mar 30, 2011 at 4:00 PM, Ulf Wiger wrote:

> Hi Jeroen,
> This is a bit tricky, and can be very confusing for someone trying to build
> their own cluster controller. :)
> The stopping of the application is done by the local application_controller
> when it is told that the application is running somewhere else. In other
> words, dist_ac kindly forwards the information to the local AC, and it is
> the local AC that in this particular case takes responsibility for 'knowing'
> that the local instance should be stopped.
> See application_controller:handle_application_started/3
> https://github.com/erlang/otp/blob/OTP_R14B02/lib/kernel/src/application_controller.erl#L913
> (The dirty deed actually happens on line 946).
> The reason I know this is that in the AXD 301, we ran multiple instances
> of some applications, distributed across several mated pairs - each instance
> having its own standby node. This can be done by writing your own distributed
> AC, but it has to be smart enough to know _when_ to forward the {started,
> Node} status to the local AC; if an instance was running locally - and _was
> supposed to_ do so (i.e. not involved in a takeover), the distributed AC had
> to suppress this message.
> BR,
> Ulf W
> On 30 Mar 2011, at 14:11, Jeroen Koops wrote:
> Hi,
> I'm trying to find out how the distributed application controller works
> internally. I'm especially interested in the implementation of an
> application takeover.
> In case an application runs on node A, and is taken over by node B, what
> should happen is that it is first started on node B, so that there are two
> instances of the application running simultaneously for a brief period of
> time, and then stopped on node A.
> However, I cannot figure out where this stopping happens in dist_ac.erl.
> If I understand correctly, this should happen in response to an
> ac_application_run message from the application_controller. This message
> is received by the dist_ac on node B, and a dist_ac_app_started message is
> then broadcast to the dist_acs on all connected nodes. The dist_ac of node
> A receives this message, notices that the application is still running
> locally, and decides to shut down the application on its own node -- at
> least that is what the comments say (dist_ac.erl, line 529):
> %% Another node tookover from me; stop my application
> %% and update the running list.
> But all I can see is that the dist_ac's list of applications is updated to
> indicate that the application is no longer running locally -- I cannot find
> where the application_controller is instructed to actually shut down the
> application.
> Can anyone point me in the right direction?
> Thanks,
> Jeroen
> Ulf Wiger, CTO, Erlang Solutions, Ltd.
> http://erlang-solutions.com