[erlang-questions] failover pattern

Fri Apr 13 09:47:42 CEST 2007

One possible complication is that the application_controller
supports only two types of distributed application:
- applications running on every node
- applications running on only one node each

If you have multiple instances of an application which 
should failover independently within their processor
pairs, you will need to roll your own version of 
dist_ac.
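
For reference, the stock dist_ac failover is configured through the
kernel application's `distributed` parameter. The fragment below is a
minimal sketch of that configuration; the application name myapp and
the node names are hypothetical. Note that it describes one placement
per application, which is why independently failing-over instances
of the same application do not fit.

```erlang
%% sys.config for node a@host1 (sketch; myapp and node names are hypothetical).
[{kernel,
  [%% myapp normally runs on a@host1. If that node goes down, wait
   %% 5000 ms for it to return, then restart myapp on b@host2.
   {distributed, [{myapp, 5000, ['a@host1', 'b@host2']}]},
   %% Nodes that must be connected before distributed applications
   %% are started; on b@host2 this would list a@host1 instead.
   {sync_nodes_mandatory, ['b@host2']},
   {sync_nodes_timeout, 30000}]}].
```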

This is not done over a weekend, but it can be done.
Martin Björklund once wrote a module called gen_dac,
which we adapted to AXD 301, to handle (mainly)
1+1 redundancy where we have an instance of certain
applications running on each pair.

The application_controller has a message passing 
interface, and you can plug in your own controller
module. You might want to look at dist_ac.erl to see
how it's done(*). However, I would propose that you
not write your module quite like dist_ac anyway.

(*) Assuming that this is in fact what you want to do,
but for the sake of others out there, I'll develop the
line of thought either way.

One problem with dist_ac, and for that matter with our
version of gen_dac, is that it handles all instances
of state machines for all applications in one process.
This makes it very difficult to understand and debug 
the code.

(This is given brief mention in chapter 4 of 
Thomas Arts' paper on trace analysis of Erlang
programs from the 2002 SIGPLAN Erlang Workshop:
http://www.ituniv.se/~arts/papers/ews2002-2.pdf)

I once started making a cluster controller behaviour,
but didn't quite finish it. The main idea was to use
gen_leader to maintain and distribute a global dictionary
(in fact, this was one of the projects that led to 
the development of gen_leader, and I think that the
gdict example included with gen_leader might be a 
good starting point). Then, on each node,
implement a process that figures out what changes
to make locally, and then keep a separate process 
for the state machine of each controlled application.
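
The per-node step of figuring out what to change locally could be
sketched as a pure function. This is only an illustration under my
own assumptions: the module name dac_plan_sketch, the function plan/3
and the shape of the dictionary (a list of {App, Node} pairs) are
hypothetical and not taken from gen_leader or gdict.

```erlang
-module(dac_plan_sketch).
-export([plan/3]).

%% Placement is the (hypothetical) global dictionary: a list of
%% {App, Node} pairs saying where each application instance should run.
%% Running is the list of instances currently active on Node.
%% Returns the local actions needed to match the global state.
-spec plan(node(), [{atom(), node()}], [atom()]) ->
          [{start | stop, atom()}].
plan(Node, Placement, Running) ->
    Wanted = [App || {App, N} <- Placement, N =:= Node],
    Stops  = [{stop, App}  || App <- Running,
                              not lists:member(App, Wanted)],
    Starts = [{start, App} || App <- Wanted,
                              not lists:member(App, Running)],
    Stops ++ Starts.
```

Each resulting action would then be handed to the controller process
of the application in question, so that the per-application state
machines stay isolated from each other.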

I've attached an implementation of the controller for
a single application instance. The state machine is
not that tricky, but does contain a few snags, which
become rather obvious when a single instance is 
isolated.

Here's some code that illustrates how to get 
the controller processes going from the 
main controller (which would be, e.g., a gen_leader
callback, where the leader instance controls 
each reconfiguration and distributes the 
global state to each leader candidate):

take_control(Apps) ->
    Callbacks = callbacks(),
    [control(A, Callbacks) || A <- Apps].

control(AppName, Callbacks) when is_atom(AppName) ->
    {ok, Controller} =
        dac_app_ctrl:start_link(AppName, Callbacks),
    {AppName, Controller, dac_app_ctrl}.

callbacks() ->
    Me = self(),
    [{info_started,
      fun(Name) ->
              gen_server:cast(Me, {info_started, Name})
      end},
     {info_stopped,
      fun(Name) ->
              gen_server:cast(Me, {info_stopped, Name})
      end}].
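
Since callbacks() captures self(), take_control/1 has to be called
from within the main controller process, which then receives the
casts. The receiving side could look roughly like the sketch below.
It uses a plain gen_server rather than gen_leader to stay
self-contained, and the module name dac_main_sketch, the status/2
helper and the map-based state are my own assumptions.

```erlang
-module(dac_main_sketch).
-behaviour(gen_server).

-export([start_link/0, status/2]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Returns started | stopped | unknown for an application name.
status(Server, Name) ->
    gen_server:call(Server, {status, Name}).

%% The state is a map from application name to its last reported status.
init([]) ->
    {ok, #{}}.

handle_call({status, Name}, _From, Status) ->
    {reply, maps:get(Name, Status, unknown), Status}.

%% These are the messages sent by the funs returned from callbacks().
handle_cast({info_started, Name}, Status) ->
    {noreply, Status#{Name => started}};
handle_cast({info_stopped, Name}, Status) ->
    {noreply, Status#{Name => stopped}};
handle_cast(_Other, Status) ->
    {noreply, Status}.
```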

One additional thought was that this model ought to
work reasonably well even for instances that are not
regular OTP applications. You can start a controller
process for each such instance, and instruct it in
the same manner. This is just a thought - I haven't
prototyped it.

One of the reasons why this was left unfinished was
that I wanted to make a plugin-compatible version
for AXD 301 that would also provide some added
value (such as an adaptive N+k scheme). However,
there is legacy stuff in the AXD 301 cluster 
controller that makes sense for the AXD, given the
way it was developed, but which I couldn't easily
generalize. Besides, the saying "if it ain't broke, 
don't fix it" does hold some merit... (:

Gen_leader can be found in jungerl.
http://jungerl.cvs.sourceforge.net/jungerl/jungerl/lib/gen_leader/

BR,
Ulf W

> -----Original Message-----
> From: erlang-questions-bounces@REDACTED 
> [mailto:erlang-questions-bounces@REDACTED] On Behalf Of 
> Garry Hodgson
> Sent: den 12 april 2007 20:43
> To: Erlang
> Subject: [erlang-questions] failover pattern
> 
> i'm looking at putting together a system that will include 
> a/b pairs of machines for each major role, to allow for 
> failover between pairs.  starting to think about doing that 
> in erlang, i can see that all the parts i need are present, 
> but i'm not sure if there are standard ways or best practices 
> to putting this together.
> 
> i expect that each pair of machines would have one of them 
> globally registered with the role they play, and that they'd 
> each link to the other so that they could assume that role as needed.
> maybe something like the "negotiation techniques" in the 
> (old) erlang book.  but i get kind of mired down in the 
> details.  i'm also not sure how this would interact with the 
> supervisors/applications notions in OTP.
> 
> can anyone provide some insight, maybe pointers to papers, 
> tutorials, or code examples?
> 
> thanks
> 
> 
> ----
> Garry Hodgson, Senior Software Geek, AT&T CSO
> 
> nobody can do everything, but everybody can do something.
> do something.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dac_app_ctrl.erl
Type: application/octet-stream
Size: 5432 bytes
Desc: dac_app_ctrl.erl
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20070413/bf30b25c/attachment.obj>