Using failover

Ulf Wiger etxuwig@REDACTED
Thu Feb 10 18:11:51 CET 2000

> Date: 10 Feb 2000 17:29:45 +0100
> From: Samuel Tardieu <sam@REDACTED>
> Due to major power trouble in my building, I built an application to
> monitor those power failures. Since this application needs to be
> fault tolerant, as its results are used by the technicians working on
> the power outages, I use the "distributed" kernel parameter.

From your later mail it appears as if you'd like to perform state
transfer as well. Here goes:

> It is not clear to me what I should do in takeover mode. My
> application is mainly a globally registered gen_server. How can I
> cleanly shut down the server running on the other node (the one I'm
> taking over) and make sure it is done before starting the application
> locally? Won't this create a race condition where the application
> won't be restarted if the top-priority node dies after stopping the
> application on the remote node and before registering the new process
> locally?

If you use global:re_register_name(Name, NewPid), the new instance of
your process will simply take over the name, and calls to the globally
registered server will be re-routed to the new instance.
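
To make the name takeover concrete, here is a minimal single-node sketch.
The module name rereg_demo and the name rereg_demo_name are made up for
illustration; the only calls used are global:register_name/2,
global:re_register_name/2, global:whereis_name/1 and
global:unregister_name/1.

```erlang
-module(rereg_demo).
-export([run/0]).

%% Registers one process under a global name, then lets a second
%% process take the name over with global:re_register_name/2.
run() ->
    Old = spawn(fun loop/0),
    New = spawn(fun loop/0),
    yes = global:register_name(rereg_demo_name, Old),
    Old = global:whereis_name(rereg_demo_name),
    %% Unlike register_name/2, this succeeds although the name is taken.
    yes = global:re_register_name(rereg_demo_name, New),
    New = global:whereis_name(rereg_demo_name),
    %% Clean up so the demo can be run again.
    global:unregister_name(rereg_demo_name),
    Old ! stop,
    New ! stop,
    ok.

%% Idle process that lives until told to stop.
loop() ->
    receive stop -> ok end.
```

Calling rereg_demo:run() should return ok if every pattern match above
holds, i.e. if the name really moved from the first pid to the second.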

The old application instance is shut down automatically when the 
new application instance is fully started.
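
For context, a takeover of this kind is driven by the "distributed"
kernel parameter mentioned in the question. The fragment below is only a
sketch of what such a configuration might look like; the node names and
both timeouts are hypothetical, not taken from the original setup.

```erlang
%% sys.config on pomonitor@host1 (sketch; node names and timeouts are
%% hypothetical). The node list is in priority order: when a node earlier
%% in the list comes up, it takes the application over from a later one.
[{kernel,
  [{distributed, [{pomonitor, 5000, ['pomonitor@host1', 'pomonitor@host2']}]},
   %% Nodes to wait for at boot; list the *other* nodes here.
   {sync_nodes_optional, ['pomonitor@host2']},
   {sync_nodes_timeout, 10000}]}].
```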

Here's a simple but relatively safe way of doing things:

1. First, add the following attribute to your app file
   (see erl -man application):

%% This activates a phased start of the application. Mod:start/2 is
%% always the first function to be called; then the functions in the
%% start_phases list are called in order. Syntax: [{Fun, Args}],
%% which leads to the call Mod:Fun(Type, Args), with Mod as specified in
%% the 'mod' attribute. Using this attribute, you may also get
%% Type = {failover, Node} (it's done this way for backward-compatibility
%% reasons).
{start_phases, [{go, []}]},

2. Modify pomonitor_app.erl to include a callback for the go/2 phase:

-module (pomonitor_app).

-behaviour (application).
-export ([start/2, go/2, stop/1]).

start (normal, _) ->
    pomonitor_sup:start_link ();

start ({failover, _}, _) ->
    pomonitor_sup:start_link ();

start ({takeover, _}, _) ->
    pomonitor_sup:start_link ().

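go({takeover, FromNode}, _) ->
    %% Take over the name and the state from the instance on FromNode.
    pomonitor:perform_takeover(FromNode);
go(_, _) ->   % Type = normal | {failover, FromNode}
    ok.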

stop (_) -> ok.
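
A note for readers on current OTP releases: the application documentation
there describes the phase callback as Mod:start_phase(Phase, StartType,
PhaseArgs) rather than a function named after the phase. Under that
assumption, the go/2 callback above might be written as follows. The
module name pomonitor_phases is hypothetical and exists only to keep the
sketch self-contained; in practice the clauses would live in the module
named by the 'mod' attribute.

```erlang
-module(pomonitor_phases).
-export([start_phase/3]).

%% Phase 'go' during a takeover: fetch name and state from FromNode.
%% pomonitor:perform_takeover/1 is the function described in step 3.
start_phase(go, {takeover, FromNode}, _PhaseArgs) ->
    pomonitor:perform_takeover(FromNode);
%% Phase 'go' for StartType = normal | {failover, Node}: nothing to do.
start_phase(go, _StartType, _PhaseArgs) ->
    ok.
```

The callback should return ok (or {error, Reason}) so that the
application controller can proceed to the next phase.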

3. Write a function to handle the takeover:

pomonitor.erl  (assuming this is a globally registered gen_server):

perform_takeover(FromNode) ->
   gen_server:call(pomonitor, {perform_takeover, FromNode}).


init(_) ->
   %% Need to check first before registering a global name.
   %% One way to do this is to use application:start_type() to find out
   %% whether the application is starting or whether this is a local
   %% process crash, but this is not entirely safe. We could be restarting
   %% from a process crash on the retiring side of a takeover, after having
   %% passed on our state, but before shutting down. If this is the case, we
   %% MUST NOT re-register. Here we use global:safe_whereis_name/1 (not
   %% whereis_name/1), because we must send a message to global, giving it
   %% a chance to unregister me if I just crashed and am restarting.
   case global:safe_whereis_name(pomonitor) of
      undefined ->
         %% this is most likely a local process restart
         global:re_register_name(pomonitor, self());
      _ ->
         %% There is another globally registered instance, most likely
         %% a takeover in progress. Wait for the takeover message.
         ok
   end,
   {ok, #state{}}.

handle_call({perform_takeover, FromNode}, _From, _State) ->
   %% Cute detail of takeover: I first re-register, stealing the name; then
   %% I ask for the state. Pending calls from clients will be serviced
   %% on the other side; new calls (after my re_register_name()) will be
   %% buffered by me until I have the new state; afterwards, all calls will
   %% be serviced by me. The old instance can most likely just sit there and
   %% wait for its application to terminate.
   global:re_register_name(pomonitor, self()),
   %% {pomonitor, FromNode} addresses the old instance by its locally
   %% registered name on FromNode.
   NewState = gen_server:call({pomonitor, FromNode}, takeover_state),
   {reply, ok, NewState};
handle_call(takeover_state, _From, State) ->
   {reply, State, State}.
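
The handover can be tried out on a single node with the sketch below. It
is a simplified stand-in, not the pomonitor code: the module name
takeover_demo, the name takeover_demo_srv and the put/get calls are
invented for the demo, the old instance is addressed by pid instead of
{Name, Node}, and global:whereis_name/1 is used in init/1 because the
crash-restart race described above does not arise here.

```erlang
-module(takeover_demo).
-behaviour(gen_server).
-export([run/0]).
-export([init/1, handle_call/3, handle_cast/2]).

-define(NAME, takeover_demo_srv).

%% Starts an old and a new instance, moves name and state to the new
%% one, and returns the value then served under the global name.
run() ->
    {ok, Old} = gen_server:start(?MODULE, [], []),
    ok = gen_server:call(Old, {put, 42}),
    {ok, New} = gen_server:start(?MODULE, [], []),
    %% New found the name taken in init/1, so Old still holds it.
    Old = global:whereis_name(?NAME),
    ok = gen_server:call(New, {perform_takeover, Old}),
    New = global:whereis_name(?NAME),
    Value = gen_server:call({global, ?NAME}, get),
    gen_server:stop(Old),
    gen_server:stop(New),
    Value.

init([]) ->
    %% Register only if nobody holds the name; otherwise wait for takeover.
    case global:whereis_name(?NAME) of
        undefined -> yes = global:re_register_name(?NAME, self());
        _Other -> ok
    end,
    {ok, undefined}.

handle_call({perform_takeover, OldPid}, _From, _State) ->
    %% Steal the name first, then fetch the state from the old instance.
    yes = global:re_register_name(?NAME, self()),
    NewState = gen_server:call(OldPid, takeover_state),
    {reply, ok, NewState};
handle_call(takeover_state, _From, State) ->
    {reply, State, State};
handle_call({put, Value}, _From, _State) ->
    {reply, ok, Value};
handle_call(get, _From, State) ->
    {reply, State, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

takeover_demo:run() should return the value stored in the old instance,
now served by the new one under the global name.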


Ulf Wiger, Chief Designer AXD 301         <ulf.wiger@REDACTED>
Ericsson Telecom AB                          tfn: +46  8 719 81 95
Varuvägen 9, Älvsjö                          mob: +46 70 519 81 95
S-126 25 Stockholm, Sweden                   fax: +46  8 719 43 44
