[erlang-questions] (noob-help) Supervision strategies to automatically restart dynamically added children

Edmond Begumisa <>
Mon Mar 7 00:38:09 CET 2011


Hi Dhananjay,

I too struggled with this exact question for quite some time so I'll chime  
in here on the two techniques I used to solve it...

On Thu, 03 Mar 2011 05:02:06 +1100, Dhananjay Nene  
<> wrote:

> While supervisors are meant to automatically restart failed processes,
> there is one scenario I am as yet unable figure out which is the
> idiomatic approach to implement crash recovery under the default OTP
> scenarios. I have considered a solution, but being a relative newbie,
> I am not sure if it is idiomatic erlang and if there are better
> solutions.
>
> Question in short : If I have a supervisor which has a number of
> dynamic children, how do I set up a mechanism where in case of a
> complete system crash, all the dynamic children restart at the point
> they were when the system (including the supervisor) crashed.
>
> Question in long :
> =============
>
> Sample Context : A bowling game
> -------------------------------------------------
>
> Lets say I am writing the software to implement the software necessary
> to track various games at a bowling alley. I've set up the following
> processes :
>
> a. Lanes : If there are 10 lanes, there are 10 processes, one for each
> lane. These stay fixed for the entire duration of the program
> b. Games : A group of players might get together to start a game on a
> free lane. A new game will get created to track the game through its
> completion. When the game is over, this process shall terminate
> c. Players : Each game has a number of players. One process
> "player_game" is started per player. Sample state of a player game
> would include current score for the player and if the last two rolls
> were strike or a spare. For the purpose of brevity, the remainder of
> this mail only refers to this process and ignores the others
>

You could reduce complexity by having each lane process maintain its
current game (players and scores) as part of its state. The game and
player_game processes appear unnecessarily confusing to me.

> Objective :
> ---------------
>
> Assuming this is a single node implementation, if the machine were to
> crash, upon machine / node restart, all the player_games should be
> restarted and should be at the point where the player_games were when
> the machine crashed.
>
> Possible supervision strategy :
> --------------------------------------
>
> 1. Create a simple_one_for_one supervisor player_game_sup which upon
> starting up for the first time would have no children associated with
> them. Use supervisor:start_child to start each process
> 2. The supervisor creates an entry in a database (say mnesia) every
> time it launches a new process
> 3. Each player_game updates the entry every time the score gets
> modified. Upon termination that entry gets deleted
> 4. Post crash, the supervisor is started again (say after an
> application restart or via another supervisor)
> 5. (Here's the difference). By default the supervisor will not restart
> the dynamically added children (all the player_games). However we
> modify the init code to inspect the database and launch a player_game
> for each record it finds.

How? I don't think you can instruct a simple_one_for_one supervisor to
create children from its init/1 callback. From the documentation...

http://www.erlang.org/doc/man/supervisor.html#Module:init-1

"...No child process is then started during the initialization phase, but  
all children are assumed to be started dynamically using  
supervisor:start_child/2..."

Even if you switched to one_for_one with no child specs, I don't think  
you'd be able to call supervisor:start_child/2 from init/1 of the same  
supervisor since this function is called before the supervisor has  
finished initialising itself and it's the actual supervisor process doing  
the calling. You're likely to wait forever.

AFAIK, creating dynamic children (calling supervisor:start_child/2) has to
be done after the supervisor has initialised, and by a process other than
the supervisor process.

This is normally not a problem if you are calling start_child/2 during the  
"normal" operation of the application because the supervisor in question  
is likely to already be up. But here, you want to call start_child/2 at  
*startup*. From my experience with this precise matter, this requires some  
process coordination.
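
To make the coordination point concrete, here is a minimal sketch. The
module name restore_demo, the restore/1 helper and the idle placeholder
worker are all made up for illustration; only the supervisor calls are
real OTP. The supervisor's init/1 just returns the child template, and a
different process calls restore/1 once start_link/0 has returned:

```erlang
-module(restore_demo).
-behaviour(supervisor).
-export([start_link/0, restore/1, init/1, start_worker/1]).

%% Start the supervisor under a local name so that other processes
%% can address it in supervisor:start_child/2.
start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Must be called by a process OTHER than the supervisor, and only
%% after start_link/0 has returned. Ids would come from wherever the
%% saved state lives (hypothetical here).
restore(Ids) ->
    [supervisor:start_child(?MODULE, [Id]) || Id <- Ids].

%% simple_one_for_one: one child template, no children started here.
init([]) ->
    {ok, {{simple_one_for_one, 1, 30},
          [{worker, {?MODULE, start_worker, []},
            transient, 2000, worker, [?MODULE]}]}}.

%% Placeholder worker: an idle linked process standing in for a
%% real gen_server such as player_game.
start_worker(Id) ->
    Pid = spawn_link(fun() -> idle(Id) end),
    {ok, Pid}.

idle(Id) ->
    receive
        stop -> ok;
        _ -> idle(Id)
    end.
```

From the shell, restore_demo:start_link() followed by
restore_demo:restore([1, 2, 3]) adds three children. Both strategies
below are ways of arranging for "some other process" to make that second
call at startup.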

> The player_game initialises itself to the
> current state as in the database and the game(s) can continue where
> it/they left off.
>
> My questions :
> --------------------
> a. Does it make sense to move the responsibility to the supervisor to
> update the database each time a new player game is started or
> completed ?

I personally don't see the advantage of doing this. Besides (as per my  
understanding of OTP design principles), a supervisor's job should be just  
that -- supervising workers and not doing work itself.

Doing this from your worker gen_servers makes more sense to me and
seems more natural, i.e. reading the scores from the DB during
player_game:init and writing them every time a score gets bumped, or
something similar.
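
As a rough sketch of what I mean (not your actual player_game: the
module name, the plain-file storage and the "score is just a running
total" simplification are assumptions made to keep it short), the worker
loads its own state in init/1 and saves it on every change:

```erlang
-module(player_game_sketch).
-behaviour(gen_server).
-export([start_link/1, add_pins/2, score/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

%% Path is the file holding this player's saved score (hypothetical).
start_link(Path) ->
    gen_server:start_link(?MODULE, Path, []).

add_pins(Pid, Pins) when is_integer(Pins), Pins >= 0 ->
    gen_server:call(Pid, {add_pins, Pins}).

score(Pid) ->
    gen_server:call(Pid, score).

%% On (re)start, resume from whatever was last saved.
init(Path) ->
    Score = case file:consult(Path) of
                {ok, [S]} when is_integer(S) -> S; % Resume saved score
                _ -> 0                             % Nothing usable: start fresh
            end,
    {ok, {Path, Score}}.

%% Persist before replying, so a crash after the reply loses nothing.
handle_call({add_pins, Pins}, _From, {Path, Score}) ->
    New = Score + Pins,
    ok = file:write_file(Path, io_lib:format("~p.~n", [New])),
    {reply, New, {Path, New}};
handle_call(score, _From, {_Path, Score} = State) ->
    {reply, Score, State}.

handle_cast(_Msg, State) -> {noreply, State}.
handle_info(_Info, State) -> {noreply, State}.
terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.
```

The supervisor never touches the storage; it only restarts the worker,
and the worker finds its own way back to where it was.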

> b. Is it an idiomatic way to implement crash recovery

There is none. It's very application specific as Jesper has indicated.

I've come across a couple of broad patterns, but the details of where to
put checkpoints can't be generalised. For instance, although you are
specifically asking about a single node, multi-node hot take-over with no
DB/persistence is another way. I was recently privy to a very interesting
discussion on that technique. You might want to check it out for a future
project...

http://thread.gmane.org/gmane.comp.lang.erlang.general/50258/focus=50269

> c. Are there any other perhaps superior ways of implementing this?
>

I don't know about superior, I just don't think your first suggestion will
actually work. I can offer 2 possibilities, each of which I've used...

Possible supervision strategy 2a: (Loader version)
--------------------------------------------------

Rather than separate dynamic children for players and games as in Strategy
1, each lane stores, as part of its state, info on the current
game (the players playing on the lane and their state/scores). The
supervision tree might look like this...

            alley_sup
           /         \
  lanes_ldr  ___lanes_sup_____
            /       |     :   \
         lane(1)  lane(2) .. lane(N)

* The application has a startup configuration parameter no_of_lanes, which
comes from a conf file or the .app file and is loaded by alley_sup...

=== bowling_app.app ===
{application, bowling_app,
  [{..
    {env,[{no_of_lanes,10}]},
    ..}]}.

=== alley_sup.erl ===
-behaviour(supervisor).
..
init([]) ->
     {ok, No_Of_Lanes} = application:get_env(no_of_lanes),
     {ok, {{one_for_one, 1, 30},
        [{lanes_sup,
             {lanes_sup, start, []},
              permanent,
              infinity,
              supervisor,
              [lanes_sup]},
         {lanes_ldr,
             {lanes_ldr, start, [No_Of_Lanes]},
              temporary, % Starts lanes_sup children then disappears
              6000,
              worker,
              [lanes_ldr]}]}}.

* lanes_sup is a simple_one_for_one supervisor of any number of lanes but
initially has none.
* Now here is the trick: lanes_ldr is a gen_server that is initialised with
No_Of_Lanes. Its job is to call supervisor:start_child No_Of_Lanes times
at startup then vanish...

=== lanes_ldr.erl ===
-behaviour(gen_server).
..
init(No_Of_Lanes) when No_Of_Lanes >= 1 ->
     case start_lanes(No_Of_Lanes, 0) of
         No_Of_Lanes ->
             io:format("All lanes failed to init -- quitting application.~n"),
             {stop, all_lanes_failed}; % Cause alley_sup to quit abnormally
         _ ->
             io:format("Lane loader exiting.~n"),
             ignore % One or more lanes init'ed; loader's work is done.
     end.

start_lanes(0, E) ->
     E; % Return no. of lanes that have failed to init
start_lanes(N, E) ->
     case supervisor:start_child(lanes_sup, [N]) of
         {ok, _} ->
             io:format("Started lane ~w.~n", [N]),
             start_lanes(N - 1, E);
         Err ->
             io:format("Error starting lane ~w: ~p.~n", [N, Err]),
             start_lanes(N - 1, E + 1)
     end.

%%% These are just placeholders to keep the compiler/dialyzer quiet

handle_call(void, _, void) ->
     {noreply, void}.

handle_cast(void, void) ->
     {noreply, void}.

handle_info(void, void) ->
     {noreply, void}.

terminate(_, _) ->
     ignore.

code_change(_, void, _) ->
     {ok, void}.
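
lanes_sup itself is referred to but not listed here. A minimal sketch of
what it could look like follows; the restart intensity (5 restarts in 60
seconds) and the 2000 ms shutdown are placeholder values I picked, not
recommendations. The two things that matter are that it registers locally
as lanes_sup (because start_lanes/2 addresses it by that name) and that
the child template ends up calling lane:start(Id):

```erlang
-module(lanes_sup).
-behaviour(supervisor).
-export([start/0, init/1]).

%% Matches the {lanes_sup, start, []} entry in alley_sup's child spec.
start() ->
    supervisor:start_link({local, lanes_sup}, ?MODULE, []).

%% No lanes are started here. Each supervisor:start_child(lanes_sup, [N])
%% appends [N] to the empty argument list below, i.e. calls lane:start(N).
init([]) ->
    {ok, {{simple_one_for_one, 5, 60},     % Assumed restart intensity
          [{lane,
              {lane, start, []},
               permanent,                  % Crashed lanes are restarted
               2000,                       % Assumed shutdown timeout (ms)
               worker,
               [lane]}]}}.
```

Because the lanes are permanent children, a lane that crashes later is
restarted by lanes_sup with the same Id, so its init/1 picks up the same
state file again without the loader being involved.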

* Whenever a lane is started by the sup, it loads the most recent game
from the DB, or just a simple text file (lane_1.game_state,
lane_2.game_state, etc -- not a big deal if a text file gets corrupted and
a game is lost so a DB might be overkill). Possibly something along the
lines of...

=== lane.erl ===
-behaviour(gen_server).
..
-record(player_state, {frame = 0, % NB: Removed player_name
                        shot = 1,
                        bonus_shot = false,
                        last_shot = normal,
                        prior_to_last_shot = normal,
                        max_pins = 10,
                        score = 0}).

start(Id) ->
     gen_server:start_link(?MODULE, Id, []).

init(Id) ->
     process_flag(trap_exit, true),
     Path = filename:join(code:priv_dir(bowling_app),
                          "lane_" ++ integer_to_list(Id) ++ ".game_state"),
     % Game State is a proplist of player_state records with players' name
     % as key:
     %    [{Player_Name1, #player_state{}}, {Player_Name2, #player_state{}}, .. ]
     % NB: file:write_file/2 returns ok on success, so each catch clause
     % below yields {ok, []} and matches the {ok, Game_State} pattern.
     {ok, Game_State} = try read_game_state(Path)
                        catch
                             _:{badmatch, {error, enoent}} -> % File not found
                                 {file:write_file(Path, "[]."), []};
                             _:Err ->                  % Discard bad state
                                 io:format("Zeroing corrupt game file ~s: ~p.~n",
                                             [Path, Err]),
                                 {file:write_file(Path, "[]."), []}
                        end,
     {ok, {Game_State, Path}}. % .. plus maybe some non-persisted state ..

%% Assert the happy-case for good game state when reloading it
read_game_state(Path) ->
     {ok, [Game_State]} = file:consult(Path),
     true = is_list(Game_State),
     lists:foreach(fun({Player_Name, Player_State}) ->
                     true = is_list(Player_Name),
                     true = is_record(Player_State, player_state),
                     % Maybe do some other checks
                     ok
                   end, Game_State),
     {ok, Game_State}.
..

NB: You'd probably use error_logger instead of all the io:formats.

* Now whenever the score gets bumped, or a new game starts, or a game
is concluded, the lane process writes the game state to your DB, or text
file. For the simple text file, you could just keep calling...

write_game_state(Path, Game_State) ->
     ok = file:write_file(Path, io_lib:format("~p.", [Game_State])).

Possible supervision strategy 2b: (Start Phase version)
-------------------------------------------------------

I was tipped-off by Ulf Wiger on this thread...

http://thread.gmane.org/gmane.comp.lang.erlang.general/48307/focus=48324

... that the initialisation/coordination done by lanes_ldr in 2a above is
precisely what the start phases feature of included applications is for!
This requires splitting the application into two, but could make things
more manageable for larger applications. So one could get rid of lanes_ldr
and modify 2a to get something like...

            alley_sup
                |
   bowling_app  |
- - - - - - - -|- - - - - - - -
   lanes_app    |
                |
        ___lanes_sup_____
       /       |     :   \
   lane(1)  lane(2) .. lane(N)

* Split everything into two apps: the primary bowling_app and the included  
lanes_app.
* The primary application would be pretty bare, and would start lanes_sup
as if it were one of its own modules...

=== bowling_app.app ===
{application, bowling_app,
  [..
   {mod, {application_starter,[bowling_app,[]]}},
   {included_applications, [lanes_app]},
   {start_phases, [{init,[]}, {go,[]}]}
   ..
  ]}.

=== bowling_app.erl ===
-behaviour(application).
..
%% Called on application:start
start(normal, StartArgs) ->
     alley_sup:start(StartArgs).

%% Called *after* entire sup tree is initialised
start_phase(init, normal, []) ->
     % If there's a DB, initialise it here
     ok;
start_phase(go, normal, []) ->
     ok.
..

=== alley_sup.erl ===
-behaviour(supervisor).
..
init([]) ->
     {ok, {{one_for_one, 1, 30},
        [{lanes_sup,
             {lanes_sup, start, []},
              permanent,
              infinity,
              supervisor,
              [lanes_sup]}]}}. % Mod of included app.

* Nothing else is needed in the primary app.
* The second application will be responsible for spawning the dynamic  
children on startup...

=== lanes_app.app ===
{application, lanes_app,
  [..
   {env,[{no_of_lanes,10}]},
   {mod,{lanes_app,[]}},
   {start_phases, [{init,[]}, {go,[]}]}
   ..
  ]}.

=== lanes_app.erl ===
-behaviour(application).
..
%% NOT called
start(normal, StartArgs) ->
     lanes_sup:start(StartArgs).

%% Called *after* entire sup tree is initialised
%% and corresponding bowling_app:start_phase
start_phase(init, normal, []) ->
     ok;
start_phase(go, normal, []) ->
     {ok, No_Of_Lanes} = application:get_env(?MODULE, no_of_lanes),
     true = No_Of_Lanes >= 1,
     case start_lanes(No_Of_Lanes, 0) of
         No_Of_Lanes ->
             io:format("All lanes failed to init -- quitting application.~n"),
             {error, all_lanes_failed}; % Cause app to quit abnormally
         _ ->
             ok % One or more lanes init'ed, continue.
     end.

start_lanes(0, E) ->
     E; % Return no. of lanes that have failed to init
start_lanes(N, E) ->
     case supervisor:start_child(lanes_sup, [N]) of
         {ok, _} ->
             io:format("Started lane ~w.~n", [N]),
             start_lanes(N - 1, E);
         Err ->
             io:format("Error starting lane ~w: ~p.~n", [N, Err]),
             start_lanes(N - 1, E + 1)
     end.

=== lanes_sup.erl ===
Same as in Strategy 2a

=== lane.erl ===
Same as in Strategy 2a

Strategy 2b is cleaner to me than Strategy 2a, even though it requires  
splitting an application into two which many people seem to have a problem  
with.

- Edmond -


> FWIW : the code I am using to learn erlang is at
> https://github.com/dnene/bowling . Its not particularly interesting at
> this stage since it is still under development.
>
> Thanks
> Dhananjay
>
> PS: Apologies for posting it to erlang-questions after earlier posting
> it to erlang programming google group. Those monitoring the latter
> will receive this question twice.
>

