[erlang-questions] (noob-help) Supervision strategies to automatically restart dynamically added children

Tue Mar 8 11:43:57 CET 2011

On Tue, 08 Mar 2011 17:52:29 +1100, Dhananjay Nene  
<dhananjay.nene@REDACTED> wrote:

> On Mon, Mar 7, 2011 at 5:08 AM, Edmond Begumisa
> <ebegumisa@REDACTED> wrote:
>> Hi Dhananjay,
>>
>> I too struggled with this exact question for quite some time so I'll chime
>> in here on the two techniques I used to solve it...
>> On Thu, 03 Mar 2011 05:02:06 +1100, Dhananjay Nene
>> <dhananjay.nene@REDACTED> wrote:
>>
>>>
>>> Question in short : If I have a supervisor which has a number of
>>> dynamic children, how do I set up a mechanism where in case of a
>>> complete system crash, all the dynamic children restart at the point
>>> they were when the system (including the supervisor) crashed.
>>>
>>> Question in long :
>>> =============
>>>
>>> Sample Context : A bowling game
>>> -------------------------------------------------
>>>
>>> Let's say I am writing the software necessary
>>> to track various games at a bowling alley. I've set up the following
>>> processes :
>>>
>>> a. Lanes : If there are 10 lanes, there are 10 processes, one for each
>>> lane. These stay fixed for the entire duration of the program
>>> b. Games : A group of players might get together to start a game on a
>>> free lane. A new game will get created to track the game through its
>>> completion. When the game is over, this process shall terminate
>>> c. Players : Each game has a number of players. One process
>>> "player_game" is started per player. Sample state of a player game
>>> would include current score for the player and if the last two rolls
>>> were strike or a spare. For the purpose of brevity, the remainder of
>>> this mail only refers to this process and ignores the others
>>>
>>
>> You could reduce complexity by having each lane process maintain its
>> current game (players and scores) as part of its state. The game and
>> player_game processes appear unnecessarily confusing to me.
>>
>
> Interesting point. The lanes are the only static aspects of the game.
> I tried to consider whether it would make any difference from a client
> API perspective, but I imagine for a client, there is no particular
> reason to believe a lane is a better or worse abstraction than a game
> (or a player_game).
>
>>> Objective :
>>> ---------------
>>>
>>> Assuming this is a single node implementation, if the machine were to
>>> crash, upon machine / node restart, all the player_games should be
>>> restarted and should be at the point where the player_games were when
>>> the machine crashed.
>>>
>>> Possible supervision strategy :
>>> --------------------------------------
>>>
>>> 1. Create a simple_one_for_one supervisor player_game_sup which upon
>>> starting up for the first time would have no children associated with
>>> them. Use supervisor:start_child to start each process
>>> 2. The supervisor creates an entry in a database (say mnesia) every
>>> time it launches a new process
>>> 3. Each player_game updates the entry every time the score gets
>>> modified. Upon termination that entry gets deleted
>>> 4. Post crash, the supervisor is started again (say after an
>>> application restart or via another supervisor)
>>> 5. (Here's the difference). By default the supervisor will not restart
>>> the dynamically added children (all the player_games). However we
>>> modify the init code to inspect the database and launch a player_game
>>> for each record it finds.
>>
>> How? I don't think you can instruct a simple_one_for_one supervisor to
>> create children from its init/1 callback. From the documentation...
>>
>> http://www.erlang.org/doc/man/supervisor.html#Module:init-1
>>
>> "...No child process is then started during the initialization phase, but
>> all children are assumed to be started dynamically using
>> supervisor:start_child/2..."
>
> Fair point. Wasn't something that struck me as an issue then, but yes,
> supervisor starting dynamic children inside init doesn't quite rock.
>
>> AFAIK, creating dynamic children (calling supervisor:start_child/2) has to
>> be done after the supervisor has initialised, by a process other than the
>> supervisor process.
>
> Certainly. And your separate modeling of a lane_ldr (later down this
> mail) helps that.
>
>> This is normally not a problem if you are calling start_child/2 during the
>> "normal" operation of the application because the supervisor in question is
>> likely to already be up. But here, you want to call start_child/2 at
>> *startup*. From my experience with this precise matter, this requires some
>> process coordination.
>>
>>> The player_game initialises itself to the
>>> current state as in the database and the game(s) can continue where
>>> it/they left off.
>>>
>>> My questions :
>>> --------------------
>>> a. Does it make sense to move the responsibility to the supervisor to
>>> update the database each time a new player game is started or
>>> completed ?
>>
>> I personally don't see the advantage of doing this. Besides (as per my
>> understanding of OTP design principles), a supervisor's job should be just
>> that -- supervising workers and not doing work itself.
>>
>> Doing this from your worker gen_servers makes more sense to me and seems
>> more natural, i.e. reading the scores from the DB during player_game:init
>> and writing them every time a score gets bumped or something similar.
>>
>
> I agree
>
>
>> Possible supervision strategy 2a: (Loader version)
>> --------------------------------------------------
>>
>> Rather than separate dynamic children for players and games as in Strategy
>> 1, instead, each lane stores, as part of its state, info on the current
>> game (the players playing on the lane and their state/scores). The
>> supervision tree might look like this...
>>
>>           alley_sup
>>          /         \
>>  lane_ldr  ___lanes_sup_____
>>           /       |     :   \
>>        lane(1)  lane(2) .. lane(N)
>>
>> * Application has a startup configuration parameter no_of_lanes which comes
>> from a conf file or the .app file and is loaded by the alley_sup...
>>
>
> This is a suggestion that's really had me thinking. I suspect there's a
> bit of the traditional OO modeling experience which is grumbling about
> not being able to model a game or a player game.

It's not that you can't model them, it's that you don't need to.

One mantra in Erlang literature (e.g. Casarini & Thompson, p. 110) is to
create a process for every concurrent *activity* you observe in the real
world and not every *task* you observe. So you don't necessarily need to
use a process for every "object" you see in the real world.

With this in mind, my immediate interpretation of your application was in  
two ways:

A)

* You have a bowling alley which has lanes.
* Different _lanes_ can be used *concurrently*: map these to processes.
* Only 1 player can use a _lane_ at a time: no need for player processes.
* Only 1 game can take place on a _lane_ at a time: no need for game  
processes.
* It follows that players and their game are just the state of each  
concurrently used _lane_.

So you only need processes for lanes.

Alternatively, B)

* You have a bowling alley where people play games.
* Several _games_ can be played *concurrently*: map these to processes.
* Only 1 player can make a _game_ play at a time: no need for player  
processes.
* Only 1 lane can be used per _game_: no need for lane processes.
* It follows that players and their lane are just the state of each  
concurrently played game.

So you only need processes for games.

A) *might* be easier to implement than B) when you have to interact with  
hardware that manages the lane machinery, which is why I suggested it. But  
either way, you only need *one* class of processes. IMO, introducing more  
just complicates matters unnecessarily.
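
To make A) a bit more concrete, here is a rough sketch of players and their
game living purely inside a lane's state. The module name lane_state and all
of its functions are hypothetical, made up for illustration; it just stores
raw rolls in a map and does no real bowling scoring.

```erlang
-module(lane_state).
-export([new/1, start_game/2, record_roll/3, end_game/1]).

%% A lane with no game in progress.
new(LaneNo) ->
    #{lane => LaneNo, game => none}.

%% Start a game on a free lane: each player begins with an empty roll list.
%% Crashes (function_clause) if a game is already in progress.
start_game(Lane = #{game := none}, Players) ->
    Rolls = maps:from_list([{P, []} || P <- Players]),
    Lane#{game => Rolls}.

%% Append the pins knocked down by Player's latest roll.
%% Crashes if there is no game or Player is not part of it.
record_roll(Lane = #{game := Rolls}, Player, Pins) when is_map(Rolls) ->
    Old = maps:get(Player, Rolls),
    Lane#{game => Rolls#{Player := Old ++ [Pins]}}.

%% The game is over: the lane is free again.
end_game(Lane) ->
    Lane#{game => none}.
```

A lane process would simply carry this term around as its state, so there is
nothing left for separate game or player processes to do.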

> I guess that's a
> matter of learning / unlearning / getting used to.
>

Modeling your app using processes is indeed very different from modeling
using OO objects and takes some getting used to. It helps if you've done
some multi-threaded server development before.

>> * lanes_sup is a simple_one_for_one supervisor of any number of lanes but
>> initially has none.
>> * Now here is the trick: lane_ldr is a gen_server initialised with
>> No_Of_Lanes. Its job is to call supervisor:start_child No_Of_Lanes times at
>> startup then vanish...
>
> Cool.
>
>> * Whenever a lane is started by the sup, it loads the most recent game from
>> the DB, or just a simple text file (lane_1.game_state, lane_2.game_state,
>> etc -- not a big deal if a text file gets corrupted and a game is lost so a
>> DB might be overkill).
>> * Now whenever the score gets bumped, or a new game starts, or a game is
>> concluded, the lane process writes the game state to your DB, or text file.
>> For the simple text file, you could just keep calling...
>>
>> write_game_state(Path, Game_State) ->
>>    ok = file:write_file(Path, io_lib:format("~p.", [Game_State])).
>
> yes, that was one of the options I had in mind
>
>> Possible supervision strategy 2b: (Start Phase version)
>> -------------------------------------------------------
>>
>> I was tipped-off by Ulf Wiger on this thread...
>>
>> http://thread.gmane.org/gmane.comp.lang.erlang.general/48307/focus=48324
>>
>> ... that the initialisation/coordination done by lane_ldr in 2a above is
>> precisely what the start phases feature of included applications is for!

**
>> This requires splitting the application into two,

** Sorry, that statement is actually false! See below.

>> but could make things
>> more manageable for larger applications. So one could get rid of lane_ldr
>> and modify 2a to get something like...
>>
>>           alley_sup
>>               |
>>  bowling_app  |
>> - - - - - - - -|- - - - - - - -
>>  lanes_app    |
>>               |
>>       ___lanes_sup_____
>>      /       |     :   \
>>  lane(1)  lane(2) .. lane(N)
>>
>> * Split everything into two apps: the primary bowling_app and the included
>> lanes_app.
>> * The primary application would be pretty bare, and would start lanes_sup as
>> if it were one of its own modules...
>
> Again a very interesting suggestion. Thanks. I'll certainly look into
> it (too hard to comment on it yet, since I'm still grokk'ing it).
>

Actually, I just realised that I was wrong. You don't *need* to use  
included applications to make use of start phases. The documentation  
groups start-phases under included applications so it's easy to get that  
impression.

So you don't actually need to split the application into two as I had
erroneously stated. Instead, you could simplify 2a and just have one
application bowling_app with lanes_sup as its top-level supervisor...

         ___lanes_sup_____
        /       |     :   \
    lane(1)  lane(2) .. lane(N)

=== bowling_app.app ===
{application, bowling_app,
  [..
   {env,[{no_of_lanes,10}]},
   {mod, {application_starter,[bowling_app,[]]}},
   {start_phases, [{init,[]}, {go,[]}]}
    % Removed included_applications
   ..]}.

=== bowling_app.erl ===
-module(bowling_app).
-behaviour(application).
-export([start/2, stop/1, start_phase/3]).

%% Called on application:start
start(normal, StartArgs) ->
     lanes_sup:start(StartArgs).

stop(_) ->
     ok.

%% Called *after* entire sup tree is initialised
start_phase(init, normal, []) ->
     % If there's a DB, initialise it here
     ok;
start_phase(go, normal, []) ->
     {ok, No_Of_Lanes} = application:get_env(?MODULE, no_of_lanes),
     true = No_Of_Lanes >= 1,
     case start_lanes(No_Of_Lanes, 0) of
         No_Of_Lanes ->
             io:format("All lanes failed to init -- quitting application.~n"),
             {error, all_lanes_failed}; % Cause app to quit abnormally
         _ ->
             ok % One or more lanes init'ed, continue.
     end.

start_lanes(0, E) ->
     E; % Return no. of lanes that have failed to init
start_lanes(N, E) ->
     case supervisor:start_child(lanes_sup, [N]) of
         {ok, _} ->
             io:format("Started lane ~w.~n", [N]),
             start_lanes(N - 1, E);
         Err ->
             io:format("Error starting lane ~w: ~p.~n", [N, Err]),
             start_lanes(N - 1, E + 1)
     end.

=== lanes_sup.erl ===
-module(lanes_sup).
-behaviour(supervisor).
..
start(StartArgs) ->
     supervisor:start_link({local, ?MODULE}, ?MODULE, StartArgs).

init([]) ->
     {ok, {{simple_one_for_one, 1, 30},
             [{lane,
                 {lane, start, []},
                  permanent,
                  10000,
                  worker,
                  [lane]}]}}.

=== lane.erl ===
Same as before
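
For reference, a minimal sketch of what such a lane module might look like,
assuming the simple text file approach (lane_N.game_state) rather than a DB.
The bump_score/3 API and the "score = running total of pins" rule are
placeholders I made up for illustration, not real bowling scoring. I also
write "~p.~n" (with a trailing newline) so the file reads back cleanly with
file:consult/1.

```erlang
-module(lane).
-behaviour(gen_server).
-export([start/1, bump_score/3]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Matches {lane, start, []} in lanes_sup:
%% supervisor:start_child(lanes_sup, [N]) ends up calling lane:start(N).
start(N) ->
    gen_server:start_link(?MODULE, N, []).

%% Hypothetical client API: add Pins to Player's running total.
bump_score(Pid, Player, Pins) ->
    gen_server:call(Pid, {bump, Player, Pins}).

%% On (re)start, reload the last saved game if there is one.
init(N) ->
    Path = "lane_" ++ integer_to_list(N) ++ ".game_state",
    Game = case file:consult(Path) of
               {ok, [Saved]} -> Saved;
               _ -> []          % no file or unreadable file: empty game
           end,
    {ok, {Path, Game}}.

%% Every score change is written to disk before replying.
handle_call({bump, Player, Pins}, _From, {Path, Game}) ->
    Old = proplists:get_value(Player, Game, 0),
    NewGame = lists:keystore(Player, 1, Game, {Player, Old + Pins}),
    ok = file:write_file(Path, io_lib:format("~p.~n", [NewGame])),
    {reply, ok, {Path, NewGame}}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

If a lane crashes and lanes_sup restarts it, init/1 runs again with the same
N and picks up from whatever was last written.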

There. That's better!

- Edmond -

> Once again, thanks a ton for this and the subsequent mails. They've
> certainly helped me think more, and think much harder :)
>
> Dhananjay
