[erlang-questions] (noob-help) Supervision strategies to automatically restart dynamically added children

Edmond Begumisa <>
Tue Mar 8 03:16:42 CET 2011


PS:

The disadvantage with 2c is that if lane:init fails in one lane, the
entire application will fail to start, unlike 2a and 2b, which tolerate
this.

This is why I personally prefer using a loader process or start-phases.

- Edmond -

On Tue, 08 Mar 2011 12:58:05 +1100, Edmond Begumisa  
<> wrote:

> A third option...
>
> Strategy 2c
> ------------
>
> I've found 2a and 2b useful when you want to use a simple_one_for_one
> sup, but sometimes need to autostart some of its children at startup
> based on some persisted criteria, as per your specific question.
>
> But in the case of eliminating player_game and game processes and having  
> only lanes (which I used as an example in 2a and 2b): The lanes are  
> always a fixed number from startup, so you could use a one_for_one  
> lanes_sup with a child-spec list, and have that at the top-level  
> eliminating dynamic children altogether.
>
>        ____lanes_sup____
>       /       |     :   \
>   lane(1)  lane(2) ... lane(n)
>
>
> === lanes_sup.erl ===
> -behaviour(supervisor).
> ..
> init([]) ->
>       {ok, No_Of_Lanes} = application:get_env(no_of_lanes),
>       ChildSpecs = [{Id, {lane,
>                           start, [Id]}, % lane:start/1 takes the lane Id
>                           permanent,
>                           10000,
>                           worker,
>                           [lane]}
>                      || Id <- lists:seq(1,No_Of_Lanes)],
>       {ok, {{one_for_one, 1, 30}, ChildSpecs}}.
>
> === lane.erl ===
> Same as 2a
>
> Now the supervisor will start the children instead of you having to do  
> it via supervisor:start_child/2. No more need for a loader or start  
> phases.
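
On OTP 18 or later, the same child-spec list can also be written with maps instead of tuples. The following is only a minimal sketch: the module name lanes_sup_2c_sketch and the helper lane_specs/1 are hypothetical, and it assumes lane:start/1 as defined in 2a, so each spec passes its own Id.

```erlang
-module(lanes_sup_2c_sketch).
-behaviour(supervisor).
-export([start/0, init/1, lane_specs/1]).

%% Registered locally as lanes_sup, as in the other strategies.
start() ->
    supervisor:start_link({local, lanes_sup}, ?MODULE, []).

init([]) ->
    {ok, No_Of_Lanes} = application:get_env(no_of_lanes),
    {ok, {#{strategy => one_for_one, intensity => 1, period => 30},
          lane_specs(No_Of_Lanes)}}.

%% Hypothetical helper: one static child spec per lane, 1..No_Of_Lanes.
lane_specs(No_Of_Lanes) ->
    [#{id => Id,
       start => {lane, start, [Id]},   % assumes lane:start/1 from 2a
       restart => permanent,
       shutdown => 10000,
       type => worker,
       modules => [lane]}
     || Id <- lists:seq(1, No_Of_Lanes)].
```

Keeping the spec generation in its own function makes it possible to inspect the list without starting the supervisor.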
>
> - Edmond -
>
> On Mon, 07 Mar 2011 10:38:09 +1100, Edmond Begumisa  
> <> wrote:
>
>> Hi Dhananjay,
>>
>> I too struggled with this exact question for quite some time so I'll  
>> chime in here on the two techniques I used to solve it...
>>
>> On Thu, 03 Mar 2011 05:02:06 +1100, Dhananjay Nene  
>> <> wrote:
>>
>>> While supervisors are meant to automatically restart failed processes,
>>> there is one scenario where I am as yet unable to figure out the
>>> idiomatic approach to implementing crash recovery under the default OTP
>>> scenarios. I have considered a solution, but being a relative newbie,
>>> I am not sure if it is idiomatic erlang and if there are better
>>> solutions.
>>>
>>> Question in short : If I have a supervisor which has a number of
>>> dynamic children, how do I set up a mechanism where in case of a
>>> complete system crash, all the dynamic children restart at the point
>>> they were when the system (including the supervisor) crashed.
>>>
>>> Question in long :
>>> =============
>>>
>>> Sample Context : A bowling game
>>> -------------------------------------------------
>>>
>>> Let's say I am writing the software necessary to track various games
>>> at a bowling alley. I've set up the following
>>> processes :
>>>
>>> a. Lanes : If there are 10 lanes, there are 10 processes, one for each
>>> lane. These stay fixed for the entire duration of the program
>>> b. Games : A group of players might get together to start a game on a
>>> free lane. A new game will get created to track the game through its
>>> completion. When the game is over, this process shall terminate
>>> c. Players : Each game has a number of players. One process
>>> "player_game" is started per player. Sample state of a player game
>>> would include the current score for the player and whether the last two
>>> rolls were a strike or a spare. For the purpose of brevity, the remainder of
>>> this mail only refers to this process and ignores the others
>>>
>>
>> You could reduce complexity by having each lane process maintain its
>> current game (players and scores) as part of its state. The game and
>> player_game processes appear unnecessarily confusing to me.
>>
>>> Objective :
>>> ---------------
>>>
>>> Assuming this is a single node implementation, if the machine were to
>>> crash, upon machine / node restart, all the player_games should be
>>> restarted and should be at the point where the player_games were when
>>> the machine crashed.
>>>
>>> Possible supervision strategy :
>>> --------------------------------------
>>>
>>> 1. Create a simple_one_for_one supervisor player_game_sup which upon
>>> starting up for the first time would have no children associated with
>>> them. Use supervisor:start_child to start each process
>>> 2. The supervisor creates an entry in a database (say mnesia) every
>>> time it launches a new process
>>> 3. Each player_game updates the entry every time the score gets
>>> modified. Upon termination that entry gets deleted
>>> 4. Post crash, the supervisor is started again (say after an
>>> application restart or via another supervisor)
>>> 5. (Here's the difference). By default the supervisor will not restart
>>> the dynamically added children (all the player_games). However we
>>> modify the init code to inspect the database and launch a player_game
>>> for each record it finds.
>>
>> How? I don't think you can instruct a simple_one_for_one supervisor to
>> create children from its init/1 callback. From the documentation...
>>
>> http://www.erlang.org/doc/man/supervisor.html#Module:init-1
>>
>> "...No child process is then started during the initialization phase,  
>> but all children are assumed to be started dynamically using  
>> supervisor:start_child/2..."
>>
>> Even if you switched to one_for_one with no child specs, I don't think  
>> you'd be able to call supervisor:start_child/2 from init/1 of the same  
>> supervisor since this function is called before the supervisor has  
>> finished initialising itself and it's the actual supervisor process  
>> doing the calling. You're likely to wait forever.
>>
>> AFAIK, creating dynamic children (calling supervisor:start_child/2) has
>> to be done after the supervisor has initialised by a process other than  
>> the supervisor process.
>>
>> This is normally not a problem if you are calling start_child/2 during  
>> the "normal" operation of the application because the supervisor in  
>> question is likely to already be up. But here, you want to call  
>> start_child/2 at *startup*. From my experience with this precise  
>> matter, this requires some process coordination.
>>
>>> The player_game initialises itself to the
>>> current state as in the database and the game(s) can continue where
>>> it/they left off.
>>>
>>> My questions :
>>> --------------------
>>> a. Does it make sense to move the responsibility to the supervisor to
>>> update the database each time a new player game is started or
>>> completed ?
>>
>> I personally don't see the advantage of doing this. Besides (as per my  
>> understanding of OTP design principles), a supervisor's job should be  
>> just that -- supervising workers and not doing work itself.
>>
>> Doing this from your worker gen_servers makes more sense to me and
>> seems more natural, i.e. reading the scores from the DB during
>> player_game:init and writing them every time a score gets bumped, or
>> something similar.
>>
>>> b. Is it an idiomatic way to implement crash recovery
>>
>> There is none. It's very application specific as Jesper has indicated.
>>
>> I've come across a couple of wide patterns, but the details of where to  
>> put checkpoints can't be generalised. For instance; although you are  
>> specifically asking about a single node, multi-node hot take-over with  
>> no DB/persistence is another way. I was recently privy to a very  
>> interesting discussion on that technique. You might want to check it  
>> out for a future project...
>>
>> http://thread.gmane.org/gmane.comp.lang.erlang.general/50258/focus=50269
>>
>>> c. Are there any other perhaps superior ways of implementing this?
>>>
>>
>> I don't know about superior, I just don't think your first suggestion
>> will actually work. I can offer 2 possibilities, each of which I've
>> used...
>>
>> Possible supervision strategy 2a: (Loader version)
>> --------------------------------------------------
>>
>> Rather than separate dynamic children for players and games as in
>> Strategy 1, each lane stores, as part of its state, info on
>> the current game (the players playing on the lane and their
>> state/scores). The supervision tree might look like this...
>>
>>             alley_sup
>>            /         \
>>   lanes_ldr  ___lanes_sup_____
>>             /       |     :   \
>>          lane(1)  lane(2) .. lane(N)
>>
>> * The application has a startup configuration parameter no_of_lanes, which
>> comes from a conf file or the .app file and is loaded by alley_sup...
>>
>> === bowling_app.app ===
>> {application, bowling_app,
>>   [{..
>>     {env,[{no_of_lanes,10}]},
>>     ..}]}.
>>
>> === alley_sup.erl ===
>> -behaviour(supervisor).
>> ..
>> init([]) ->
>>      {ok, No_Of_Lanes} = application:get_env(no_of_lanes),
>>      {ok, {{one_for_one, 1, 30},
>>         [{lanes_sup,
>>              {lanes_sup, start, []},
>>               permanent,
>>               infinity,
>>               supervisor,
>>               [lanes_sup]},
>>          {lanes_ldr,
>>              {lanes_ldr, start, [No_Of_Lanes]},
>>               temporary, % Starts lanes_sup children then disappears
>>               6000,
>>               worker,
>>               [lanes_ldr]}]}}.
>>
>> * lanes_sup is a simple_one_for_one supervisor of any number of lanes
>> but initially has none.
>> * Now here is the trick: lanes_ldr is a gen_server that is initialised
>> with No_Of_Lanes. Its job is to call supervisor:start_child No_Of_Lanes
>> times at startup, then vanish...
>>
>> === lanes_ldr.erl ===
>> -behaviour(gen_server).
>> ..
>> init(No_Of_Lanes) when No_Of_Lanes >= 1 ->
>>      case start_lanes(No_Of_Lanes, 0) of
>>          No_Of_Lanes ->
>>              io:format("All lanes failed to init -- quitting application.~n"),
>>              {stop, all_lanes_failed}; % Cause alley_sup to quit abnormally
>>          _ ->
>>              io:format("Lane loader exiting.~n"),
>>              ignore % One or more lanes init'ed; loader's work is done.
>>      end.
>>
>> start_lanes(0, E) ->
>>      E; % Return no. of lanes that have failed to init
>> start_lanes(N, E) ->
>>      case supervisor:start_child(lanes_sup, [N]) of
>>          {ok, _} ->
>>              io:format("Started lane ~w.~n", [N]),
>>              start_lanes(N - 1, E);
>>          Err ->
>>              io:format("Error starting lane ~w: ~p.~n", [N, Err]),
>>              start_lanes(N - 1, E + 1)
>>      end.
>>
>> %%% These are just placeholders for compiler warnings/dialyzer
>>
>> handle_call(void, _, void) ->
>>      {noreply, void}.
>>
>> handle_cast(void, void) ->
>>      {noreply, void}.
>>
>> handle_info(void, void) ->
>>      {noreply, void}.
>>
>> terminate(_, _) ->
>>      ignore.
>>
>> code_change(_, void, _) ->
>>      {ok, void}.
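
The lanes_sup module itself is not shown in this thread. A minimal sketch of what it might look like, assuming it is registered locally as lanes_sup (so that supervisor:start_child(lanes_sup, [N]) can find it), that alley_sup starts it via lanes_sup:start(), and reusing the restart intensity and shutdown values seen elsewhere in the thread:

```erlang
-module(lanes_sup).
-behaviour(supervisor).
-export([start/0, init/1]).

%% Assumption: registered under the local name lanes_sup.
start() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% A simple_one_for_one supervisor has exactly one child template.
    %% supervisor:start_child(lanes_sup, [N]) appends [N] to the empty
    %% argument list below, so the child is started as lane:start(N).
    {ok, {{simple_one_for_one, 1, 30},
          [{lane,
            {lane, start, []},
            permanent,
            10000,
            worker,
            [lane]}]}}.
```

No lane is started by init/1 here; they only appear once the loader (or a start phase) calls supervisor:start_child/2.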
>>
>> * Whenever a lane is started by the sup, it loads the most recent game  
>>  from the DB, or just a simple text file (lane_1.game_state,  
>> lane_2.game_state, etc -- not a big deal if a text file gets corrupted  
>> and a game is lost so a DB might be overkill). Possibly something along  
>> the lines of...
>>
>> === lane.erl ===
>> -behaviour(gen_server).
>> ..
>> -record(player_state, {frame = 0, % NB: Removed player_name
>>                         shot = 1,
>>                         bonus_shot = false,
>>                         last_shot = normal,
>>                         prior_to_last_shot = normal,
>>                         max_pins = 10,
>>                         score = 0}).
>>
>> start(Id) ->
>>      gen_server:start_link(?MODULE, Id, []).
>>
>> init(Id) ->
>>      process_flag(trap_exit, true),
>>      Path = filename:join(code:priv_dir(bowling_app),
>>                           "lane_" ++ integer_to_list(Id) ++ ".game_state"),
>>      % Game State is a proplist of player_state records with the
>>      % player's name as key:
>>      %    [{Player_Name1, #player_state{}}, {Player_Name2, #player_state{}}, ..]
>>      {ok, Game_State} = try read_game_state(Path)
>>                         catch
>>                              _:{badmatch, {error, enoent}} -> % File not found
>>                                  {file:write_file(Path, "[]."), []};
>>                              _:Err ->                         % Discard bad state
>>                                  io:format("Zeroing corrupt game file ~s: ~p.~n",
>>                                            [Path, Err]),
>>                                  {file:write_file(Path, "[]."), []}
>>                         end,
>>      {ok, {Game_State, Path, ..maybe some non-persisted state..}}.
>>
>> %% Assert the happy-case for good game state when reloading it
>> read_game_state(Path) ->
>>      {ok, [Game_State]} = file:consult(Path),
>>      true = is_list(Game_State),
>>      lists:foreach(fun({Player_Name, Player_State}) ->
>>                      true = is_list(Player_Name),
>>                      true = is_record(Player_State, player_state),
>>                      % Maybe do some other checks
>>                      ok
>>                    end, Game_State),
>>      {ok, Game_State}.
>> ..
>>
>> NB: You'd probably use error_logger instead of all the io:formats.
>>
>> * Now whenever the score gets bumped, or a new game starts, or a
>> game is concluded, the lane process writes the game state to your DB
>> or text file. For the simple text file, you could just keep calling...
>>
>> write_game_state(Path, Game_State) ->
>>      ok = file:write_file(Path, io_lib:format("~p.", [Game_State])).
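
To illustrate the round trip, here is a small sketch that writes a game state with write_game_state/2 and reads it back with file:consult/1. The module name game_state_sketch, the demo/1 function and the sample players are made up for illustration; the record is copied from lane.erl above. It relies on records being plain tagged tuples, so "~p." output can be consulted back into the same term.

```erlang
-module(game_state_sketch).
-export([write_game_state/2, read_game_state/1, demo/1]).

-record(player_state, {frame = 0, shot = 1, bonus_shot = false,
                       last_shot = normal, prior_to_last_shot = normal,
                       max_pins = 10, score = 0}).

%% Same as in the thread: one Erlang term, terminated by a full stop.
write_game_state(Path, Game_State) ->
    ok = file:write_file(Path, io_lib:format("~p.", [Game_State])).

%% Reduced version of read_game_state/1 without the validity checks.
read_game_state(Path) ->
    {ok, [Game_State]} = file:consult(Path),
    {ok, Game_State}.

%% Hypothetical demo: write, read back, compare, clean up.
demo(Path) ->
    Game_State = [{"alice", #player_state{frame = 3, score = 42}},
                  {"bob",   #player_state{frame = 3, last_shot = strike}}],
    ok = write_game_state(Path, Game_State),
    {ok, Game_State} = read_game_state(Path),  % matches only if identical
    ok = file:delete(Path).
```

Calling game_state_sketch:demo("lane_demo.game_state") crashes with a badmatch if the state read back differs from the one written.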
>>
>> Possible supervision strategy 2b: (Start Phase version)
>> -------------------------------------------------------
>>
>> I was tipped-off by Ulf Wiger on this thread...
>>
>> http://thread.gmane.org/gmane.comp.lang.erlang.general/48307/focus=48324
>>
>> ... that the initialisation/coordination done by lanes_ldr in 2a above
>> is precisely what the start phases feature of included applications is
>> for! This requires splitting the application into two, but could
>> make things more manageable for larger applications. So one could get
>> rid of lanes_ldr and modify 2a to get something like...
>>
>>             alley_sup
>>                 |
>>    bowling_app  |
>> - - - - - - - -|- - - - - - - -
>>    lanes_app    |
>>                 |
>>         ___lanes_sup_____
>>        /       |     :   \
>>    lane(1)  lane(2) .. lane(N)
>>
>> * Split everything into two apps: the primary bowling_app and the  
>> included lanes_app.
>> * The primary application would be pretty bare, and would start
>> lanes_sup as if it were one of its own modules...
>>
>> === bowling_app.app ===
>> {application, bowling_app,
>>   [..
>>    {mod, {application_starter,[bowling_app,[]]}},
>>    {included_applications, [lanes_app]},
>>    {start_phases, [{init,[]}, {go,[]}]}
>>    ..
>>   ]}.
>>
>> === bowling_app.erl ===
>> -behaviour(application).
>> ..
>> %% Called on application:start
>> start(normal, StartArgs) ->
>>      alley_sup:start(StartArgs).
>>
>> %% Called *after* entire sup tree is initialised
>> start_phase(init, normal, []) ->
>>      % If there's a DB, initialise it here
>>      ok;
>> start_phase(go, normal, []) ->
>>      ok.
>> ..
>>
>> === alley_sup.erl ===
>> -behaviour(supervisor).
>> ..
>> init([]) ->
>>      {ok, {{one_for_one, 1, 30},
>>         [{lanes_sup,
>>              {lanes_sup, start, []},
>>               permanent,
>>               infinity,
>>               supervisor,
>>               [lanes_sup]}]}}. % Mod of included app.
>>
>> * Nothing else is needed in the primary app.
>> * The second application will be responsible for spawning the dynamic  
>> children on startup...
>>
>> === lanes_app.app ===
>> {application, lanes_app,
>>   [..
>>    {env,[{no_of_lanes,10}]},
>>    {mod,{lanes_app,[]}},
>>    {start_phases, [{init,[]}, {go,[]}]}
>>    ..
>>   ]}.
>>
>> === lanes_app.erl ===
>> -behaviour(application).
>> ..
>> %% NOT called
>> start(normal, StartArgs) ->
>>      lanes_sup:start(StartArgs).
>>
>> %% Called *after* entire sup tree is initialised
>> %% and corresponding bowling_app:start_phase
>> start_phase(init, normal, []) ->
>>      ok;
>> start_phase(go, normal, []) ->
>>      {ok, No_Of_Lanes} = application:get_env(?MODULE, no_of_lanes),
>>      true = No_Of_Lanes >= 1,
>>      case start_lanes(No_Of_Lanes, 0) of
>>          No_Of_Lanes ->
>>              io:format("All lanes failed to init -- quitting application.~n"),
>>              {error, all_lanes_failed}; % Cause app to quit abnormally
>>          _ ->
>>              ok % One or more lanes init'ed, continue.
>>      end.
>>
>> start_lanes(0, E) ->
>>      E; % Return no. of lanes that have failed to init
>> start_lanes(N, E) ->
>>      case supervisor:start_child(lanes_sup, [N]) of
>>          {ok, _} ->
>>              io:format("Started lane ~w.~n", [N]),
>>              start_lanes(N - 1, E);
>>          Err ->
>>              io:format("Error starting lane ~w: ~p.~n", [N, Err]),
>>              start_lanes(N - 1, E + 1)
>>      end.
>>
>> === lanes_sup.erl ===
>> Same as in Strategy 2a
>>
>> === lane.erl ===
>> Same as in Strategy 2a
>>
>> Strategy 2b is cleaner to me than Strategy 2a, even though it requires  
>> splitting an application into two which many people seem to have a  
>> problem with.
>>
>> - Edmond -
>>
>>
>>> FWIW : the code I am using to learn erlang is at
>>> https://github.com/dnene/bowling . It's not particularly interesting at
>>> this stage since it is still under development.
>>>
>>> Thanks
>>> Dhananjay
>>>
>>> PS: Apologies for posting it to erlang-questions after earlier posting
>>> it to erlang programming google group. Those monitoring the latter
>>> will receive this question twice.
>>>
>>
>>
>
>

