[erlang-questions] Understanding supervisor / start_link behaviour

Fri Jun 3 11:21:53 CEST 2011

True. This is a very valid point.

Personally I have very rarely used the live upgrade tools of a node
(relup/appup/release_handler etc) so I don't really know the bad side
of not putting everything under a supervision tree. But then again I
simply don't think the fuss of specifying every single thing to
reload/change is worth the "uptime" mark.

The strategy I prefer is to have an architecture that lets me take
down a node gracefully (detaching it from the cluster), manually
install a release (i.e. untar the release and change start_erl.data
to point to it), and start the node up again. This should not affect
the system, which should remain operational (say you have 10 nodes
and you upgrade them one by one). Should the new release not work, or
should something unexpected turn up, just change the start_erl.data
file to point to the old release and bounce the node (the version
handling of your applications should support this, meaning v1.32.424
in this release has *exactly* the same code as v1.32.424 in the
previous release).

This way of working has proven very successful for me (and for the
systems I took part in building). Specifying relups and appups for
this kind of work is, in my opinion, tedious, but some seem to think it
is worth the effort. However, you do raise a very important point to
consider when not hanging everything under a supervision tree. If I had
only 2 nodes to consider, maybe I'd want them up at all times, but then
again they would be built to cope with one of them going down (e.g.
when I upgrade them).
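
For comparison, here is a rough sketch of the fully supervised
alternative, where the event manager hangs under the supervisor as a
sibling of the gen_server instead of being linked from its init/1. The
module name sup_sketch is made up for illustration, and the sketch
assumes that child:init/1 no longer starts my_gen_event itself. The
restart intensity is simply copied from Steve's supervisor.

```erlang
-module(sup_sketch).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% The event manager is its own child and is started first.
    %% 'dynamic' is the modules value the supervisor documentation
    %% prescribes for a gen_event child.
    EventMgr = {my_gen_event,
                {gen_event, start_link, [{local, my_gen_event}]},
                permanent, 2000, worker, dynamic},
    %% Assumption: 'child' does not start the gen_event in its init/1.
    Child = {child, {child, start_link, []},
             permanent, 2000, worker, [child]},
    %% one_for_all: if either child dies, the supervisor stops the
    %% other one and then restarts both.
    {ok, {{one_for_all, 1000, 3600}, [EventMgr, Child]}}.
```

With one_for_all the two processes still live and die together, but
the supervisor shuts down the surviving child itself before restarting
anything, so the restart should not race against a name that is still
registered. It also keeps both processes visible to the release
handler, which is the point Fred raises below.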

2011/6/2 Frédéric Trottier-Hébert <fred.hebert@REDACTED>:
> There are disadvantages to *not* putting workers under the supervision tree, though. Namely, you'll be losing the ability to have the release handlers walk down the supervision trees to find which processes to suspend/update, and you'll then need to find a different way of doing things.
>
> This is a serious point to consider if you ever plan on going the way of releases/appups if the workers you use are to be long-lived (you don't want them to be killed during a purge). I'm not saying you didn't know this, but I felt I should point it out for the sake of having the arguments clear on the mailing list.
>
> --
> Fred Hébert
> http://www.erlang-solutions.com
>
>
> On 2011-06-02, at 05:53 AM, Mazen Harake wrote:
>
>> Steve,
>>
>> I wouldn't say that you are wrong. I think that your reasoning about
>> not putting the gen_event module under a supervisor is sound because
>> *that is what links are for*. Just because you have a supervisor
>> doesn't mean that you shove everything underneath it! If the
>> gen_server and the gen_event are truly linked (meaning: gen_server
>> doesn't act as a "supervisor" keeping track of its gen_event process
>> and restarts it all the time but rather that they really are linked
>> and they crash together) then your approach, in my opinion, is good.
>>
>> There are great benefits in doing it in that way. Many will claim that
>> it is best practice to put *everything* under a supervisor but this is
>> simply not true. In 90% of cases it *is* the best thing to do and many
>> times it is more about how you designed your application rather than
>> where to put the supervisors and their children but doing it the way
>> you did is not necessarily wrong.
>>
>> The only problem I see with your approach is that you have registered
>> the gen_event process which clearly isn't useful (since only the
>> gen_server should know about it, after all, it started it). Other than
>> that, this approach is extremely helpful and a nice way to clean up
>> things after they die/shutdown (Again: assuming truly linked).
>>
>> There is a big misconception in the community that everything
>> should/must look like the supervisor-tree model which shows how
>> gen_servers are put under supervisors and more supervisors under the
>> "top" supervisor but that is not enforced and the design principles
>> don't take into account the many cases where this setup actually brings
>> more headache to the table than just exiting and cleaning up using linked
>> processes (because they do exist).
>>
>> /M
>>
>> On 1 June 2011 21:26, Steve Strong <steve@REDACTED> wrote:
>>> Hi,
>>>
>>> I've got some strange behaviour with gen_event within a supervision tree
>>> which I don't fully understand.  Consider the following supervisor
>>> (completely standard, feel free to skip over):
>>> <snip>
>>> -module(sup).
>>> -behaviour(supervisor).
>>> -export([start_link/0, init/1]).
>>> -define(SERVER, ?MODULE).
>>> start_link() ->
>>>     supervisor:start_link({local, ?SERVER}, ?MODULE, []).
>>> init([]) ->
>>>     Child1 = {child, {child, start_link, []}, permanent, 2000, worker,
>>> [child]},
>>>     {ok, {{one_for_all, 1000, 3600}, [Child1]}}.
>>> </snip>
>>> and corresponding gen_server (interesting code in bold):
>>> <snip>
>>> -module(child).
>>> -behaviour(gen_server).
>>> -export([start_link/0, init/1, handle_call/3, handle_cast/2,
>>> handle_info/2, terminate/2, code_change/3]).
>>> start_link() ->
>>>     gen_server:start_link({local, child}, child, [], []).
>>> init([]) ->
>>>     io:format("about to start gen_event~n"),
>>>     X = gen_event:start_link({local, my_gen_event}),
>>>     io:format("gen_event started with ~p~n", [X]),
>>>     {ok, _Pid} = X,
>>>     {ok, {}, 2000}.
>>> handle_call(_Request, _From, State) ->
>>>     {reply, ok, State}.
>>> handle_cast(_Msg, State) ->
>>>     {noreply, State}.
>>> handle_info(_Info, State) ->
>>>     io:format("about to crash...~n"),
>>>     1 = 2,
>>>     {noreply, State}.
>>> terminate(_Reason, _State) ->
>>>     ok.
>>> code_change(_OldVsn, State, _Extra) ->
>>>     {ok, State}.
>>> </snip>
>>> If I run this from an erl shell like this:
>>> <snip>
>>> --> erl
>>> Erlang R14B01 (erts-5.8.2) [source] [64-bit] [smp:2:2] [rq:2]
>>> [async-threads:0] [hipe] [kernel-poll:false]
>>> Eshell V5.8.2  (abort with ^G)
>>> 1> application:start(sasl), supervisor:start_link(sup, []).
>>> </snip>
>>>
>>> Then the supervisor & server start as expected.  After 2 seconds the server
>>> gets a timeout message and crashes itself; the supervisor obviously spots
>>> this and restarts it.  Within the init of the gen_server, it also does a
>>> start_link on a gen_event process.  By my understanding, whenever the
>>> gen_server process exits, the gen_event will also be terminated.
>>> However, every now and then I see the following output (a ton of sasl trace
>>> omitted for clarity!):
>>> <snip>
>>> about to crash...
>>> about to start gen_event
>>> gen_event started with {error,{already_started,<0.79.0>}}
>>> about to start gen_event
>>> gen_event started with {error,{already_started,<0.79.0>}}
>>> about to start gen_event
>>> </snip>
>>> What is happening is that the gen_server is crashing but on its restart the
>>> gen_event process is still running - hence the gen_server fails in its init
>>> and gets restarted again.  Sometimes this loop clears after a few
>>> iterations, other times it can continue until the parent supervisor gives
>>> up, packs its bags and goes home.
>>> So, my question is whether this is expected behaviour or not.  I assume that
>>> the termination of the linked child is happening asynchronously, and that
>>> the supervisor is hence restarting its children before things have cleaned
>>> up correctly - is that correct?
>>> I can fix this particular scenario by trapping exits within the gen_server,
>>> and then calling gen_event:stop within the terminate.  Is this type of
>>> processing necessary whenever a process is start_link'ed within a supervisor
>>> tree, or is what I'm doing considered bad practice?
>>> Thanks for your time,
>>> Steve
>>> --
>>> Steve Strong, Director, id3as
>>> twitter.com/srstrong
>>>
>>>
>>> _______________________________________________
>>> erlang-questions mailing list
>>> erlang-questions@REDACTED
>>> http://erlang.org/mailman/listinfo/erlang-questions
>
>
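
The fix Steve describes in the original question (trapping exits in
the gen_server and calling gen_event:stop from terminate) could look
roughly like the sketch below. The module name child_sketch and the
event_manager_died exit reason are made up for illustration, and the
deliberate crash from the original example is left out. Note that
gen_event processes trap exits themselves, so an exit signal from the
linked gen_server is handled as an ordinary message, which may be part
of why the registered name can outlive the crash for a short while.

```erlang
-module(child_sketch).
-behaviour(gen_server).
-export([start_link/0, init/1, handle_call/3, handle_cast/2,
         handle_info/2, terminate/2, code_change/3]).

start_link() ->
    gen_server:start_link({local, child_sketch}, ?MODULE, [], []).

init([]) ->
    %% Trap exits so that terminate/2 also runs when the supervisor
    %% shuts this process down.
    process_flag(trap_exit, true),
    {ok, EventMgr} = gen_event:start_link({local, my_gen_event}),
    %% The state is just the pid of the linked event manager.
    {ok, EventMgr}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% With exits trapped, a dying event manager shows up as a message.
%% Stopping here keeps the "crash together" behaviour of the link.
handle_info({'EXIT', EventMgr, Reason}, EventMgr) ->
    {stop, {event_manager_died, Reason}, EventMgr};
handle_info(_Info, State) ->
    {noreply, State}.

terminate(_Reason, EventMgr) ->
    %% Stop the event manager before this process exits. The catch
    %% covers the case where the manager is already gone.
    catch gen_event:stop(EventMgr),
    ok.

code_change(_OldVsn, State, _Extra) ->
    {ok, State}.
```

The current gen_event documentation says that stop waits for the event
manager to terminate, so by the time the supervisor restarts the
gen_server the name should be free again. Mazen's suggestion of not
registering the event manager at all (gen_event:start_link/0 and
keeping the pid in the state) would sidestep the already_started clash
as well.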