[erlang-questions] Accessing sibling processes in a supervisor.

Fred Hebert <>
Fri Dec 21 15:04:46 CET 2012


Ah, I understand the error. It's a very simple mistake. The supervision
structure looks a bit like this:

         supersup
            |
         pool_sup
         |      \
        serv   worker_sup
                   |
                 workers

Here, 'serv' is started as a permanent process, and it starts
'worker_sup' as another permanent process. Whenever one of them dies,
both get killed, and the supervisor then restarts *both* of them.
Because 'worker_sup' is permanent, 'pool_sup' brings it back on its
own, so the next time around, when the gen_server tries to start the
child itself, it gets {error, {already_started, Pid}} and the match on
{ok, Pid} fails. A very minor fix can handle the issue. Instead of
just doing {ok, Pid} = supervisor:start_child(...), the following can
be done:

{ok, Pid} = case supervisor:start_child(...) of
    {ok, NewPid} -> {ok, NewPid};
    {error, {already_started, OldPid}} -> {ok, OldPid}
end
...

Note that the error that currently exists is not that much of a big deal
since as soon as the restart intensity is reached, things get cleaned
up and the next restart is done on a blank slate.
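
For reference, the restart intensity is the MaxRestart/MaxTime pair
declared in the supervisor's init/1. A minimal sketch of pool_sup's
init, with values assumed to be similar to the book's ppool code:

init({Name, Limit, MFA}) ->
    MaxRestart = 1,   % tolerate at most 1 restart...
    MaxTime = 3600,   % ...per 3600 seconds before giving up
    {ok, {{one_for_all, MaxRestart, MaxTime},
          [{serv,
            {ppool_serv, start_link, [Name, Limit, self(), MFA]},
            permanent,
            5000,     % shutdown timeout, in milliseconds
            worker,
            [ppool_serv]}]}}.

Once the intensity is exceeded, pool_sup itself dies, supersup
restarts it, and the whole subtree comes back on that blank slate.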

An attempt could instead have been made at making the worker
supervisor a temporary process. In that case it would never be
restarted automatically and we'd get the behavior we want, but we'd
need to link it to the server so that they both die together whenever
either of them crashes. Such a fix would be to change the worker_sup
restart type to 'temporary' in its child spec and use something as
follows:

{ok, Pid} = supervisor:start_child(...),
link(Pid),
...
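
Spelled out a bit more, assuming a ?SPEC macro like the one already in
ppool_serv (so take this as a sketch rather than exact book code), the
whole thing could look like:

-define(SPEC(MFA),
        {worker_sup,
         {ppool_worker_sup, start_link, [MFA]},
         temporary,  % never restarted by pool_sup on its own
         10000,      % shutdown timeout, in milliseconds
         supervisor,
         [ppool_worker_sup]}).

handle_info({start_worker_supervisor, Sup, MFA}, S = #state{}) ->
    {ok, Pid} = supervisor:start_child(Sup, ?SPEC(MFA)),
    %% Link so that if worker_sup dies, the server dies with it, and
    %% pool_sup's one_for_all strategy restarts the pair from scratch.
    link(Pid),
    {noreply, S#state{sup=Pid}};

Since worker_sup is temporary, its child spec is deleted when it
terminates, so a restarted server's call to start_child can succeed
again instead of returning {error, {already_started, _}}.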

I'll take a better look at it later today; I've got to run for now.

Regards,
Fred.

On 12/21, Karolis Petrauskas wrote:
> Thank you, Fred, for a great book!
> 
> I changed your example (ppool-1.0/src/ppool_serv.erl) a bit to
> illustrate my concern. I just added a function die/1, that does
> nothing apart from stopping the gen_server with reason != normal. Below
> is a diff showing my changes:
> 
> learn-you-some-erlang$ git diff
> diff --git a/ppool-1.0/src/ppool_serv.erl b/ppool-1.0/src/ppool_serv.erl
> index bf901dc..ff11dfb 100644
> --- a/ppool-1.0/src/ppool_serv.erl
> +++ b/ppool-1.0/src/ppool_serv.erl
> @@ -1,6 +1,6 @@
>  -module(ppool_serv).
>  -behaviour(gen_server).
> --export([start/4, start_link/4, run/2, sync_queue/2, async_queue/2, stop/1]).
> +-export([start/4, start_link/4, run/2, sync_queue/2, async_queue/2, stop/1, die/1]).
>  -export([init/1, handle_call/3, handle_cast/2, handle_info/2,
>           code_change/3, terminate/2]).
> 
> @@ -36,6 +36,9 @@ async_queue(Name, Args) ->
>  stop(Name) ->
>      gen_server:call(Name, stop).
> 
> +die(Name) ->
> +    gen_server:call(Name, die).
> +
>  %% Gen server
>  init({Limit, MFA, Sup}) ->
>      %% We need to find the Pid of the worker supervisor from here,
> @@ -59,6 +62,8 @@ handle_call({sync, Args},  From, S = #state{queue=Q}) ->
> 
>  handle_call(stop, _From, State) ->
>      {stop, normal, ok, State};
> +handle_call(die, _From, State) ->
> +    {stop, error, dying, State};
>  handle_call(_Msg, _From, State) ->
>      {noreply, State}.
> 
> I have compiled it and then called the following functions:
> 
>     application:start(ppool).
>     ppool:start_pool(nagger, 2, {ppool_nagger, start_link, []}).
>     ppool_serv:die(nagger).
> 
> 
> and got the following errors:
> 
> =ERROR REPORT==== 21-Dec-2012::11:39:44 ===
> ** Generic server nagger terminating
> ** Last message in was die
> ** When Server state == {state,2,<0.43.0>,{0,nil},{[],[]}}
> ** Reason for termination ==
> ** error
> 
> =ERROR REPORT==== 21-Dec-2012::11:39:44 ===
> ** Generic server nagger terminating
> ** Last message in was {start_worker_supervisor,<0.41.0>,
>                            {ppool_nagger,start_link,[]}}
> ** When Server state == {state,2,undefined,{0,nil},{[],[]}}
> ** Reason for termination ==
> ** {{badmatch,{error,{already_started,<0.46.0>}}},
>     [{ppool_serv,handle_info,2,[{file,"src/ppool_serv.erl"},{line,90}]},
>      {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,607}]},
>      {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
> 
> The second error is the one I was talking about. On the other hand,
> the entire application has not crashed, but the ppool_sup was
> terminated due to reached_max_restart_intensity and then restarted.
> Was that the intended behaviour?
> 
> Best regards,
> Karolis Petrauskas
> 
> On Fri, Dec 21, 2012 at 4:19 AM, Fred Hebert <> wrote:
> > Hi, Author of LYSE here.
> >
> > The strategy I mention will work because the direct supervisor of both
> > processes uses a 'one_for_all' restart strategy, meaning that if the
> > gen_server crashes, or the supervisor it relies on crashes, both are
> > killed and then restarted. The names are unregistered automatically upon
> > their death, and things should be back up error-free.
> >
> > This is a decision made because the gen_server strongly relies on the
> > supervisor to handle its children, and if it crashes, there's no easy
> > way for a new server to pick up from where the other left off regarding
> > messages, references, tasks to do, etc. If the supervisor dies, then it
> > means the children crashed a lot and you had some kind of problem there
> > anyway.
> >
> > In both cases, it is *a lot* simpler (at least, I think it is) to crash
> > and restart everything from a fresh state rather than have a new server
> > register under a new name and try to figure out what the hell was going
> > on before it came into existence.
> >
> > I think a lot of people want to limit what crashes in their systems as a
> > way to eliminate errors as much as possible, but in this case I believe
> > crashing more stuff makes the case much simpler, and more likely to
> > avoid weird heisenbugs that take weeks to fix down the line when you
> > wonder why you seem to be missing data or have rogue workers hanging
> > around when they shouldn't be. There's a very direct dependency between
> > the two processes, and they don't necessarily make sense without the
> > other being there. They spawned together, and they should die together.
> >
> > Given this design decision, it becomes somewhat useless to register the
> > names for the sake of it, and just passing the pid directly is entirely
> > fine.
> >
> > Regards,
> > Fred.
> >
> > On 12/21, Karolis Petrauskas wrote:
> >> Hi,
> >>
> >> I have a question regarding an example [1] in LYSE. The example
> >> proposes a supervision scheme for server-worker like processes. I
> >> have used this scheme a lot (I learnt Erlang from this book mainly),
> >> but now I'm in doubt. Is the proposed way of accessing a sibling
> >> process (accessing worker_sup from ppool_serv, see [1]) the correct
> >> one? How will the ppool_serv get a PID of the worker_sup after a crash
> >> and restart? As I understand, if one of the processes crashes, both
> >> processes will be restarted and the server should get an error while
> >> starting the worker_sup again (the corresponding child already
> >> exists):
> >>
> >>     handle_info({start_worker_supervisor, Sup, MFA}, S = #state{}) ->
> >>         {ok, Pid} = supervisor:start_child(Sup, ?SPEC(MFA)),
> >>         %% Will this work after restart?
> >>
> >> Or maybe I missed the point? I am aware of some other ways of getting
> >> processes to know each other [2], but I would like to get your
> >> comments on this example. Other schemes for implementing communication
> >> of anonymous sibling processes (in the supervision tree) would be
> >> interesting also.
> >>
> >> [1] http://learnyousomeerlang.com/building-applications-with-otp#implementing-the-supervisors
> >> [2] http://erlang.2086793.n4.nabble.com/supervisor-children-s-pid-td3530959.html#a3531973
> >>
> >> Best regards,
> >> Karolis Petrauskas


