[erlang-questions] Sharing child processes across supervisors

Jayson Vantuyl kagato@REDACTED
Sat Mar 6 01:56:59 CET 2010


It's funny.  A lot of people seem to read Joe's book and come away assuming that you learn the primitives (spawn, link, !, receive, etc.) but then never use them directly, reaching for gen_* instead.

Ironically, I've found that's not the case at all.  While you can't really use receive directly in a gen_server, handle_info/2 provides almost the same functionality.  Beyond that, I routinely have gen_servers spawn large numbers of linked processes for various purposes.

Using links for this not only ensures that cross-connected processes die when the gen_server exits, it can also produce sane reset behavior when other nodes fail (for distributed applications).
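To make that concrete, here's roughly the shape I mean (the module and names below are made up, not from any real code): a gen_server that spawn_links a helper in init and picks the helper's ad-hoc messages up in handle_info/2.  Because of the link, either side dying takes the other with it, and the supervisor restarts the pair.

> %% Illustrative only -- the module and names are made up.
> -module(conn_owner).
> -behaviour(gen_server).
> 
> -export([start_link/0]).
> -export([init/1, handle_call/3, handle_cast/2, handle_info/2,
>          terminate/2, code_change/3]).
> 
> start_link() ->
>   gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).
> 
> init([]) ->
>   %% Spawn a linked helper.  No trap_exit here: if the helper
>   %% crashes we crash too, and our supervisor restarts both of us;
>   %% if we crash, the helper dies with us.
>   Parent = self(),
>   Pid = spawn_link(fun() -> poll_loop(Parent) end),
>   {ok, Pid}.
> 
> handle_call(_Req, _From, Pid) ->
>   {reply, ok, Pid}.
> 
> handle_cast(_Msg, Pid) ->
>   {noreply, Pid}.
> 
> %% Ad-hoc messages from the helper land here -- this is the
> %% "almost the same as receive" part.
> handle_info({poll_result, Result}, Pid) ->
>   error_logger:info_msg("poll result: ~p~n", [Result]),
>   {noreply, Pid};
> handle_info(_Other, Pid) ->
>   {noreply, Pid}.
> 
> terminate(_Reason, _Pid) -> ok.
> code_change(_OldVsn, Pid, _Extra) -> {ok, Pid}.
> 
> poll_loop(Parent) ->
>   timer:sleep(1000),
>   Parent ! {poll_result, erlang:now()},
>   poll_loop(Parent).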

Cross-server links can be touchy, so do watch for bugs.  Since you can't really spawn_link the other servers, you have to rely on the fact that linking to a nonexistent process causes a failure.  For example, assume that you have three servers, M, A, and B.  M is the "master" server, and A and B link to it, as sketched below.
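Concretely, A's (and B's) init can look something like this -- the registered name 'm' is just for illustration:

> %% Sketch of A's init (the registered name 'm' is illustrative only).
> init([]) ->
>   case whereis(m) of
>     undefined ->
>       %% M isn't registered yet: fail the start and let the
>       %% supervisor decide what to do about it.
>       erlang:error({not_started, m});
>     MPid when is_pid(MPid) ->
>       %% If M died between the whereis and this link, link/1
>       %% kills us with noproc, which fails the start just the same.
>       link(MPid),
>       {ok, MPid}
>   end.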

In a stable state, A and B both link to M on startup.  This is fairly sane, as M doesn't need to know about all of its "consumers".  Assume everything starts up fine, which is common, since your applications tend to start in a certain order.  Now assume that B dies.  It will kill M, which will subsequently kill A.  These processes will then be restarted in whichever order their supervisors happen to get scheduled.

This can cause A or B, for example, to respawn a few times before M starts.  While this is often fine, watch out: it (harmlessly) clutters logs and (quite harmfully) can cause the supervisor to exit due to too many restarts.  The easy way to fix this is to put some delay in A and B's init function.  Unfortunately, that sometimes fails spectacularly on really loaded systems.

The "better" way, in my opinion, is to have A & B loop over whereis until M is registered.  I use the following function, in the init of the linking processes:

> -define(DEFAULT_TRIES, 5).
> -define(WAIT_TIME, 100).
> 
> %% Block until Who is registered, or give up after a few tries.
> wait_on(Who) ->
>   wait_on(Who, ?DEFAULT_TRIES).
> 
> wait_on(_Who, 0) ->
>   %% Out of tries: crash so the failure is visible to the supervisor.
>   erlang:error(noproc);
> wait_on(Who, Tries) when Tries > 0 ->
>   case whereis(Who) of
>     undefined ->
>       timer:sleep(?WAIT_TIME),
>       wait_on(Who, Tries - 1);
>     Pid when is_pid(Pid) ->
>       {started, Pid}
>   end.

Alternatively, you could use a simpler version that waits forever and rely on the {timeout,T} option to gen_server:start_link/3,4 to bound the startup time.  Either way, this is usually much more "supervisor-friendly".
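For completeness, the wait-forever variant is just the same loop minus the counter (reusing the ?WAIT_TIME macro from above), with the overall bound coming from start_link's timeout option -- the 5000 below is an arbitrary example:

> %% Wait-forever variant: the bound comes from {timeout,T} instead.
> wait_on_forever(Who) ->
>   case whereis(Who) of
>     undefined ->
>       timer:sleep(?WAIT_TIME),
>       wait_on_forever(Who);
>     Pid when is_pid(Pid) ->
>       {started, Pid}
>   end.
> 
> %% ...and cap the whole startup from the caller's side
> %% (illustrative; 5000 ms is an arbitrary choice).
> start_link() ->
>   gen_server:start_link({local, ?MODULE}, ?MODULE, [],
>                         [{timeout, 5000}]).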

Have fun!

On Mar 5, 2010, at 4:20 PM, Garrett Smith wrote:

> Thanks! This is straightforward and cleans up the supervisory
> hierarchy that I was using.
> 
> On Fri, Mar 5, 2010 at 4:16 PM, Jayson Vantuyl <kagato@REDACTED> wrote:
>> Have the gen_server link to the other gen_servers (or vice-versa).  Then when one fails, the other dies, and the supervisors in the remote apps take care of it.  It might require some synchronization around the restarting (or maybe a delay in somebody's init), but I've done this sort of thing a lot.
>> 
>> On Mar 5, 2010, at 2:04 PM, Garrett Smith wrote:
>> 
>>> I have a gen_server that maintains a connection to something. I'd like
>>> to have a single such gen_server per release (VM instance).
>>> 
>>> I generally run this server under a one_for_all supervisor -- anyone
>>> who depends on that connection is also under this supervisor. When the
>>> connection fails, the dependencies are all restarted.
>>> 
>>> If I have multiple OTP applications that share this connection, each
>>> application will want to supervise the gen_server. I could merge the
>>> supervisory trees of the multiple applications into one, but this
>>> doesn't feel right at all - I want to keep the applications as
>>> separate as possible.
>>> 
>>> I'm tempted to modify the start_link of the connection to look like this:
>>> 
>>>  start_link() ->
>>>    case gen_server:start_link({local, ?SERVER}, ?MODULE, [], []) of
>>>      {ok, Pid} -> {ok, Pid};
>>>      {error, {already_started, Pid}} -> {ok, Pid};
>>>      Other -> Other
>>>    end.
>>> 
>>> My thinking is that this would fake out supervisors that subsequently
>>> tried to start the connection, causing them to link and supervise as
>>> if they actually started it. Naively, it would seem that a connection
>>> failure would be detected by all of the linked supervisors, triggering
>>> the expected cascades. One of the applications would end up restarting
>>> the connection and the rest would link per the "fake out" above.
>>> 
>>> Would this approach be bad for any reason? Is there a better or
>>> standard way of getting supervision across applications?
>>> 
>>> Garrett
>>> 
>> 
>> --
>> Jayson Vantuyl
>> kagato@REDACTED
>> 
>> 
>> 
>> 
> 
> 

-- 
Jayson Vantuyl
kagato@REDACTED


