Starting processes at remote nodes

Tue Jun 8 01:18:18 CEST 2004

On Mon, Jun 07, 2004 at 09:39:40AM +0200, Martin Bjorklund wrote:
}  
}  Suppose that supervisor S runs on A and supervises W on node B.  now
}  the link between A and B goes down.  S will get a
}  {'EXIT',W,noconnection} message.  Now what should S do?

Indeed.  I hadn't gotten that far yet.  What happens is that the
supervisor handles it like any other exit.  It immediately restarts
the child, but since it is using the same child specification as
before, the start fails as long as node B is unreachable.  The
supervisor immediately tries again.  This repeats until either the
start succeeds or the restart strategy causes the supervisor to
shut down.
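
For reference, the limits that end this restart loop are part of the
tuple returned from the supervisor's init/1.  A minimal sketch follows.
The module names starter_sup and server are borrowed from the release
handler output quoted further down.  The limits (3 restarts within
10 seconds), the shutdown timeout and the server:start_link/0 entry
point are illustrative assumptions, not the values actually used:

```erlang
-module(starter_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% Child specification in tuple form:
    %% {Id, {M, F, A}, Restart, Shutdown, Type, Modules}.
    %% server:start_link/0 is an assumed entry point.
    Child = {server, {server, start_link, []},
             permanent, 5000, worker, [server]},
    %% If more than 3 restarts occur within 10 seconds (assumed
    %% values), the supervisor terminates its children and itself.
    {ok, {{one_for_one, 3, 10}, [Child]}}.
```

With a fixed child specification like this one, every restart attempt
targets the same place, which is why the loop only ends when the start
succeeds or the limit is exceeded.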

What should it do?  In my case what I would want is to restart the child
on another node.  This is done easily enough by specifying your own
start function.  When the child dies the start function is run again
and it chooses another node.  If no nodes are available the restart
strategy controls what happens to the supervisor.  This seems good
enough for my needs.  I will accept, though, that the generic behaviour
is not enough on its own to handle supervising remote processes.
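
A minimal sketch of such a start function, under stated assumptions:
the module name remote_start, the candidate node list and the use of
rpc:call/4 with gen_server:start/3 are all illustrative choices, not
necessarily how the start function described above was written.  The
function runs in the supervisor process, so it links to the remote
worker itself before returning {ok, Pid}:

```erlang
-module(remote_start).
-export([start_link/4, pick_node/1]).

%% Start a gen_server callback module Mod on the first reachable
%% node in Nodes.  Called from a child specification, so it executes
%% in the supervisor process; link/1 therefore links the supervisor
%% to the remote worker.
start_link(Nodes, Mod, Args, Opts) ->
    case pick_node(Nodes) of
        {ok, Node} ->
            case rpc:call(Node, gen_server, start, [Mod, Args, Opts]) of
                {ok, Pid} ->
                    %% Small window between start and link: if the
                    %% worker is already gone, the supervisor simply
                    %% receives an 'EXIT' message for it.
                    link(Pid),
                    {ok, Pid};
                {badrpc, Reason} ->
                    {error, {badrpc, Node, Reason}};
                Other ->
                    %% {error, _} | ignore, passed through unchanged
                    Other
            end;
        error ->
            {error, no_nodes_available}
    end.

%% Return the first candidate that answers a ping.  Note that
%% net_adm:ping/1 can block for a while on an unreachable node.
pick_node(Candidates) ->
    case [N || N <- Candidates, net_adm:ping(N) =:= pong] of
        [Node | _] -> {ok, Node};
        []         -> error
    end.
```

It would be referenced from the child specification as, for example,
{remote_start, start_link, [[b@host, c@host], server, [], []]}, where
the node names are hypothetical.  Each restart re-runs the selection,
and {error, no_nodes_available} counts as a failed start, so the
restart strategy decides what happens next.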

}  The other problem is that the supervision tree is used to find
}  processes during code upgrade.  If the module for W is upgraded on
}  node B, W will not be found.

I don't understand this one.  Are you referring to the release handler?

As I understand it the release handler inspects the supervision tree to 
find the processes belonging to an application.  I looked and it does
seem to find the remote processes.  In the return from 
release_handler_1:get_supervised_procs/0 I see my local supervisor
and the remote worker:

	{<0.50.0>,undefined,<5820.66.0>,[server]}
	{undefined,undefined,<0.50.0>,[starter_sup]}

The only place I see this used is to suspend and resume processes.
I have tried it, and sys:suspend/1 and sys:resume/1 both work on the
remote process.  Since the application is running on node A, it is
there that you would do a release upgrade and not on node B.  I
guess you would want to load the new code on node B during the
upgrade.  If I hadn't used the supervision tree, and had
implemented my own supervisor to handle remote workers, they
wouldn't have been found by the release handler (so this may in
fact be an argument in favour of using the standard supervisor).
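
The suspend and resume check can be repeated with a small helper along
these lines.  This is a sketch only: the module name suspend_check is
hypothetical, and it assumes the target is an OTP-compliant process
(gen_server, gen_event and so on) that handles system messages.  The
pid may be local or remote:

```erlang
-module(suspend_check).
-export([suspend_resume/1]).

%% Suspend Pid, read its system state, resume it, and read the
%% state again.  Returns {StateWhileSuspended, StateAfterResume}.
suspend_resume(Pid) ->
    ok = sys:suspend(Pid),
    WhileSuspended = sys_state(Pid),
    ok = sys:resume(Pid),
    {WhileSuspended, sys_state(Pid)}.

%% sys:get_status/1 returns
%% {status, Pid, {module, Mod}, [PDict, SysState, Parent, Dbg, Misc]}
%% where SysState is running | suspended.
sys_state(Pid) ->
    {status, _Pid, _Mod, [_PDict, SysState | _]} = sys:get_status(Pid),
    SysState.
```

Called with the remote worker's pid from node A, this should return
{suspended, running} if the sys calls reach the process across the
distribution link.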

I have never written a .relup, let alone one for a distributed
application, but from what I see the tools to accomplish this seem to
be there.

	-Vance