[erlang-questions] Resisting "noconnection" / Remote termination of nodes

Thu Apr 26 11:38:21 CEST 2012

 Tue, Apr 24, 2012 at 13:52:18, olivier.boudeville wrote about "[erlang-questions] Resisting "noconnection" / Remote termination of nodes": 

> As I want now these terminations to be synchronous (i.e. I want my 
> terminate function to return only when all nodes are down for sure), I 
> used to rely on checking their termination using net_adm:ping/1 (waiting 
> for pong to become pang), but kept on getting (systematically) 
> 'noconnection' errors (exceptions?), which do not seem to be catchable (at 
> least not with a 'try .. catch T:E ->.. end' clause).

This looks like a good case for erlang:monitor_node/2. First, call
monitor_node(Node, true) for each node in the list that is shown
alive, to subscribe to its nodedown message. Then multicast the halt
request and loop on the nodedown messages, dropping each reported node
from the list. Exit when the list is empty.
net_kernel:monitor_nodes/1 provides a similar ability.
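
The steps above can be sketched as follows. This is only an
illustration under assumptions: the module name node_shutdown, the
function stop_nodes/2, the timeout handling and the use of rpc:cast/4
with erlang:halt/0 as the "halt request" are mine, not taken from your
code. It assumes the local node is distributed (monitor_node/2 fails
with notalive otherwise). The explicit liveness check is skipped
because monitor_node/2 delivers nodedown at once for a node it cannot
reach.

```erlang
-module(node_shutdown).
-export([stop_nodes/2]).

%% Halts every node in Nodes and returns ok once a nodedown message has
%% been seen for each of them, or {error, {still_up, Pending}} if that
%% takes longer than TimeoutMs milliseconds.
stop_nodes(Nodes, TimeoutMs) ->
    Targets = lists:usort(Nodes),            % one monitor per node
    %% Subscribe before sending the halt request, so that no nodedown
    %% can be missed. An unreachable node yields nodedown immediately.
    [erlang:monitor_node(N, true) || N <- Targets],
    %% Asynchronous "multicast" halt; init:stop/0 would be a gentler
    %% alternative to erlang:halt/0.
    [rpc:cast(N, erlang, halt, []) || N <- Targets],
    TRef = erlang:start_timer(TimeoutMs, self(), stop_nodes_timeout),
    wait_down(Targets, TRef).

wait_down([], TRef) ->
    cancel(TRef),
    ok;
wait_down(Pending, TRef) ->
    receive
        {nodedown, Node} ->
            %% Deleting a node that is not in the list is harmless.
            wait_down(lists:delete(Node, Pending), TRef);
        {timeout, TRef, stop_nodes_timeout} ->
            [erlang:monitor_node(N, false) || N <- Pending],
            {error, {still_up, Pending}}
    end.

%% Cancels the timer and drops its message if it has already fired.
cancel(TRef) ->
    case erlang:cancel_timer(TRef) of
        false -> receive {timeout, TRef, _} -> ok after 0 -> ok end;
        _Remaining -> ok
    end.
```

Since the caller only ever waits for messages, no 'noconnection' exit
should reach it from this code: rpc:cast/4 does not wait for a reply,
and the loss of each node shows up as an ordinary nodedown message.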

The only side issue I see is that monitor_node tries to connect to the
node if there is no connection yet, so you can run into a race if some
other agent performs a similar action. If this is your case, move node
control into its own manager process.
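
If the implicit connection attempt is a concern, a variant based on
net_kernel:monitor_nodes/1 can avoid it, since that call only
subscribes to status changes and does not open connections. Again a
sketch under assumptions: the names connected_shutdown and
stop_connected/2 are hypothetical, and only nodes that are already
connected and visible (as returned by nodes/0) are halted. It is meant
to run in a dedicated manager process, because the subscription
delivers nodeup/nodedown messages for all nodes to the caller's
mailbox.

```erlang
-module(connected_shutdown).
-export([stop_connected/2]).

%% Halts those members of Nodes that are currently connected, without
%% trying to connect to the others. Returns ok, or
%% {error, {still_up, Pending}} after TimeoutMs milliseconds.
stop_connected(Nodes, TimeoutMs) ->
    %% Subscribe first: a node that drops right after the nodes()
    %% snapshot below is then still reported through nodedown.
    ok = net_kernel:monitor_nodes(true),
    Connected = nodes(),
    Targets = [N || N <- lists:usort(Nodes), lists:member(N, Connected)],
    [rpc:cast(N, erlang, halt, []) || N <- Targets],
    TRef = erlang:start_timer(TimeoutMs, self(), stop_timeout),
    Result = wait_down(Targets, TRef),
    net_kernel:monitor_nodes(false),
    Result.

wait_down([], TRef) ->
    %% Cancel the timer and drop its message if it already fired.
    case erlang:cancel_timer(TRef) of
        false -> receive {timeout, TRef, _} -> ok after 0 -> ok end;
        _Remaining -> ok
    end,
    ok;
wait_down(Pending, TRef) ->
    receive
        {nodedown, Node} ->
            wait_down(lists:delete(Node, Pending), TRef);
        {nodeup, _Node} ->
            %% Unrelated node joined; not our concern here.
            wait_down(Pending, TRef);
        {timeout, TRef, stop_timeout} ->
            {error, {still_up, Pending}}
    end.
```

Status messages for other nodes that arrive around the call are left
in the mailbox, which is one more reason to keep this in a process of
its own.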

> I feel I would need something like net_kernel:unconnect_node/1.

It's erlang:disconnect_node/1, but I doubt it is useful for your goal.

> My question now: how to deal gracefully with such a synchronous node 
> shutdown and to resist to the (intended) loss of node(s)? 

Something in your description is still unclear to me, so feel free to
reformulate it.

-netch-