[erlang-questions] network outage recovery, again

Tue Oct 9 18:24:36 CEST 2007

(sorry for the length of this.  just trying to provide enough context)

i've got an app that i'm trying to make more robust in the face of 
some recent network outages, as discussed previously:
http://www.erlang.org/pipermail/erlang-questions/2007-June/027535.html

a proposed solution was to have the compute nodes check in periodically.
so i run the nodes something like this:

   erl -sname node -run node oam_start node@REDACTED -detached

where node:oam_start is:

    oam_start( _ ) ->
        checkin(),
        timer:apply_interval( ?HeartbeatInterval, node, checkin, [] ).

    checkin() -> { master, masternode() } ! { checkin, node() }.

to test this, i pull the plug on the node, wait about a minute, and
then the master gets a nodedown message.  plug it back in and wait...
nothing.  check erlang on the node, and it's not running.

if i run it on the node without the -detached flag, and start by hand,
it works ok:

    Eshell V5.3.6.3  (abort with ^G)
    (node@REDACTED)1> node:oam_start( [] ).
    Node node@REDACTED is checking in to fathom@REDACTED
    {ok,{interval,#Ref<0.0.0.49>}}
    (node@REDACTED)2> Node node@REDACTED is checking in to fathom@REDACTED
    Node node@REDACTED is checking in to fathom@REDACTED

    =ERROR REPORT==== 3-Oct-2007::15:11:44 ===
    ** Node fathom@REDACTED not responding **
    ** Removing (timedout) connection **

    Node node@REDACTED is checking in to fathom@REDACTED
    Node node@REDACTED is checking in to fathom@REDACTED
    Node node@REDACTED is checking in to fathom@REDACTED

so it works within the erlang shell, but not detached.  i assumed
this meant that there was some exception that the shell was catching,
but that, uncaught, caused erlang to go away when run with -detached.
so i wrapped the message send in a catch().  no dice.  same problem.
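
for reference, the wrapped version looks roughly like the sketch below.
the module name is made up for this post, and masternode/0 is a stub
(the real one isn't shown above), so treat both as placeholders.

```erlang
-module(node_checkin_sketch).
-export([checkin/0, masternode/0]).

%% hypothetical stub: the real masternode/0 is not shown in this post
masternode() -> 'fathom@localhost'.

%% send the checkin, turning any exit/error from the send into a tuple
%% instead of letting it propagate to the caller
checkin() ->
    case catch ({ master, masternode() } ! { checkin, node() }) of
        { 'EXIT', Reason } -> { error, Reason };
        _Sent              -> ok
    end.
```

as far as i can tell, a send to { Name, Node } for an unreachable node
just drops the message rather than raising, which may be why the catch
makes no difference.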

so, question 1:
what is happening in the above to cause the node to die?

what can i do to ensure that the node doesn't die, but keeps trying
until the network heals itself (typically a few minutes)?

question 2:
in trying to figure out this problem, i noticed that i never got
a nodeup message when the node starts.  so i added a call to
net_kernel:monitor_nodes(true), and now i do.  (previously i had
the master ping each of its nodes and call monitor_node() on all
that responded.  that gave me the nodedown messages the master uses
to track its resources, but no nodeup messages.)

this behavior makes me wonder if there's any reason for the above
mechanism at all.  that is, can i just have master call monitor_nodes(true),
and then rely on nodeup messages to tell me when a node is back?  will
i get one of these if the network heals?  or just when a node actually
starts?
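
to make the question concrete, here's a minimal sketch of what i have
in mind for the master.  the module and function names are invented for
illustration; only net_kernel:monitor_nodes/1 and the { nodeup, Node } /
{ nodedown, Node } messages it delivers are real.  whether nodeup
arrives after the network heals is exactly what i'm unsure about.

```erlang
-module(master_watch_sketch).
-export([start/0, loop/1, handle/2]).

%% spawn a watcher that subscribes to node status messages.
%% assumes the master is running as a distributed node.
start() ->
    spawn(fun() ->
        ok = net_kernel:monitor_nodes(true),
        loop([])
    end).

%% Up is the list of nodes currently believed to be connected
loop(Up) ->
    receive
        { nodeup, _ } = Msg   -> loop(handle(Msg, Up));
        { nodedown, _ } = Msg -> loop(handle(Msg, Up))
    end.

%% pure bookkeeping, kept separate so it can be checked in the shell
handle({ nodeup, Node }, Up)   -> lists:usort([Node | Up]);
handle({ nodedown, Node }, Up) -> lists:delete(Node, Up).
```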

thanks for any insight you can provide.

----
Garry Hodgson, Senior Software Geek, AT&T CSO

nobody can do everything, but everybody can do something.
do something.