[erlang-questions] network outage recovery

Garry Hodgson garry@REDACTED
Mon Jun 25 20:25:21 CEST 2007


we've encountered a problem recently in an old erlang system, and i could
use some thoughts about the best way to resolve it.

the system has a master server, which keeps track of a set of compute nodes
that it delegates requests to.  it was written before i knew about otp, so the master
relies on { nodedown, Node } messages to keep track of how things are going.
when it starts up, it uses net_adm:ping() to test each of its expected nodes,
constructing a list of available nodes.  when it gets a nodedown, it removes that
node from its list of available nodes.  when the node comes back up, it checks in,
the master adds it to the list, and everyone's happy.
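
for concreteness, a rough sketch of the arrangement described above. this is not the real code: the module name, the {checkin, Node} and {request, From, Query} message shapes, and the round-robin choice are all assumptions made up for illustration. only net_adm:ping/1 and erlang:monitor_node/2 are real library calls.

```erlang
-module(master_sketch).
-export([start/1, probe/1, loop/1]).

%% hypothetical sketch of the master; names and message shapes are assumed.

start(Expected) ->
    spawn(fun() -> loop(probe(Expected)) end).

%% ping each expected node and keep the ones that answer pong.
%% monitor_node/2 asks for a {nodedown, Node} message if one later goes away.
probe(Expected) ->
    Up = [N || N <- Expected, net_adm:ping(N) =:= pong],
    lists:foreach(fun(N) -> erlang:monitor_node(N, true) end, Up),
    Up.

loop(Available) ->
    receive
        {nodedown, Node} ->
            %% drop the node; nothing here ever tries to find it again
            loop(lists:delete(Node, Available));
        {checkin, Node} ->
            %% a restarted node announces itself (assumed message shape)
            case lists:member(Node, Available) of
                true ->
                    loop(Available);
                false ->
                    erlang:monitor_node(Node, true),
                    loop([Node | Available])
            end;
        {request, From, _Query} when Available =:= [] ->
            From ! noServerAvailable,
            loop(Available);
        {request, From, Query} ->
            %% hand the query to the first node, then rotate the list
            [Node | Rest] = Available,
            From ! {delegate, Node, Query},
            loop(Rest ++ [Node])
    end.
```

the thing to notice is that the only way back into the available list is a {checkin, Node} message, and a healthy node that never restarted has no reason to send one.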

the system's been running without much trouble since 2001, but recently we've
had a couple of cases where network outages (we think) have caused the master to lose
all of its nodes.  i.e. it gets nodedown messages for each node, all at the same
time.  since the nodes are actually healthy, they don't restart (and thus never check in again).
and the master only checks its "expected" nodes at startup.  so until someone
restarts the master, it does nothing useful, replying to each query with a noServerAvailable
reply.  on restart, it pings its nodes, they reply, and all is well.

so i'm looking for ideas on how to deal with this.  i'm not sure of the extent of the outages.
i could notice when my nodes are gone, and recheck my "expected nodes" list.  but the
network may not be back yet.  i could have the master loop periodically recheck its nodes,
but that seems like overkill in the general case.  i could spawn another process to watch
for the nodes coming back, and notify the master.  i suppose i could do a lot of things,
but i wonder what the best practices for this kind of thing are?
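
as one possible shape for the "spawn another process" option, here is a minimal sketch, not a recommendation. it assumes the master accepts a {checkin, Node} message (a made-up shape) and that the master starts the watcher when it sees a nodedown. the module and function names are hypothetical; net_adm:ping/1, timer:sleep/1 and the lists functions are the only library calls.

```erlang
-module(node_watcher).
-export([start/3, missing/2, watch/3]).

%% hypothetical helper process; all names here are assumed.

%% Master:   pid to notify when a node answers again
%% Missing:  nodes currently believed to be gone
%% Interval: milliseconds to wait between sweeps
start(Master, Missing, Interval) ->
    spawn(fun() -> watch(Master, Missing, Interval) end).

%% expected nodes that are not in the available list
missing(Expected, Available) ->
    [N || N <- Expected, not lists:member(N, Available)].

%% nothing left to look for: the watcher exits,
%% so there is no polling at all in the normal case
watch(_Master, [], _Interval) ->
    ok;
watch(Master, Missing, Interval) ->
    timer:sleep(Interval),
    %% split into nodes that answer pong and nodes that are still gone
    {Back, StillGone} =
        lists:partition(fun(N) -> net_adm:ping(N) =:= pong end, Missing),
    lists:foreach(fun(N) -> Master ! {checkin, N} end, Back),
    watch(Master, StillGone, Interval).
```

the master would call something like node_watcher:start(self(), node_watcher:missing(Expected, Available), 30000) after a nodedown (the 30 second interval is arbitrary). keeping the pings out of the master loop also means a slow ping to an unreachable host doesn't hold up requests. if several nodedowns arrive at once this would start several watchers, so a real version would want to track whether one is already running.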

much as i'd love to rewrite it all using otp, that's probably not an option at this point.
suggestions?

----
Garry Hodgson, Senior Software Geek, AT&T CSO

nobody can do everything, but everybody can do something.
do something.



