[erlang-questions] network outage recovery

Wed Jun 27 16:44:48 CEST 2007

Hi Garry.

I have a similar setup to yours, where compute processes send a master
process a 'ready' message to let the master know they are available
for a task.  They send another 'ready' message each time they have
completed that task.

I get around the problem of network failures or master process failure by
using an 'after' clause in my compute processes' 'receive' blocks.  If
they haven't had a request from the master within the timeout, they send
another 'ready' message.
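
A minimal sketch of that compute-process loop might look like the
following.  The module name, the {ready, Pid} / {task, Fun} /
{result, Pid, Value} message shapes and the Timeout argument are all
assumptions made for illustration, not my actual code.

```erlang
-module(worker_sketch).
-export([start/2, loop/2]).

%% Hypothetical API: Master is the master's pid, Timeout is in milliseconds.
start(Master, Timeout) ->
    spawn(?MODULE, loop, [Master, Timeout]).

loop(Master, Timeout) ->
    %% Announce availability.  This runs at startup, after every
    %% completed task, and again after every timeout.
    Master ! {ready, self()},
    receive
        {task, Fun} when is_function(Fun, 0) ->
            %% Assumed task format: a zero-arity fun to evaluate.
            Result = Fun(),
            Master ! {result, self(), Result},
            loop(Master, Timeout);
        stop ->
            ok
    after Timeout ->
        %% Nothing heard from the master for Timeout ms: the 'ready'
        %% may have been lost, so loop around and send it again.
        loop(Master, Timeout)
    end.
```

The timeout is a trade-off: a short one recovers quickly after an
outage but leaves more duplicate 'ready' messages for the master to
discard.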

On the master's side, each time it picks up a 'ready' message to send
a compute process a task, it first runs another 'receive' loop
to flush out any duplicate 'ready' messages from that process.  This
keeps the master from sending multiple tasks to the same compute
process while it is already working on a task.
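
The flush can be sketched roughly as below, assuming 'ready' arrives
as {ready, Pid} and tasks go out as {task, Task}.  Those shapes, and
the module and function names, are hypothetical.  The 'after 0'
clause makes the loop return as soon as no matching message is left
in the mailbox.

```erlang
-module(master_sketch).
-export([flush_ready/1, dispatch/1]).

%% Remove every pending {ready, Worker} message from the caller's
%% mailbox and return how many were dropped.
flush_ready(Worker) ->
    flush_ready(Worker, 0).

flush_ready(Worker, Count) ->
    receive
        %% Worker is already bound, so only this worker's messages match.
        {ready, Worker} ->
            flush_ready(Worker, Count + 1)
    after 0 ->
        %% No more duplicates queued: return immediately.
        Count
    end.

%% Wait for one 'ready', discard duplicates from the same worker,
%% then send that worker exactly one task.
dispatch(Task) ->
    receive
        {ready, Worker} ->
            _Dropped = flush_ready(Worker),
            Worker ! {task, Task},
            Worker
    end.
```

This only discards duplicates that have already been delivered; a
'ready' still in flight when the flush runs will be seen on a later
pass.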

This is just how I evolved my code and may not count as a 'best
practice'.  I am planning on rewriting it all soon to take advantage
of all those OTP behaviours.

Hope this helps,
Philip

On 6/26/07, Garry Hodgson <garry@REDACTED> wrote:
> we've encountered a problem recently in an old erlang system, and i could
> use some thoughts about the best way to resolve it.
>
> the system has a master server, which keeps track of a set of compute nodes
> that it delegates requests to.  it was written before i knew about otp, so the master
> relies on { nodedown, Node } messages to keep track of how things are going.
> when it starts up, it uses net_adm:ping() to test each of its expected nodes,
> constructing a list of available nodes.  when it gets a nodedown, it removes that
> node from its list of available nodes.  when the node comes back up, it checks in,
> the master adds it to the list, and everyone's happy.
>
> the system's been running without much trouble since 2001, but recently we've had
> a couple of cases where network outages (we think) have caused the master to lose
> all of its nodes.  i.e. it gets nodedown messages for each node, all at the same
> time.  since the nodes are actually healthy, they don't restart (and thus check in).
> and the master only checks its "expected" nodes at startup.  so until someone
> restarts the master, it does nothing useful, replying to each query with a noServerAvailable
> reply.  on restart, it pings its nodes, they reply, and all is well.
>
> so i'm looking for ideas on how to deal with this.  i'm not sure of the extent of the outages.
> i could notice when my nodes are gone, and recheck my "expected nodes" list.  but the
> network may not be back yet.  i could have the master loop periodically recheck its nodes,
> but that seems like overkill in the general case.  i could spawn another process to watch
> for the nodes coming back, and notify the master.  i suppose i could do a lot of things,
> but i wonder what the best practices for this kind of thing are?
>
> much as i'd love to rewrite it all using otp, that's probably not an option at this point.
> suggestions?
>
> ----
> Garry Hodgson, Senior Software Geek, AT&T CSO
>
> nobody can do everything, but everybody can do something.
> do something.