<div dir="ltr">The current state of the art for deciding who decides whether a node is down depends on what you mean by 'decides', 'is', and 'down'. <div><br></div><div>It turns out to be quite difficult to determine that a node is down; for example, an issue with networking could cause a node to appear down to some other nodes, but not others.  Is the node down?  Who can say.</div><div><br></div><div>One way of dealing with the problem is to centralize the work distribution, and have slave workers check out jobs and check in results.  Any worker that doesn't check in within a particular timeout period is assumed to have died, and the work is returned to the queue.  Any worker that 'recovers' and tries to check in its overdue work is ignored.</div><div><br></div><div>This is obviously a single point of failure and doesn't scale further than your single master machine can handle work.  That said, it probably works for 95% of job distribution problems in the real world.</div><div><br></div><div>Beyond that, you get into the world of needing leader election and quorums.  Basho has done a great deal of work in this area -- see, e.g. riak_ensemble.  Another approach might be to use, eg., zookeeper or etcd as work queues.  It's a very complicated area, but if you need it, you need it. </div><div><br></div><div>Pragmatically I'd recommend going with the single master unless you can't have unavailability until operator intervention when the master crashes.  It's conceptually and operationally easier to deal with, unless you truly are near that scaling wall.</div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Aug 10, 2015 at 9:36 AM, Roger Lipscombe <span dir="ltr"><<a href="mailto:roger@differentpla.net" target="_blank">roger@differentpla.net</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">I've got a situation where I have a cluster of nodes.<br>

<br>

What's the current state of the art for deciding who decides whether a<br>

node is down? To rephrase: are there any good algorithms (or Erlang<br>

libraries) that decide which subset of nodes should monitor another<br>

(all other?) nodes? I don't want every node monitoring every node (or<br>

do I?)<br>

<br>

Also, once they've detected a failure, how to distribute the dead node's work?<br>

<br>

By work, each node is running a *large* number of different long-lived<br>

jobs. If one of the nodes dies, I need to distribute those jobs fairly<br>

across the other nodes in the cluster. A single job should not run in<br>

more than one place.<br>

<br>

Assume that every node knows about every other node's assigned work,<br>

either through some kind of gossip protocol, or through a shared<br>

store.<br>

<br>

I'm kinda assuming that the monitoring nodes will hold a quick<br>

election, so that there's only a single arbiter, but anything that<br>

shows how to do that without a single leader would be good too.<br>

<br>

Thanks,<br>

Roger.<br>

_______________________________________________<br>

erlang-questions mailing list<br>

<a href="mailto:erlang-questions@erlang.org">erlang-questions@erlang.org</a><br>

<a href="http://erlang.org/mailman/listinfo/erlang-questions" rel="noreferrer" target="_blank">http://erlang.org/mailman/listinfo/erlang-questions</a><br>

</blockquote></div><br></div>