[erlang-questions] gen_leader discrepancies in reporting of downed nodes across a cluster

Wed Nov 28 05:35:06 CET 2012

On Tue, Nov 27, 2012 at 12:47:52PM -0500, Jeremy Raymond wrote:
> Hi,
> 
> I'm using the gen_leader behaviour from [1] in a 3 node Erlang cluster. I'm
> running into a situation where if I down one of the nodes and bring it back
> up, when it rejoins the cluster the other nodes still see it as being down
> as reported by gen_leader:down/1. However the cycled node itself sees the
> other two nodes as being up. If I cycle the other two nodes, then all three
> will agree again on all of the nodes being available. This doesn't happen
> every time I down a node, but quite often. Another (related?) issue I
> sometimes see is that gen_leader:down/1 reports the same node as
> being down multiple times in the returned list.
> 

Would you mind trying the branch at 

https://github.com/Vagabond/gen_leader_revival/tree/netsplit-tolerance

This branch contains a bunch of work I did to work around these kinds
of issues that Basho was seeing with gen_leader.

Andrew