[erlang-questions] gen_leader discrepancies in reporting of downed nodes across a cluster

Tue Nov 27 18:47:52 CET 2012

Hi,

I'm using the gen_leader behaviour from [1] in a 3 node Erlang cluster. I'm
running into a situation where if I down one of the nodes and bring it back
up, when it rejoins the cluster the other nodes still see it as being down
as reported by gen_leader:down/1. However, the cycled node itself sees the
other two nodes as being up. If I cycle the other two nodes, then all three
agree again that every node is available. This doesn't happen every time I
down a node, but quite often. Another (related?) issue is that
gen_leader:down/1 sometimes reports the same node as being down multiple
times in the returned list.

The node that is being misreported as down is still able to make
requests to the leader, and when I cycle the other nodes, leader election
appears to behave normally. Any ideas on what causes the misreporting? It
makes me think that the leader election may not be working correctly and
that the cluster is in an invalid or inconsistent state.
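
To make the discrepancy concrete, here is a minimal sketch of how the
per-node views could be compared once they have been gathered. The module
name down_report and both function names are hypothetical and not part of
gen_leader; the sketch assumes each node's down list has already been
collected by some other means (for example, from what gen_leader:down/1
returns on that node) into {Reporter, DownNodes} pairs.

```erlang
-module(down_report).
-export([dedupe/1, disagreements/1]).

%% Hypothetical helper: drop repeated entries from a list of down nodes.
%% lists:usort/1 sorts the list and removes duplicates.
dedupe(DownNodes) ->
    lists:usort(DownNodes).

%% Hypothetical helper: Views is a list of {Reporter, DownNodes} pairs,
%% one per node that answered. Returns {Reporter, Node} pairs where
%% Reporter lists Node as down even though Node itself answered.
disagreements(Views) ->
    Alive = [Reporter || {Reporter, _Down} <- Views],
    [{Reporter, Node} || {Reporter, Down} <- Views,
                         Node <- dedupe(Down),
                         lists:member(Node, Alive)].
```

With the situation described above (two nodes still listing the cycled
node c@host as down, one of them twice), calling
down_report:disagreements([{a@host, [c@host, c@host]}, {b@host, [c@host]},
{c@host, []}]) returns [{a@host, c@host}, {b@host, c@host}].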

[1]: https://github.com/abecciu/gen_leader_revival

--
Jeremy