erlang cluster partitioned

Mon Nov 29 11:19:34 CET 2021

>
> We have a set of designated erlang nodes that every other node routinely
> pings and the hope is that they'll all discover the rest of each other
> through them.
>

Neighbour discovery only happens when a connection to a node is being
established. If I understand correctly, you also rely on repeated pings to
restore broken connections, but that won't work. E.g. if nodes A and B are
connected, and a freshly started node C pings B, C will learn about A and
connect to it, forming a full mesh. However, if the connection between A
and C later breaks down while B remains connected to both A and C, pinging
B from either A or C won't help restore the broken connection.
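
To illustrate the point, here is a minimal sketch of a check that pings
the missing nodes directly instead of pinging only the central ones. The
module name mesh_check and the Expected list are hypothetical; the sketch
assumes every node knows the full list of nodes that should be in the
mesh, which may not hold in your setup.

```erlang
-module(mesh_check).
-export([missing/1, reconnect/1]).

%% Expected nodes that this node currently has no connection to.
%% nodes() only lists visible connected nodes, so hidden nodes are
%% always reported as missing.
missing(Expected) ->
    Connected = [node() | nodes()],
    [N || N <- Expected, not lists:member(N, Connected)].

%% Ping every missing node directly. Returns [{Node, pong | pang}].
%% Pinging an already connected central node would not trigger any
%% new connection attempts, so it is skipped on purpose.
reconnect(Expected) ->
    [{N, net_adm:ping(N)} || N <- missing(Expected)].
```

Calling mesh_check:reconnect(Expected) periodically on A or C would
attempt to re-establish the A-C connection itself, which pinging B cannot
do.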

> Restarting this subset didn't really help.
>
This is strange: restarting a node would imply it makes a new connection to
a central node and from there should be able to discover and connect to the
entire network. Are you sure there is nothing like a firewall on the
network that would prevent island nodes from making outgoing connections to
the non-central nodes, but would let all nodes make outgoing connections to
the island nodes?

> We also ran into problems with global registration calls that were stuck on
> those island nodes but we think that's more of a symptom than a cause as we
> ran partitioned.
>
If a global call is stuck, it suggests the global_name_server processes
know about all the nodes, but can't communicate with them. Or maybe the
global_name_server processes run into some kind of deadlock waiting for
each other and thus not handling requests? That could even explain why
restarting the nodes one-by-one didn't help (the neighbour discovery is the
responsibility of the global_name_servers).

I'd suggest finding out what the global_name_servers are up to if this
problem occurs again. You can check with process_info whether they are
idling (the current function being gen_server:loop/7) or not, then query
the state of idle processes with sys:get_state/1 and the entire call stack
of busy processes with process_info(whereis(global_name_server),
backtrace). Inspecting the global_* ETS tables may also help in
understanding the situation.
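
The checks above could be bundled roughly like this. It is only a sketch:
the module name global_diag, the 5 second timeout and the shape of the
returned tuples are my own choices, and the arity of gen_server:loop
is deliberately ignored in the match. The table dump assumes the
global_* tables are named and readable from other processes.

```erlang
-module(global_diag).
-export([inspect/0, inspect/1, global_tables/0]).

inspect() -> inspect(global_name_server).

%% Name is a locally registered process name.
inspect(Name) ->
    case whereis(Name) of
        undefined -> {error, not_running};
        Pid -> inspect_pid(Pid)
    end.

inspect_pid(Pid) ->
    case process_info(Pid, [current_function, message_queue_len]) of
        undefined ->
            %% The process died after whereis/1.
            {error, not_running};
        [{current_function, {gen_server, loop, _}} = CF, QLen] ->
            %% Idle in the gen_server receive loop: ask for its state.
            State = try sys:get_state(Pid, 5000)
                    catch Class:Reason -> {Class, Reason}
                    end,
            {idle, [CF, QLen, {state, State}]};
        [CF, QLen] ->
            %% Busy: take the call stack instead of sending a message
            %% that it might never handle.
            {busy, [CF, QLen, process_info(Pid, backtrace)]}
    end.

%% Contents of every named ETS table whose name starts with "global_".
global_tables() ->
    [{T, ets:tab2list(T)} || T <- ets:all(), is_atom(T),
                             lists:prefix("global_", atom_to_list(T))].
```

The backtrace is returned as a binary, so printing it with
io:format("~s~n", [Bin]) is easier to read than the raw term. Running
global_diag:inspect() on both an island node and a well-connected node
and comparing the results would be a reasonable first step.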

Cheers,
Daniel
