erlang cluster partitioned

saket chaudhary saketcmf@REDACTED
Sat Nov 27 17:10:36 CET 2021


We hit an issue in production where our cluster of around twenty nodes
ended up asymmetrically partitioned. We have a set of designated Erlang
nodes that every other node routinely pings (via net_adm:ping), in the hope
that all nodes will discover each other through them. We observed that
while these central nodes knew about all the others, that information
wasn't propagating to a subset of nodes despite the pings: calling nodes()
on these island nodes returned an incomplete result containing just the
central nodes. Restarting this subset didn't really help. Every node
outside of this island was fully connected. We eventually had to restart
the entire cluster to fix it. We also ran into global registration calls
getting stuck on those island nodes, but we think that's a symptom of the
partition rather than a cause.
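
To make the symptom concrete, here is a minimal sketch of how each node's
view could be compared from one of the central nodes. It assumes rpc calls
to the island nodes still work; the module name mesh_check and the 5 second
timeout are made up for illustration, and this is not something we
currently run in production.

```erlang
-module(mesh_check).
-export([report/0, missing/2]).

%% Ask every node visible from here for its own nodes() and list
%% the peers that each one is missing. Run this on a node that is
%% believed to see the whole cluster (e.g. a central node).
report() ->
    All = [node() | nodes()],
    [{N, check(N, All)} || N <- All].

%% Compare one node's view against the expected full list.
check(Node, All) ->
    case rpc:call(Node, erlang, nodes, [], 5000) of
        {badrpc, Reason} ->
            {error, Reason};
        Peers when is_list(Peers) ->
            %% nodes() never includes the node itself, so add it back.
            {missing, missing(All, [Node | Peers])}
    end.

%% Nodes in Expected that are absent from Seen, sorted.
missing(Expected, Seen) ->
    lists:sort(Expected -- Seen).
```

On a fully connected cluster every entry from mesh_check:report() should be
{Node, {missing, []}}; in a situation like ours the island nodes would
presumably list each other as missing.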

This has happened more than once in the last month. Any clues as to what
could be going wrong? We're using OTP-23.1.5.3.