Divergence in globally registered names

Dániel Szoboszlay dszoboszlay@REDACTED
Mon Oct 12 11:05:19 CEST 2020


Hi,

Global can indeed end up in inconsistent states if some nodes get
disconnected from each other (so you're no longer running on a fully
connected mesh). When you register a global name on node X, the change
is only propagated to the nodes that X is directly connected to. So you can
end up in a situation where X and Y are connected, so they both know about
the name, and Y and Z are connected, but X and Z are not, so Z never gets
the update.
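
To check whether you are still on a fully connected mesh, you can ask every
node for its own nodes() and look for missing links. This is only a rough
sketch: the module name mesh_check and the function missing_links are made
up for illustration, and it assumes all nodes are visible (not hidden) and
reachable via rpc from the node you run it on.

```erlang
-module(mesh_check).
-export([missing_links/0, missing_links/1]).

%% Ask every known node for its own view of nodes() and report the
%% links that are missing from a full mesh.
missing_links() ->
    All = [node() | nodes()],
    Views = [{N, rpc:call(N, erlang, nodes, [])} || N <- All],
    missing_links(Views).

%% Pure part: Views is [{Node, ConnectedNodes | {badrpc, Reason}}].
%% Returns [{Node, Peer}] for every Peer that Node does not list.
%% Nodes whose rpc call failed are skipped as reporters.
missing_links(Views) ->
    All = [N || {N, _} <- Views],
    [{N, Peer} || {N, Connected} <- Views,
                  is_list(Connected),
                  Peer <- All,
                  Peer =/= N,
                  not lists:member(Peer, Connected)].
```

An empty list from mesh_check:missing_links() means every node lists every
other node. It does not tell you whether the global name tables agree.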

When two nodes (re)connect, they only compare the names they locally know
about. So it is a bit tricky, but you can actually end up in a situation
where all nodes are connected, yet the global name databases are
inconsistent. You will need at least 4 nodes for this scenario to happen
(e.g. A, B, C & D):

   1. All nodes are connected initially.
   2. A gets disconnected from C.
   3. A registers process X under some name: this gets propagated to B & D,
   but not C.
   4. B gets disconnected from D.
   5. B re-registers the same name for process Y: this gets propagated to A &
   C, but not D, so on D the name still belongs to X.
   6. A reconnects to C. Since they both know the name belongs to Y, they
   will inform their half of the network about the new node, but won't issue
   any global name updates.
   7. You have all 4 nodes connected again, but A, B & C believe the name
   belongs to Y, while D believes it belongs to X.

So this can happen. If you know how global works, you can understand how,
but I don't think many people would expect it to actually happen. :)

global:sync() is not really meant to resolve this error. The only solution
I know about is to manually compare global name registrations shortly after
you see a new node connecting.
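
A rough sketch of such a comparison is below. The module name global_check
and its functions are made up for illustration; the only library calls used
are rpc:call/4, global:registered_names/0 and global:whereis_name/1. It
assumes every node is reachable via rpc from the node you run it on.

```erlang
-module(global_check).
-export([diverged/0, diverged/1]).

%% Collect each node's view of the global name table and report the
%% names on which the nodes disagree.
diverged() ->
    All = [node() | nodes()],
    Views = [{N, view(N)} || N <- All],
    diverged(Views).

%% One node's view as [{Name, Pid | undefined}], or {badrpc, Reason}.
view(Node) ->
    case rpc:call(Node, global, registered_names, []) of
        Names when is_list(Names) ->
            [{Name, rpc:call(Node, global, whereis_name, [Name])}
             || Name <- Names];
        {badrpc, _} = Error ->
            Error
    end.

%% Pure part: Views is [{Node, [{Name, Pid}] | {badrpc, Reason}}].
%% Returns [{Name, [{Node, Pid | undefined}]}] for every name that
%% does not resolve to the same pid on all nodes that answered.
diverged(Views) ->
    Good = [{N, V} || {N, V} <- Views, is_list(V)],
    Names = lists:usort([Name || {_, V} <- Good, {Name, _} <- V]),
    [{Name, PerNode}
     || Name <- Names,
        PerNode <- [[{N, proplists:get_value(Name, V)} || {N, V} <- Good]],
        length(lists:usort([P || {_, P} <- PerNode])) > 1].
```

You could call global_check:diverged() from a process that listens to
net_kernel:monitor_nodes(true) and waits a little after each nodeup. A name
that is being registered or unregistered while the check runs can show up as
a false positive, so it is worth re-checking before acting on the result.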

Cheers,
Daniel


On Mon, 12 Oct 2020 at 09:23, saket chaudhary <saketcmf@REDACTED> wrote:

> We hit upon an issue in production where two Erlang nodes in the same
> cluster agreed on the set of neighbour nodes (nodes() call) but diverged on
> the globally registered names (global:registered_names()). We're running OTP
> 23.0.2, but have hit these issues infrequently in the past with OTP 17 as
> well.
>
> Calling global:sync() or even net_adm:ping/1 for the remote node that had
> the globally registered process didn't help. We verified global
> registration of new names was being propagated across all the nodes.
> However, it didn't help fix the old names that had diverged. Ultimately, we
> had to manually re-register the name using a remote shell.
>
> Does anyone know if this is expected? I thought the Erlang nodes would
> gossip their way through to resolve any inconsistencies.
>