Divergence in globally registered names
Mon Oct 12 11:05:19 CEST 2020
Global can indeed end up in inconsistent states if some nodes get
disconnected from each other (so you're no longer running on a fully
connected mesh). When a global name is registered on node X, the change
is only propagated to the nodes that X is directly connected to. So you
can end up in a situation where X and Y are connected, so they both know
about the name, and Y and Z are connected, but X and Z are not, so Z
never gets the update.
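As a rough sketch of how you could spot this first case (a mesh that is
no longer fully connected), you can ask every node for its own nodes()
list and compare it with the local view. The module and function names
here (mesh_check, missing_links) are made up for illustration, and hidden
nodes are not considered:

```erlang
-module(mesh_check).
-export([missing_links/0]).

%% Hypothetical helper: returns [{Node, MissingPeers}] for every node whose
%% own nodes() list lacks some member of the cluster as seen from here.
%% A node that cannot be reached is reported as {Node, {badrpc, Reason}}.
missing_links() ->
    All = lists:sort([node() | nodes()]),
    lists:filtermap(
      fun(N) ->
              case rpc:call(N, erlang, nodes, []) of
                  {badrpc, Reason} ->
                      {true, {N, {badrpc, Reason}}};
                  Peers ->
                      case All -- [N | Peers] of
                          []      -> false;            % N sees everybody
                          Missing -> {true, {N, Missing}}
                      end
              end
      end, All).
```

An empty list only means the mesh looks fully connected right now. It says
nothing about whether the global name tables agree.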
When two nodes (re)connect, they only compare the names they locally know
about. So it is a bit tricky, but you can actually end up in a situation
where all nodes are connected, yet the global name databases are
inconsistent. You will need at least 4 nodes for this scenario to happen
(e.g. A, B, C & D):
1. All nodes are connected initially.
2. A gets disconnected from C.
3. A registers process X under some name: this gets propagated to B & D,
but not C.
4. B gets disconnected from D.
5. B re-registers the same name for process Y: this gets propagated to A &
C, but not D, so on D the name still belongs to X.
6. A reconnects to C. Since they both know the name belongs to Y, they
will inform their half of the network about the new node, but won't issue
any global name updates.
7. You have all 4 nodes connected again, but A, B & C believe the name
belongs to Y, while D believes it belongs to X.
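To make the steps above easier to follow, here is a toy model of them. It
is only a sketch: it does not start any nodes or call global at all, it
just assumes that a registration reaches the registering node and its
directly connected peers, as described above. All names in it
(global_toy, the atoms a, b, c, d, x, y and name) are invented for the
example, and the model stops after step 6, where A and C already agree so
nothing is exchanged.

```erlang
-module(global_toy).
-export([run/0]).

%% Toy model: Links is a list of unordered node pairs, Views maps each
%% node to its local table #{Name => Owner}.
run() ->
    Nodes  = [a, b, c, d],
    Links0 = [{X, Y} || X <- Nodes, Y <- Nodes, X < Y],  % step 1: full mesh
    Views0 = maps:from_list([{N, #{}} || N <- Nodes]),
    Links1 = disconnect(a, c, Links0),                   % step 2
    Views1 = register_name(a, name, x, Links1, Views0),  % step 3
    Links2 = disconnect(b, d, Links1),                   % step 4
    Views2 = register_name(b, name, y, Links2, Views1),  % step 5
    _Links3 = connect(a, c, Links2),                     % step 6: no name updates
    [{N, maps:get(name, maps:get(N, Views2))} || N <- Nodes].

pair(X, Y) when X < Y -> {X, Y};
pair(X, Y)            -> {Y, X}.

connected(X, Y, Links) -> lists:member(pair(X, Y), Links).
disconnect(X, Y, Links) -> lists:delete(pair(X, Y), Links).
connect(X, Y, Links)    -> lists:usort([pair(X, Y) | Links]).

%% The update lands on the registering node and its direct peers only.
register_name(From, Name, Owner, Links, Views) ->
    Targets = [From | [N || N <- maps:keys(Views), connected(From, N, Links)]],
    lists:foldl(
      fun(N, Acc) ->
              maps:update_with(N, fun(V) -> V#{Name => Owner} end, Acc)
      end, Views, Targets).
```

Calling global_toy:run() returns [{a,y},{b,y},{c,y},{d,x}], which is the
split described in step 7.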
So this can happen. If you know how global works you can understand how it
happens, but I don't think many people would expect it to actually
happen. :)
global:sync() is not really meant to resolve this error. The only solution
I know about is to manually compare global name registrations shortly after
you see a new node connecting.
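A minimal sketch of such a manual comparison, assuming that
global:registered_names/0 and global:whereis_name/1 answer from the local
table of the node they run on, so calling them through rpc shows each
node's own view. The module and function names (global_check,
divergent_names) are invented for the example, and a name that is
registered or unregistered while the check runs can show up as a false
positive.

```erlang
-module(global_check).
-export([divergent_names/0, divergent_names/1]).

divergent_names() ->
    divergent_names([node() | nodes()]).

%% Returns [{Name, [{Node, PidOrUndefined}]}] for every name on which the
%% given nodes do not all give the same answer. An unreachable node shows
%% up with a {badrpc, Reason} answer and therefore counts as disagreeing.
divergent_names(Nodes) ->
    Views = [{N, local_view(N)} || N <- Nodes],
    Names = lists:usort([Name || {_, View} <- Views, is_map(View),
                                 Name <- maps:keys(View)]),
    lists:filtermap(
      fun(Name) ->
              Owners = [{N, owner(Name, View)} || {N, View} <- Views],
              case lists:usort([O || {_, O} <- Owners]) of
                  [_] -> false;                  % every node agrees
                  _   -> {true, {Name, Owners}}
              end
      end, Names).

%% One node's view of the global name table, as a map Name => Pid.
local_view(N) ->
    case rpc:call(N, global, registered_names, []) of
        {badrpc, _} = Err ->
            Err;
        Names ->
            maps:from_list(
              [{Name, rpc:call(N, global, whereis_name, [Name])}
               || Name <- Names])
    end.

owner(_Name, {badrpc, _} = Err) -> Err;
owner(Name, View)               -> maps:get(Name, View, undefined).
```

You could run global_check:divergent_names() from a remote shell, or from a
process that reacts to nodeup messages after net_kernel:monitor_nodes(true),
and then decide by hand which registration should win.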
On Mon, 12 Oct 2020 at 09:23, saket chaudhary <saketcmf@REDACTED> wrote:
> We hit upon an issue in production where two erlang nodes in the same
> cluster agreed on the set of neighbour nodes (nodes() call) but diverged on
> the globally registered names (global:registered_names()). We're running OTP
> 23.0.2, but have hit these issues infrequently in the past with OTP17 as
> well.
> Calling global:sync() or even net_adm:ping/1 for the remote node that had
> the globally registered process didn't help. We verified global
> registration of new names was being propagated across all the nodes.
> However, it didn't help fix the old names that had diverged. Ultimately, we
> had to manually re-register the name using a remote shell.
> Does anyone know if this is expected? I thought the erlang nodes would
> gossip their way through to resolve any inconsistencies.