Divergence in globally registered names

saket chaudhary saketcmf@REDACTED
Tue Oct 13 20:31:06 CEST 2020


Thanks Daniel for the explanation. The fact that convergence has to be
forced manually sounds like a deal-breaker for me. What would be a good
alternative to 'global'?

On Mon, Oct 12, 2020 at 2:35 PM Dániel Szoboszlay <dszoboszlay@REDACTED>
wrote:

> Hi,
>
> Global can indeed end up in an inconsistent state if some nodes get
> disconnected from each other (so you're no longer running on a fully
> connected mesh). When a global name is registered on node X, the change
> is only propagated to the nodes that X is directly connected to. So if X
> and Y are connected, and Y and Z are connected, but X and Z are not, then
> X and Y will both know about the name while Z never gets the update.
>
> When two nodes (re)connect, they only compare the names that each of them
> knows about locally. So it is a bit tricky, but you can actually end up in
> a situation where all nodes are connected, yet the global name databases
> are inconsistent. You will need at least 4 nodes for this scenario to
> happen (e.g. A, B, C & D):
>
>    1. All nodes are connected initially.
>    2. A gets disconnected from C.
>    3. A registers process X under some name: this gets propagated to B &
>    D, but not C.
>    4. B gets disconnected from D.
>    5. B re-registers process Y under the same name: this gets propagated
>    to A & C, but not D, so on D the name still belongs to X.
>    6. A reconnects to C. Since they both know the name belongs to Y, they
>    will inform their half of the network about the new node, but won't issue
>    any global name updates.
>    7. You have all 4 nodes connected again, but A, B & C believe the name
>    belongs to Y, while D believes it belongs to X.
>
> So this can happen, and if you know how global works you can understand
> how, but I don't think many people would expect it to actually
> happen. :)
>
> global:sync() is not really meant to resolve this error. The only solution
> I know about is to manually compare global name registrations shortly after
> you see a new node connecting.
>
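
The manual comparison described above could be sketched roughly as follows.
This is only an illustration, not something from the thread: the module
global_check and its functions are hypothetical names. It relies on
global:registered_names/0 and global:whereis_name/1 being answered from each
node's own copy of the name table, and on rpc:multicall/4 and rpc:call/4 to
ask every node for its view.

```erlang
-module(global_check).
-export([diverged/0, diverged/1]).

%% Hypothetical helper, not part of OTP: report the global names on which
%% the given nodes disagree, together with each node's view of the name.
diverged() ->
    diverged([node() | nodes()]).

diverged(Nodes) ->
    %% Collect the names known on each node; failed calls ({badrpc, _})
    %% are skipped when building the list of names to check.
    {PerNode, _BadNodes} = rpc:multicall(Nodes, global, registered_names, []),
    AllNames = lists:usort(lists:append([Ns || Ns <- PerNode, is_list(Ns)])),
    [{Name, Views} || Name <- AllNames,
                      Views <- [views(Nodes, Name)],
                      not agree(Views)].

%% Ask every node which pid it has for Name. The answer is a pid,
%% 'undefined' if the node does not know the name, or {badrpc, Reason}.
views(Nodes, Name) ->
    [{Node, rpc:call(Node, global, whereis_name, [Name])} || Node <- Nodes].

%% The nodes agree when they all gave the same answer. A failed call is
%% treated as a disagreement so that it shows up in the report.
agree(Views) ->
    length(lists:usort([Answer || {_Node, Answer} <- Views])) =< 1.
```

Calling global_check:diverged() shortly after a nodeup event returns an empty
list when all nodes agree, and otherwise one {Name, Views} entry per
inconsistent name. What to do with a diverged name (for example
re-registering it with global:re_register_name/2) is left to the caller.
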
> Cheers,
> Daniel
>
>
> On Mon, 12 Oct 2020 at 09:23, saket chaudhary <saketcmf@REDACTED> wrote:
>
>> We hit an issue in production where two Erlang nodes in the same
>> cluster agreed on the set of neighbour nodes (nodes() call) but diverged on
>> the globally registered names (global:registered_names()). We're running
>> OTP 23.0.2, but have hit these issues infrequently in the past with OTP 17
>> as well.
>>
>> Calling global:sync() or even net_adm:ping/1 for the remote node that had
>> the globally registered process didn't help. We verified global
>> registration of new names was being propagated across all the nodes.
>> However, it didn't help fix the old names that had diverged. Ultimately, we
>> had to manually re-register the name using a remote shell.
>>
>> Does anyone know if this is expected? I thought the Erlang nodes would
>> gossip their way through to resolve any inconsistencies.
>>
>


More information about the erlang-questions mailing list