[erlang-questions] massive distribution

Tue Dec 1 23:34:27 CET 2009

On Tue, Dec 1, 2009 at 3:44 PM, Peter Sabaini <peter@REDACTED> wrote:
> On Tue, 2009-12-01 at 10:27 -0600, Garrett Smith wrote:
>> What is happening that makes something an unstable connection?
>
> The behaviour was that nodes seemed to randomly produce error messages,
> e.g.:
>
> =ERROR REPORT==== 9-Jul-2009::13:56:07 ===
> The global_name_server locker process received an unexpected message:
> {{#Ref<0.0.0.1957>,'xy@REDACTED'},false}
>
> Or
>
> =ERROR REPORT==== 9-Jul-2009::14:03:33 ===
> global: 'foo@REDACTED' failed to connect to 'qux@REDACTED'

Hmm...not to say the node count isn't part of the problem, but there
are *lots* of reasons this could happen, none of which have anything
to do with Erlang.

> In my test I just tried to run "fully-meshed", i.e. every node is
> connected to every other node; I ran 50 - 120 nodes distributed across 5
> physical machines on a local, otherwise healthy, LAN.

How often were you running into the "failed to connect" error above
with ~120 nodes? Can you rpc to the nodes and get them to reconnect?

> As you say, running "fully-meshed" is a lot of overhead, which might not
> be necessary in an actual deployment. On the other hand, the
> near-automatic network setup is also very convenient :-)

True. But it's also a bit annoying to see all of the nodes busily
trying to hook up, when I really don't need or want them to. As I've
mentioned before, I use -connect_all false but then lose the global
process registry. What's the solution? I probably just need to dig
deeper.

>> I have a mesh of several dozen nodes and the connections can drop at
>> any time given the basic unreliability of network connections.
>
> TCP/IP in a local LAN should be way more reliable than that.

But not 100%. Something's going to fall over at some point and you're
going to need to deal with it. I use a state machine that monitors a
node's connection to another and goes into a retry mode when something
drops. This works well.
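
A minimal sketch of one way such a monitor could look, written as a
gen_server rather than a formal state machine. It is not the code
described above: the module name peer_watch, the state record, and the
backoff values are all assumptions made for illustration. It relies only
on net_kernel:monitor_nodes/1 and net_kernel:connect_node/1.

```erlang
%% Hypothetical module: an illustrative sketch, not the monitor from the thread.
-module(peer_watch).
-behaviour(gen_server).

-export([start_link/1, status/0, next/2, delay/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

-record(st, {peer, mode = retrying, attempt = 0}).

%% Watch a single peer node.
start_link(Peer) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, Peer, []).

%% Returns {connected | retrying, Attempt}.
status() ->
    gen_server:call(?MODULE, status).

%% Pure transition function: (Mode, Event) -> NewMode.
next(_Mode, up) -> connected;
next(_Mode, down) -> retrying.

%% Retry delay in milliseconds: doubles per attempt, capped at 30 s.
%% The base and the cap are arbitrary choices for this sketch.
delay(Attempt) when is_integer(Attempt), Attempt >= 0 ->
    erlang:min(30000, 1000 bsl erlang:min(Attempt, 5)).

init(Peer) ->
    %% Subscribe to {nodeup, Node} and {nodedown, Node} messages.
    ok = net_kernel:monitor_nodes(true),
    self() ! retry,
    {ok, #st{peer = Peer}}.

handle_call(status, _From, S) ->
    {reply, {S#st.mode, S#st.attempt}, S}.

handle_cast(_Msg, S) ->
    {noreply, S}.

%% The watched peer came up: leave retry mode and reset the counter.
handle_info({nodeup, Peer}, #st{peer = Peer} = S) ->
    {noreply, S#st{mode = next(S#st.mode, up), attempt = 0}};
%% The watched peer dropped while connected: enter retry mode at once.
handle_info({nodedown, Peer}, #st{peer = Peer, mode = connected} = S) ->
    self() ! retry,
    {noreply, S#st{mode = next(connected, down)}};
%% In retry mode: try to connect, and back off on failure.
handle_info(retry, #st{mode = retrying, peer = Peer, attempt = N} = S) ->
    case net_kernel:connect_node(Peer) of
        true ->
            {noreply, S#st{mode = next(retrying, up), attempt = 0}};
        _FalseOrIgnored ->
            erlang:send_after(delay(N), self(), retry),
            {noreply, S#st{attempt = N + 1}}
    end;
%% Everything else, including retry messages that arrive while connected.
handle_info(_Other, S) ->
    {noreply, S}.

terminate(_Reason, _S) ->
    ok.

code_change(_OldVsn, S, _Extra) ->
    {ok, S}.
```

On a distributed node, peer_watch:start_link('qux@somehost') would start
watching that (hypothetical) peer, and peer_watch:status() reports the
current mode and retry count. One known gap in the sketch: a retry timer
left over from an earlier outage can still fire after a quick reconnect
and drop, causing an extra connection attempt, so a production version
would cancel or tag its timers.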

My concern is with how the myriad distributed features of Erlang (global
process registry, again, being just one example) deal with large
meshes. If the errors you're seeing are revealing a problem with
Erlang in large networks, it'd be interesting to get to the underlying
cause.

I think ultimately it's crazy to expect n(n-1) to scale, so at some
point the thing needs to be partitioned.
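
To put rough numbers on that growth, here is a small sketch of the
arithmetic. The module and function names are hypothetical. It assumes
every node holds one connection to every other node, so n(n-1) counts
each connection once from each end and the number of distinct
connections is half of that.

```erlang
%% Hypothetical helper module, for illustration only.
-module(mesh_size).
-export([links/1, endpoints/1]).

%% Distinct node-to-node connections in a full mesh of N nodes.
links(N) when is_integer(N), N >= 0 ->
    N * (N - 1) div 2.

%% Connection endpoints summed over all nodes: each of the N nodes
%% holds N - 1 connections, which gives the n(n-1) figure.
endpoints(N) when is_integer(N), N >= 0 ->
    N * (N - 1).
```

For the 50 to 120 nodes mentioned earlier in the thread,
mesh_size:links(50) is 1225 and mesh_size:links(120) is 7140, so a 2.4x
increase in nodes means nearly six times as many connections to keep
alive.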

Garrett