[erlang-questions] massive distribution

Wed Dec 2 11:07:36 CET 2009

On Tue, 2009-12-01 at 16:34 -0600, Garrett Smith wrote:
> On Tue, Dec 1, 2009 at 3:44 PM, Peter Sabaini <peter@REDACTED> wrote:
> > On Tue, 2009-12-01 at 10:27 -0600, Garrett Smith wrote:
> >> What is happening that makes something an unstable connection?
> >
> > The behaviour was that nodes seemed to randomly produce error messages,
> > e.g.:
> >
> > =ERROR REPORT==== 9-Jul-2009::13:56:07 ===
> > The global_name_server locker process received an unexpected message:
> > {{#Ref<0.0.0.1957>,'xy@REDACTED'},false}
> >
> > Or
> >
> > =ERROR REPORT==== 9-Jul-2009::14:03:33 ===
> > global: 'foo@REDACTED' failed to connect to 'qux@REDACTED'
> 
> Hmm...not to say the node count isn't part of the problem, but there
> are *lots* of reasons this could happen, none of which have anything
> to do with Erlang.

My evidence is that these problems appeared (with little load) when I
increased the node count, and disappeared (even under heavy load) when
I stayed below a threshold (it seemed stable with 64 nodes in my case). I
didn't investigate this further, though, as I was more interested in the
behaviour of my application.

> > In my test I just tried to run "fully-meshed", i.e. every node is
> > connected to every other node; I ran 50 - 120 nodes distributed across 5
> > physical machines on a local, otherwise healthy, LAN.
> 
> How often were you running into the "failed to connect" error above
> with ~120 nodes? Can you rpc to the nodes and get them to reconnect?

I don't have the setup at hand anymore, but as far as I can remember
with 120 nodes the connection errors would occur quite frequently, and
the nodes would then be considered down. 
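
On the question of getting nodes to reconnect: a minimal sketch of how
one could check for dropped peers and ping them again. The module name
mesh_check and the idea of keeping an explicit list of expected nodes
are assumptions for illustration, not something I had in my test
setup; net_adm:ping/1, node/0 and nodes/0 are the standard calls.

```erlang
-module(mesh_check).
-export([missing/2, reconnect/1]).

%% Expected nodes that do not appear in the Connected list.
missing(Expected, Connected) ->
    [N || N <- Expected, not lists:member(N, Connected)].

%% Ping every expected node we are currently not connected to.
%% Returns {Node, pong | pang} for each attempt.
reconnect(Expected) ->
    [{N, net_adm:ping(N)} || N <- missing(Expected, [node() | nodes()])].
```

Calling mesh_check:reconnect(ExpectedNodes) on one node, or via
rpc:call/4 on a remote one, reports which peers came back (pong) and
which are still unreachable (pang).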

> > As you say, running "fully-meshed" is a lot of overhead, which might not
> > be necessary in an actual deployment. On the other hand, the
> > near-automatic network setup is also very convenient :-)
> 
> True. But it's also a bit annoying to see all of the nodes busily
> trying to hook up, when I really don't need or want them to. As I've
> mentioned before, I use -connect_all false but then lose the global
> process registry. What's the solution? I probably just need to dig
> deeper.
> 
> >> I have a mesh of several dozen nodes and the connections can drop at
> >> any time given the basic unreliability of network connections.
> >
> > TCP/IP in a local LAN should be way more reliable than that.
> 
> But not 100%. Something's going to fall over at some point and you're
> going to need to deal with it. I use a state machine that monitors a
> node's connection to another and goes into a retry mode when something
> drops. This works well.
> My concern is in how the myriad distributed features of Erlang (global
> process registry, again, being just one example) deal with large
> meshes. If the errors you're seeing are revealing a problem with
> Erlang in large networks, it'd be interesting to get to the underlying
> cause.

Right, that's what I was getting at -- there is no such thing as absolute
reliability, but OTOH the connection problems I observed cannot be
explained by TCP/IP unreliability. 
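
For reference, the monitor-and-retry approach described above could
look roughly like the following. This is only a sketch under
assumptions: it is a plain process rather than a real state machine,
the module name node_watch and the RetryMs parameter are made up, and
it is not the code Garrett actually runs. It relies on
net_kernel:monitor_nodes/1 and net_kernel:connect_node/1.

```erlang
-module(node_watch).
-export([start/2, init/2]).

%% Hypothetical helper: keep a connection to Node alive, retrying
%% every RetryMs milliseconds while it is down.
start(Node, RetryMs) ->
    spawn(?MODULE, init, [Node, RetryMs]).

init(Node, RetryMs) ->
    %% Subscribe to {nodeup, N} / {nodedown, N} messages.
    ok = net_kernel:monitor_nodes(true),
    self() ! retry,
    loop(Node, RetryMs).

loop(Node, RetryMs) ->
    receive
        retry ->
            case net_kernel:connect_node(Node) of
                true ->
                    ok;  % connected; now wait for a nodedown
                _ ->
                    %% false or ignored: try again later
                    erlang:send_after(RetryMs, self(), retry)
            end,
            loop(Node, RetryMs);
        {nodedown, Node} ->
            %% Our peer dropped: go back into retry mode.
            self() ! retry,
            loop(Node, RetryMs);
        _Other ->
            %% nodeup messages and events for other nodes
            loop(Node, RetryMs)
    end.
```

One watcher per peer, e.g. node_watch:start('qux@REDACTED', 5000), keeps
the reconnect logic out of the application code.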

Yeah, it would be interesting to explore this further. I was pretty
strapped for time when I tried this, but maybe I'll get around to doing
it one of these days.

peter.

> I think ultimately it's crazy to expect n(n-1) to scale, so at some
> point the thing needs to be partitioned.
> 
> Garrett
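
To put numbers on the growth of a full mesh, here is a small worked
sketch. The module and function names are made up for illustration,
and it assumes one connection per pair of nodes, so it counts
n(n-1)/2 links, i.e. half of the n(n-1) ordered pairs mentioned above.

```erlang
-module(mesh_links).
-export([links/1, per_node/1]).

%% Number of distinct node pairs in a fully meshed cluster of N
%% nodes: N * (N - 1) / 2.
links(N) when is_integer(N), N >= 0 ->
    N * (N - 1) div 2.

%% Number of connections each single node holds in a full mesh.
per_node(N) when is_integer(N), N >= 1 ->
    N - 1.
```

With these definitions, links(64) is 2016 and links(120) is 7140, so
going from 64 to 120 nodes roughly triples the number of links while
the node count does not even double.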