[erlang-questions] massive distribution

Tue Dec 1 22:44:53 CET 2009

On Tue, 2009-12-01 at 10:27 -0600, Garrett Smith wrote:
> On Tue, Dec 1, 2009 at 8:22 AM, Peter Sabaini <peter@REDACTED> wrote:
> > On Tue, 2009-12-01 at 09:08 -0500, Kevin A. Smith wrote:
> >> Fully connected meshes suck for large numbers of nodes. Erlang provides a number of
> >> knobs to control how a cluster is stitched together such as "-connect_all false"
> >> and "-hidden".
> >
> > Which would entail keeping track of connected nodes and connection
> > establishment/teardown, correct?
> >
> >> Also, tuning the net tick time (see man 3 net_kernel and man 6 kernel) can be helpful
> >> in keeping a large cluster running.
> >
> > I fiddled around with those a bit. I don't have the exact values at
> > hand, but I set net_ticktime to rather large values, something like
> > 300s, without substantial improvements in the number of nodes able to
> > keep a stable connection.
> 
> What is happening that makes something an unstable connection?

The behaviour was that nodes seemed to randomly produce error messages,
e.g.:

=ERROR REPORT==== 9-Jul-2009::13:56:07 ===
The global_name_server locker process received an unexpected message:
{{#Ref<0.0.0.1957>,'xy@REDACTED'},false}

Or 

=ERROR REPORT==== 9-Jul-2009::14:03:33 ===
global: 'foo@REDACTED' failed to connect to 'qux@REDACTED'

In my test I just tried to run "fully-meshed", i.e. every node is
connected to every other node; I ran 50 - 120 nodes distributed across 5
physical machines on a local, otherwise healthy, LAN.
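
For a sense of scale: a full mesh of n nodes has n(n-1)/2 pairwise
connections. A minimal sketch of that arithmetic for the node counts in
this test (the module and function names are made up for illustration):

```erlang
-module(mesh_size).
-export([connections/1]).

%% Number of pairwise connections in a full mesh of N nodes: N*(N-1)/2.
%% N*(N-1) is always even, so the integer division is exact.
connections(N) when is_integer(N), N >= 0 ->
    N * (N - 1) div 2.
```

mesh_size:connections(50) gives 1225 and mesh_size:connections(120)
gives 7140, with every node holding 49 and 119 connections respectively.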

As you say, running "fully-meshed" is a lot of overhead, which might not
be necessary in an actual deployment. On the other hand, the
near-automatic network setup is also very convenient :-)

> I have a mesh of several dozen nodes and the connections can drop at
> any time given the basic unreliability of network connections. 

TCP/IP in a local LAN should be way more reliable than that.

peter.

> Each
> node, however, is responsible for trying to reestablish a connection
> to a well-known 'hub', which tends to keep the mesh intact even when
> some nodes fall off occasionally. (This is a single point of failure,
> but the 'hub' could easily be a list, like DNS.)
> 
> I've found that setting -connect_all false disables the global process
> registry, which makes the setting practically useless. I'm guessing I've
> missed something here. What is the approach to keeping the global
> registry in sync when -connect_all false is set?
> 
> I've also read about, but not explored, a pattern of segmenting a mesh
> into smaller groups of nodes. From what I understand -- that each node
> tries to connect to each node -- a mesh has n(n-1)/2 connections, so
> 80 nodes would imply 3000+ connections. For most applications, that's
> a lot of unneeded overhead -- not every node is going to need to talk
> to every other node.
> 
> When networks are small, Erlang's global process registration and
> lookup facility is phenomenal. But the out-of-the-box scheme
> definitely presents challenges in large networks.
> 
> I'm definitely curious to know how others have dealt with this type of problem.
> 
> Garrett
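
The hub approach described above could be sketched roughly as follows.
This is only an illustration under stated assumptions, not Garrett's
actual code: the module name, the hub node names and the retry interval
are all hypothetical.

```erlang
-module(hub_keeper).
-export([start/1, missing/2]).

%% Hypothetical retry interval in milliseconds.
-define(RETRY_MS, 5000).

%% Spawn a process that keeps this node connected to the given list of
%% hub nodes, e.g. ['hub1@host1', 'hub2@host2'] (names are assumptions).
start(Hubs) when is_list(Hubs) ->
    spawn(fun() -> loop(Hubs) end).

loop(Hubs) ->
    %% nodes(connected) lists hidden as well as visible connected nodes.
    Missing = missing(Hubs, nodes(connected)),
    %% connect_node/1 returns true, false or ignored; a failed attempt
    %% is simply retried on the next round.
    lists:foreach(fun(Hub) -> net_kernel:connect_node(Hub) end, Missing),
    timer:sleep(?RETRY_MS),
    loop(Hubs).

%% Hubs this node is not currently connected to. The local node itself
%% is never reported as missing.
missing(Hubs, Connected) ->
    [H || H <- Hubs, H =/= node(), not lists:member(H, Connected)].
```

Each node would call hub_keeper:start(['hub1@host1', 'hub2@host2'])
after boot. This only maintains the connections to the hubs; it does not
answer the question of keeping the global registry in sync.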
