[erlang-questions] large Erlang clusters

Serge Aleynikov saleyn@REDACTED
Sun Aug 17 14:22:00 CEST 2008

Thanks Matt, I am indeed familiar with that net_kernel's setting.

My current reasoning for large clusters is as follows.  If all 
inter-node communications are quite busy, this net_kernel's pinging 
doesn't add any extra overhead for them.  However if several busy nodes 
are also connected to a large number of quiet nodes, they'll have to 
coop with extra background pinging load.

Perhaps that load is not to worry about when the "close to" real-time 
response is expected of the "busy" nodes, as in case of an I/O busy node 
signaling select/poll calls will likely happen simultaneously on several 
file descriptors, with some of them carrying pinging payload.  As a 
result the total number of system calls involved in processing these 
events wouldn't be as many as if the events were spatially separated. 
However, since all net_ticktime logic happens in the bytecode, in a busy 
node it steals CPU cycles from other time-sensitive code, leading to 
increased latencies.  This may also be a constraint if other OS 
processes on the same host have higher priority then the emulator and 
the emulator is not allowed to run with SMP mode on so that it doesn't 
compete with CPU resources of other more critical processes.

This background load can be reduced by either bumping up the 
net_ticktime setting (this is quite tricky if part of nodes are located 
in one subnet and others are in a more distant network - unfortunately 
with this feature *all* nodes in the cluster must have the same setting) 
or making sure that global/net_kernel don't maintain a full mesh of 
interconnected nodes using global_groups. In the past I used the later 
approach (together with dist_auto_connect net_kernel's setting) to make 
Erlang applications work through firewalls.  In case of node 
partitioning applications that take advantage of failover and global 
registration must be careful not to deal with several global groups as 
global names may be moved between them during failover.

I was hoping to hear in this thread experiences of those currently 
running large Erlang clusters in production.  In my prior Erlang 
deployments I haven't had cases of more than 30-40 node clusters (where 
total number of OS processes were much larger than that using ei/tcp to 
connect to Erlang, but the Erlang applications would not spread over 
more than 40 hosts).  Since now there is a case for running a 400 node 
cluster (an instance of a VM per host) in a much larger local network 
(with a growth potential to about 600-700 nodes), I want to make sure 
that I won't miss something obvious to those who had similar experiences.



Matt Williamson wrote:
> If you check out net_ticktime in the kernel_app docs, (you can set it with
> net_kernel:set_net_ticktime/1,2), you'll see:
> "Once every TickTime/4 second, all connected nodes are ticked (if anything
> else has been written to a node) and if nothing has been received from
> another node within the last four (4) tick times that node is considered to
> be down..."
> The default ticktime is 60s, meaning a ping every 15 seconds.
> On Sat, Aug 16, 2008 at 5:24 PM, Serge Aleynikov <saleyn@REDACTED> wrote:
>> I suppose that the problem with the max number of sockets is solved by
>> tweaking session limits (ulimit) and using kernel poll (+K true).
>> As I understand, in a 600 node cluster every node will maintain
>> connections to the rest 599 nodes, and send periodic pings.  So, that
>> pinging overhead would be something in the order of 10 events per second
>>  per node in this configuration.  While the number doesn't seem
>> intimidating I wonder if that overhead becomes noticeable in large
>> network configurations and if there are any other guidelines that help
>> architect such large network clusters to keep background load minimal.
>> Serge
>> Viktor Sovietov wrote:
>>> Hi Serge
>>> As far as I know you're only limited with the maximum number of sockets
>>> which are available on your system and with number of atoms which can be
>>> used as node names.
>>> We tested 600 nodes cluster, but I honestly can't recall if there were
>> any
>>> patches to BEAM to increase mentioned parameters.
>>> Sincerely,
>>> --Viktor
>>> Serge Aleynikov-2 wrote:
>>>> Does any one have experience running somewhere between 200 and 400 nodes
>>>> in production?  I recall that Erlang distributed layer had a limit of
>>>> 256 nodes.  Is it still the case?
>>>> I suppose that partitioning the cluster in several global_groups should
>>>> limit the network load and the number of open file descriptors on each
>>>> node would be reduced.
>>>> Are there any other concerns one should be aware of when working with
>>>> such large clusters.
>>>> Serge
>>>> _______________________________________________
>>>> erlang-questions mailing list
>>>> erlang-questions@REDACTED
>>>> http://www.erlang.org/mailman/listinfo/erlang-questions
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://www.erlang.org/mailman/listinfo/erlang-questions

More information about the erlang-questions mailing list