[erlang-questions] Inter-node communication bottleneck
Jihyun Yu
yjh0502@REDACTED
Thu Aug 21 15:58:41 CEST 2014
A spinlock works well when contention is light, but it starts to fail
when tens of processes are sending messages to the same node.
Here are some updates:
- Tuning inet_default_connect_options and inet_default_listen_options
improves performance. I tested with the following options for both, and
throughput increased from ~10MB/s to ~25MB/s:
[{delay_send,true},{high_watermark,1024000}]
I also tried the nodelay, sndbuf, recbuf, and buffer options, but it
seems that these options do not affect throughput.
- The receiving side is also a bottleneck. I ran a benchmark with
different message sizes, and found that the receiving node cannot
utilize more than 100% CPU (one core). It seems that message decoding
and distribution are done in a single thread, which cannot handle lots
of small messages fast enough to saturate the network bandwidth.
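For reference, here is how the options above can be applied to both the
connect and listen sides at node startup (a sketch; the node name is
illustrative):

```
erl -name bench@127.0.0.1 \
    -kernel inet_default_connect_options '[{delay_send,true},{high_watermark,1024000}]' \
    -kernel inet_default_listen_options '[{delay_send,true},{high_watermark,1024000}]'
```

The same pairs can go in a sys.config under the kernel application
instead of on the command line.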
Several thoughts:
- I haven't tested it yet, but the ~25MB/s limit is per node, so with 4
peer nodes I could saturate 1Gbps even with the smallest possible
messages. The performance might not be as bad as I thought.
- I can simulate messaging with term_to_binary/1, binary_to_term/1, and
a TCP socket with the {packet, 2} or {packet, 4} option. It lacks some
optimizations such as the atom cache, but it scales.
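A minimal sketch of that simulation, assuming a {packet, 4} framing so
gen_tcp handles message boundaries (module and function names are
illustrative, not from a real library):

```erlang
%% dist_sim: simulate inter-node messaging over a plain TCP socket.
%% {packet, 4} prefixes every send with a 4-byte length, so each
%% gen_tcp:recv/2 returns exactly one encoded term.
-module(dist_sim).
-export([listen/1, connect/2, send_term/2, recv_term/1]).

listen(Port) ->
    {ok, LSock} = gen_tcp:listen(Port, [binary, {packet, 4},
                                        {active, false},
                                        {reuseaddr, true}]),
    gen_tcp:accept(LSock).

connect(Host, Port) ->
    gen_tcp:connect(Host, Port, [binary, {packet, 4},
                                 {active, false}]).

send_term(Sock, Term) ->
    %% Unlike the real distribution protocol, there is no atom cache:
    %% every atom is re-encoded in full on each send.
    gen_tcp:send(Sock, term_to_binary(Term)).

recv_term(Sock) ->
    {ok, Bin} = gen_tcp:recv(Sock, 0),
    binary_to_term(Bin).
```

Because decoding happens in whatever process calls recv_term/1, this
scheme can spread the binary_to_term/1 work over many schedulers,
which is exactly the part the single distribution receiver cannot do.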
I'll spend some time trying to fix the issue, but I'm not sure what
more I can do...
On Thu, Aug 21, 2014 at 01:15:32PM +0300, Dmitry Kolesnikov wrote:
> Hello,
>
> I am not sure why you are saying that the spin-lock does not impact the result.
> I saw 25% gain on peak (on both loopback and en0 interfaces).
>
> - Dmitry
>
> On 21 Aug 2014, at 07:05, Jihyun Yu <yjh0502@REDACTED> wrote:
>
> > The problem is reproducible *with the loopback interface*. I tested with
> > two Erlang instances on the same machine, communicating via the loopback
> > interface, and the results are the same.
> >
> > I changed a contended lock - qlock in DistEntry - to a spinlock, but it
> > does not affect throughput (messages per second). The 'perf' profiling
> > tool [1] shows that the CPU cycles moved from kernel to userspace.
> >
> > I attach a patch and 10-second sampling results for the mutex and spinlock.
> >
> > [1] https://perf.wiki.kernel.org/
> > <spinlock.patch> <perf_spinlock.report> <perf_mutex.report>
>