[erlang-questions] UDP receive performance

Danil Zagoskin z@REDACTED
Thu Jun 21 15:49:17 CEST 2018


Hi!
Sorry for the late reply.



The patch works like a charm. Huge performance boost!
Before:

        Thread    alloc      aux      bifbusy_wait check_io emulator
   ets       gc  gc_full      nif    other     port     send    sleep
 timers
Stats per type:
         async    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
 0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
  0.00%
           aux    0.04%    0.05%    0.00%    0.00%    2.02%    0.00%
 0.00%    0.00%    0.00%    0.00%    0.01%    0.00%    0.00%   97.88%
  0.00%
dirty_cpu_sche    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
 0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
  0.00%
dirty_io_sched    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
 0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
  0.00%
          poll    0.04%    0.00%    0.00%    0.00%    6.03%    0.00%
 0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   93.93%
  0.00%
     scheduler   56.63%    0.80%    0.48%   16.13%    0.00%    4.07%
 0.00%    1.24%    0.00%    0.00%    2.69%   14.89%    0.00%    2.80%
  0.25%


After:

        Thread    alloc      aux      bifbusy_wait check_io emulator
   ets       gc  gc_full      nif    other     port     send    sleep
 timers
Stats per type:
         async    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
 0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
  0.00%
           aux    0.00%    0.00%    0.00%    0.00%   14.57%    0.00%
 0.00%    0.00%    0.00%    0.00%    1.24%    0.00%    0.00%   84.19%
  0.00%
dirty_cpu_sche    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
 0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
  0.00%
dirty_io_sched    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
 0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
  0.00%
          poll    0.00%    0.00%    0.00%    0.00%    3.10%    0.00%
 0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   96.90%
  0.00%
     scheduler    2.86%    0.78%    0.38%   57.22%    0.00%    2.81%
 0.00%    0.49%    0.00%    0.00%    7.32%    9.87%    0.00%   18.08%
  0.18%


The machine is quad-core i5-6400@REDACTED with Linux 4.13.0
Load generated with this commands (10 senders, 10 readers on each address):

[udptest:start_sender({239,9,9,X}, 3999) || X <- lists:seq(1,10)],
[udptest:start_reader({239,9,9,X}, 3999) || X <- lists:seq(1,10), _ <-
lists:seq(1, 10)].


Thank you for the awesome work!

P.S. For some reason there is an increased check_io time in aux thread, but
I think it's not critical and may be related to more often socket reads due
to lots of idle time.

On Wed, Jun 20, 2018 at 2:56 PM Jonas Falkevik <
jonas.falkevik@REDACTED> wrote:

>
> On Mon, Jun 18, 2018 at 3:57 PM, Lukas Larsson <lukas@REDACTED> wrote:
>
>>
>>
>>>> Something like
>>> https://github.com/falkevik/otp/commit/444fb00ff2a9d1f40a8c66f48bea1cf3f07ca86c
>>> ?
>>>
>>
>> I was thinking something more like this:
>> https://github.com/garazdawi/otp/tree/lukas/erts/realloc_thr_pref_shrink_th
>>
>> This makes it so that the absolute and relative single block shrink
>> thesholds are respected for reallocs made on a remote scheduler. This
>> should solve the problem that you have found also, through as I still
>> haven't reproduced it I can't test that it actually solves it.
>>
>>
> Are you adding the multicast network to the loop back interface? Using
> some other interface that does not allow multicast traffic?
>
> I have been able to reproduce it in macOS Sierra (Darwin Kernel Version
> 16.7.0: Fri Apr 27 17:59:46 PDT 2018; root:xnu-3789.73.13~1/RELEASE_X86_64
> x86_64)
> and also on Ubuntu Linux 16.04 LTS with kernel 4.4.0-112-generic.
>
> Danil Zagoskin, do you have time to try and see if there is any change in
> the behaviour for you with a patched system?
>
>
> With the bugfix, the beam is spending most of its time in sched_spin_wait
> on linux and sched_yield on macOS.
>
> Stats below are from OTP 21 and then OTP21 with bugfix on linux
>
> Erlang/OTP 21 [erts-10.0] [source] [64-bit] [smp:8:8] [ds:8:8:10]
> [async-threads:1] [hipe]
>
> Eshell V10.0  (abort with ^G)
> 1> udptest:start_sender({239,9,9,9}, 3999).
> <0.78.0>
> 2> [udptest:start_reader({239,9,9,9}, 3999) || _ <- lists:seq(1, 40)].
> [<0.80.0>,<0.81.0>,<0.82.0>,<0.83.0>,<0.84.0>,<0.85.0>,
>  <0.86.0>,<0.87.0>,<0.88.0>,<0.89.0>,<0.90.0>,<0.91.0>,
>  <0.92.0>,<0.93.0>,<0.94.0>,<0.95.0>,<0.96.0>,<0.97.0>,
>  <0.98.0>,<0.99.0>,<0.100.0>,<0.101.0>,<0.102.0>,<0.103.0>,
>  <0.104.0>,<0.105.0>,<0.106.0>,<0.107.0>,<0.108.0>|...]
> 3> msacc:start(10000), msacc:print().
> Average thread real-time    : 10002138 us
> Accumulated system run-time : 62576106 us
> Average scheduler run-time  :  7775309 us
>
>         Thread      aux check_io emulator       gc    other     port
>  sleep
>
> Stats per thread:
>      async( 0)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
>        aux( 1)    0.06%    0.96%    0.00%    0.00%    0.07%    0.00%
> 98.90%
> dirty_cpu_( 1)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_cpu_( 2)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_cpu_( 3)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_cpu_( 4)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_cpu_( 5)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_cpu_( 6)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_cpu_( 7)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_cpu_( 8)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_io_s( 1)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_io_s( 2)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_io_s( 3)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_io_s( 4)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_io_s( 5)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_io_s( 6)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_io_s( 7)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_io_s( 8)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_io_s( 9)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_io_s(10)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
>       poll( 0)    0.00%    2.63%    0.00%    0.00%    0.00%    0.00%
> 97.37%
>  scheduler( 1)    0.82%    0.00%    0.52%    0.35%    7.33%   67.91%
> 23.08%
>  scheduler( 2)    0.76%    0.00%    0.79%    0.34%    7.30%   68.61%
> 22.21%
>  scheduler( 3)    0.74%    0.00%    0.83%    0.33%    7.17%   68.54%
> 22.38%
>  scheduler( 4)    0.76%    0.00%    0.72%    0.34%    7.38%   68.84%
> 21.96%
>  scheduler( 5)    0.79%    0.00%    0.82%    0.34%    7.21%   68.94%
> 21.90%
>  scheduler( 6)    0.80%    0.00%    0.74%    0.31%    7.15%   68.56%
> 22.44%
>  scheduler( 7)    0.87%    0.00%    0.72%    0.37%    7.24%   68.43%
> 22.36%
>  scheduler( 8)    0.77%    0.00%    0.68%    0.35%    7.41%   69.02%
> 21.77%
>
> Stats per type:
>          async    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
>            aux    0.06%    0.96%    0.00%    0.00%    0.07%    0.00%
> 98.90%
> dirty_cpu_sche    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_io_sched    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
>           poll    0.00%    2.63%    0.00%    0.00%    0.00%    0.00%
> 97.37%
>      scheduler    0.79%    0.00%    0.73%    0.34%    7.27%   68.61%
> 22.26%
> ok
> 4>
>
>
> $ ../src/udp_performace_bug_fix/bin/erl
> Erlang/OTP 21 [RELEASE CANDIDATE 2] [erts-9.3.2] [source] [64-bit]
> [smp:8:8] [ds:8:8:10] [async-threads:1] [hipe]
>
> Eshell V9.3.2  (abort with ^G)
> 1> udptest:start_sender({239,9,9,9}, 3999).
> <0.78.0>
> 2> [udptest:start_reader({239,9,9,9}, 3999) || _ <- lists:seq(1, 40)].
> [<0.80.0>,<0.81.0>,<0.82.0>,<0.83.0>,<0.84.0>,<0.85.0>,
>  <0.86.0>,<0.87.0>,<0.88.0>,<0.89.0>,<0.90.0>,<0.91.0>,
>  <0.92.0>,<0.93.0>,<0.94.0>,<0.95.0>,<0.96.0>,<0.97.0>,
>  <0.98.0>,<0.99.0>,<0.100.0>,<0.101.0>,<0.102.0>,<0.103.0>,
>  <0.104.0>,<0.105.0>,<0.106.0>,<0.107.0>,<0.108.0>|...]
> 3> msacc:start(10000), msacc:print().
> Average thread real-time    : 10000416 us
> Accumulated system run-time : 19541445 us
> Average scheduler run-time  :  2279822 us
>
>         Thread      aux check_io emulator       gc    other     port
>  sleep
>
> Stats per thread:
>      async( 0)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
>        aux( 1)    0.00%   10.72%    0.00%    0.00%    0.00%    0.00%
> 89.28%
> dirty_cpu_( 1)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_cpu_( 2)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_cpu_( 3)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_cpu_( 4)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_cpu_( 5)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_cpu_( 6)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_cpu_( 7)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_cpu_( 8)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_io_s( 1)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_io_s( 2)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_io_s( 3)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_io_s( 4)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_io_s( 5)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_io_s( 6)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_io_s( 7)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_io_s( 8)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_io_s( 9)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_io_s(10)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
>       poll( 0)    0.00%    2.31%    0.00%    0.00%    0.00%    0.00%
> 97.69%
>  scheduler( 1)    0.46%    0.00%    0.43%    0.14%   17.57%    4.07%
> 77.33%
>  scheduler( 2)    0.50%    0.00%    0.44%    0.15%   17.13%    4.21%
> 77.57%
>  scheduler( 3)    0.49%    0.00%    0.47%    0.16%   17.07%    4.35%
> 77.46%
>  scheduler( 4)    0.45%    0.00%    0.42%    0.14%   16.25%    4.02%
> 78.72%
>  scheduler( 5)    0.48%    0.00%    0.42%    0.14%   17.04%    4.04%
> 77.88%
>  scheduler( 6)    0.46%    0.00%    0.39%    0.14%   16.57%    3.77%
> 78.67%
>  scheduler( 7)    0.53%    0.00%    1.88%    0.27%   17.60%    8.22%
> 71.50%
>  scheduler( 8)    0.47%    0.00%    0.42%    0.15%   16.60%    3.86%
> 78.50%
>
> Stats per type:
>          async    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
>            aux    0.00%   10.72%    0.00%    0.00%    0.00%    0.00%
> 89.28%
> dirty_cpu_sche    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
> dirty_io_sched    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%
>  100.00%
>           poll    0.00%    2.31%    0.00%    0.00%    0.00%    0.00%
> 97.69%
>      scheduler    0.48%    0.00%    0.61%    0.16%   16.98%    4.57%
> 77.20%
> ok
>
> Jonas
>
>

-- 
Danil Zagoskin | z@REDACTED
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20180621/b0affc54/attachment.htm>


More information about the erlang-questions mailing list