[erlang-questions] UDP receive performance

Jesper Louis Andersen jesper.louis.andersen@REDACTED
Thu May 24 23:22:17 CEST 2018


I looked up the unaligned stuff. There is no aligned variant, and the
unaligned variant just sets up a prologue before entering the main loop,
where you do have alignment. So I wouldn't worry about that, but rather
about where the calls are being made and where the memory is copied around.
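
To illustrate the prologue idea, here is a minimal sketch in C. It is not
glibc's code (as far as I know the real routines are hand-written assembly
using vector registers); sketch_copy is a hypothetical name, the 8-byte word
size is an assumption picked for illustration, and the buffers are assumed
not to overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, NOT glibc's implementation: copy n bytes from src
 * to dst (non-overlapping), aligning the destination before the main loop. */
void *sketch_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));

    /* Prologue: byte copies until the destination is 8-byte aligned. */
    while (n > 0 && ((uintptr_t)d % sizeof(uint64_t)) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Main loop: d is now 8-byte aligned on every iteration. The source
     * may still be misaligned, so both accesses go through fixed-size
     * memcpy calls, which compilers typically lower to single loads and
     * stores. */
    while (n >= sizeof(uint64_t)) {
        uint64_t word;
        memcpy(&word, s, sizeof word);
        memcpy(d, &word, sizeof word);
        d += sizeof word;
        s += sizeof word;
        n -= sizeof word;
    }

    /* Epilogue: the remaining tail of fewer than 8 bytes. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

The point is that the "unaligned" in the symbol name only describes what the
routine accepts as input; the cost is dominated by how many bytes get copied,
which is why the callers matter more than the alignment.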

On Thu, May 24, 2018 at 5:35 PM Danil Zagoskin <z@REDACTED> wrote:

> -    2.57%     0.02%  31_scheduler     beam.smp          [.] process_main
>    - 2.55% process_main
>       - 2.43% erts_schedule
>          - 2.26% erts_port_task_execute
>             - 2.25% packet_inet_input.isra.31
>                - 2.05% driver_realloc_binary
>                   - 2.05% realloc_thr_pref
>                        1.87% __memmove_avx_unaligned_erms
>
> That's a 40-core Xeon E5-2640, so 2.5% on a single scheduler means that
> scheduler is at roughly 100%. Also, it's Linux kernel 4.9.
>
>
> On a machine with kernel 4.13 and a quad-core Xeon E31225, at half of the
> E5's load, we have:
> -   16.11%     0.10%  1_scheduler      beam.smp          [.] erts_schedule
>    - 16.01% erts_schedule
>       - 13.62% erts_port_task_execute
>          - 13.11% packet_inet_input.isra.31
>             - 11.37% driver_realloc_binary
>                - 11.33% realloc_thr_pref
>                   - 10.50% __memcpy_avx_unaligned
>                        5.06% __memcpy_avx_unaligned
>                      + 1.04% page_fault
>                     0.66% do_erts_alcu_realloc.constprop.31
>             + 0.79% 0x108f3
>               0.55% driver_deliver_term
>         1.30% sched_spin_wait
>
> It seems the kernel version may change a lot; I will run more tests.
>
> But it seems the memory operations are unaligned, which may not be very
> efficient.
>
> On Thu, May 24, 2018 at 1:24 PM, Lukas Larsson <lukas@REDACTED> wrote:
>
>> Can you run perf with "--call-graph dwarf" and see which functions call
>> memmove?
>>
>> On Thu, May 24, 2018 at 12:21 PM, Danil Zagoskin <z@REDACTED> wrote:
>>
>>> Yes, I've built a fresh master today (Erlang/OTP 21 [RELEASE CANDIDATE
>>> 1] [erts-9.3.1]), and nothing has changed.
>>>
>>> On Thu, May 24, 2018 at 1:17 PM, Sergej Jurečko <
>>> sergej.jurecko@REDACTED> wrote:
>>>
>>>> OTP-21 rc1 has enhanced IO scalability. Have you tried if it is any
>>>> better? UDP performance in Erlang was never great...
>>>>
>>>> Regards,
>>>> Sergej
>>>>
>>>>
>>>> On 24 May 2018, at 12:03, Danil Zagoskin <z@REDACTED> wrote:
>>>>
>>>> Yes, we have {read_packets, 100} in receive socket options.
>>>>
>>>> On Thu, May 24, 2018 at 10:23 AM, Raimo Niskanen <
>>>> raimo+erlang-questions@REDACTED> wrote:
>>>>
>>>>> On Wed, May 23, 2018 at 06:28:55PM +0300, Danil Zagoskin wrote:
>>>>> > Hi!
>>>>> >
>>>>> > We have a performance problem receiving lots of UDP traffic.
>>>>> > There are a lot (about 70) of UDP receive processes, each handling
>>>>> > about 1 to 10 megabits of multicast traffic, with {active, N}.
>>>>>
>>>>> Whenever someone has UDP receive performance problems, one has to ask
>>>>> whether they have seen the Erlang socket option {read_packets,N}.
>>>>>
>>>>> See http://erlang.org/doc/man/inet.html#setopts-2
>>>>>
>>>>> >
>>>>> > msacc summary on my OSX laptop, built from OTP master
>>>>> > c30309e799212b080c39ee2f91af3f9a0383d767 (Apr 19):
>>>>> >
>>>>> >
>>>>> >     Thread     alloc       aux       bif busy_wait  check_io
>>>>> >  scheduler    30.02%     0.92%     2.86%    24.66%     0.01%
>>>>> >
>>>>> >     Thread  emulator       ets        gc   gc_full       nif
>>>>> >  scheduler     9.61%     0.03%     1.25%     0.20%     0.13%
>>>>> >
>>>>> >     Thread     other      port      send     sleep    timers
>>>>> >  scheduler     2.34%     9.33%     0.41%    17.78%     0.44%
>>>>> >
>>>>> >
>>>>> > Linux production server behaves the same way (we do not have
>>>>> > extended msacc there yet, so most of alloc goes to port).
>>>>> >
>>>>> > perf top (on Linux production) says there's a lot of unaligned
>>>>> > memmove:
>>>>> >
>>>>> >   69.76%  libc-2.24.so        [.] __memmove_sse2_unaligned_erms
>>>>> >    6.13%  beam.smp            [.] process_main
>>>>> >    2.02%  beam.smp            [.] erts_schedule
>>>>> >    0.87%  [kernel]            [k] copy_user_enhanced_fast_string
>>>>> >
>>>>> >
>>>>> > I'll try to make a minimal example for this.
>>>>> > Maybe there are simple recommendations on optimizing this kind of
>>>>> > load?
>>>>> >
>>>>> > --
>>>>> > Danil Zagoskin | z@REDACTED
>>>>>
>>>>> > _______________________________________________
>>>>> > erlang-questions mailing list
>>>>> > erlang-questions@REDACTED
>>>>> > http://erlang.org/mailman/listinfo/erlang-questions
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> / Raimo Niskanen, Erlang/OTP, Ericsson AB
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Danil Zagoskin | z@REDACTED
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Danil Zagoskin | z@REDACTED
>>>
>>>
>>>
>>
>
>
> --
> Danil Zagoskin | z@REDACTED
>