[erlang-questions] Improve performance of IO bounded server written in Erlang via having pollset for each scheduler and bind port to scheduler together with process

Zabrane Mickael zabrane3@REDACTED
Thu Jul 12 13:01:47 CEST 2012


Hi,

Good news: with today's new patch:

old bench: ~70K rps
new bench: ~85K rps

That's 15K more rps handled now !!
We're not far from the 100K rps ;-)

Well done Wei.

Regards,
Zabrane

On Jul 12, 2012, at 11:58 AM, Wei Cao wrote:

> 2012/7/12 Zabrane Mickael <zabrane3@REDACTED>:
>> Hi Wei,
>> 
>>>> We already surpassed the 100krps on an 8-cores machine with our HTTP server
>>>> (~150K rps).
>>> 
>>> Which erlang version did you use to get ~150k rps on 8-cores machine,
>>> patched or unpatched?
>> 
>> We reach the 150K on the unpatched version.
>> 
>> 
>>> if it was measured on a unpatched erlang
>>> version, would you mind measuring it on the patched version and let me
>>> know the result?
>> 
>> I haven't yet adapted our code to run on the VM with your patch.
>> I'll keep you informed.
>> 
>>> Today I found a lock bottleneck using SystemTap, trace-cmd and lcnt;
>>> after fixing it, ehttpd on my 16-core machine can reach 325k rps.
>>> 
>>> RX packets: 326117 TX packets: 326122
>>> RX packets: 326845 TX packets: 326859
>>> RX packets: 327983 TX packets: 327996
>>> RX packets: 326651 TX packets: 326624
>>> 
>>> This is the upper limit of our Gigabit network card. I ran ab on three
>>> standalone machines to generate enough load. I posted the fix to
>>> github, have a try ~
>> 
>> That's simply fantastic. Could you share your bottleneck tracking method?
>> Any new VM patch to provide?
> 
> Through perf top, I saw that a big percentage of time was being wasted
> in the kernel's _spin_lock:
> 
>   1894.00  16.0%  _spin_lock    /usr/lib/debug/lib/modules/2.6.32-131.21.1.tb477.el6.x86_64/vmlinux
>    566.00   4.8%  process_main  /home/mingsong.cw/erlangpps/lib/erlang/erts-5.10/bin/beam.smp
> 
> After dumping _spin_lock's call stacks via trace-cmd and doing
> statistics on them, I found that most _spin_lock calls come from
> futex_wake, which is invoked by pthread mutexes.
> 
> Finally, I used lcnt to locate all lock collisions in the Erlang VM,
> and found that the timeofday mutex is the bottleneck:
> 
>        lock                   location   #tries  #collisions  collisions [%]  time [us]  duration [%]
>       -----  -------------------------  -------  -----------  --------------  ---------  ------------
>   timeofday  'beam/erl_time_sup.c':939   895234       551957         61.6551    3185159       23.5296
>   timeofday  'beam/erl_time_sup.c':971   408006       264498         64.8270    1473816       10.8874
> 
> 
> The timeofday mutex is locked each time erts_check_io is invoked to
> "sync the machine's idea of time". erts_check_io runs hundreds of
> thousands of times per second, so the mutex is taken far too often,
> which reduces performance.
> 
> I solved this problem by moving the sync operation into a standalone
> thread that runs once per millisecond.
> 
> 
> 
>> 
>> Regards,
>> Zabrane
>> 
> 
> 
> 
> -- 
> 
> Best,
> 
> Wei Cao





