[erlang-questions] performance drop in R15A?

Mon Apr 16 16:08:51 CEST 2012

Thanks Nico for sharing such useful info.

Regards,
Zabrane

On Apr 16, 2012, at 4:03 PM, Nico Kruber wrote:

> Thank you for the tips, indeed, by using the +sbt option I was able to improve 
> our benchmark results to a certain extend - not to the full R14B04 value 
> though :(
> I achieved the best results together with the +swt low and very_low option, so 
> I would probably go with "low". I also experienced with +rg and +scl but these 
> didn't influence the results too much.
> 
> I also attached the cpuinfo from the test system (basically an 8-core Opteron 
> 2376 with 8GB RAM).
> 
> using Scalaris "make bench":
>           wall clock throughput // latency
> R14B04:                         999  //  40
> "+swt low"                    1097  //  36
> "+swt very_low"               1101  //  36
> R15B: (same as "+sbt u")        679  //  59
> "+scl true"                    682  //  59
> "+sbt db"                      836  //  48
> "+sbt ns"                      828  //  48
> "+sbt ts"                      840  //  48
> "+sbt ps"                      816  //  49
> "+sbt s"                       822  //  49
> "+sbt nnts"                    825  //  48
> "+sbt nnps"                    834  //  48
> "+sbt tnnps"                   830  //  48
> "+scl true +sbt db"            827  //  48
> "+scl true +sbt db +swt low"  1029  //  39
> "+sbt db +swt low"            1019  //  39
> "+sbt db +swt very_low"       1038  //  39
> "+swt low"                     764  //  52
> "+sbt db +swt low +rg 4"      1008  //  40
> "+sbt db +swt low +rg 8"      1042  //  38
> "+sbt db +swt low +rg 16"     1011  //  40
> 
> Recall that this creates 4 erlang VMs with the same (given) start-up options.
> 
> 
> On Thursday 12 April 2012 16:44:48 Tomas Abrahamsson wrote:
>>> Hi Nico,
>>> According to erts documetation, since OTP R15B, the virtual machine
>>> doesn't bind the schedulers to logical processors.
>>> I don't know if it's the reason of your performance drop but it might
>>> worst looking around the +sbt option.
>> 
>> There's also the +scl true|false flag, to "[e]nable or
>> disable scheduler compaction of load", that's new in
>> R15, iirc, but the +sbt option may have more impact.
>> 
>> With the particular workload we previously had, when we
>> continued to measure more, it turned out that R15B was
>> actually faster than R14B04 on machines with more CPUs:
>> 
>> We had one machine with 2 Xeon X5660 CPUs (24
>> hyperthreads in total) where R15B was faster.  We had
>> another machine with 2 Xeon E5540 CPUs (16 HTs in
>> total) where I think R15B and R14B04 were roughly on
>> par with each other.
>> 
>> We also did experiment a bit with the +rg ReaderGroupsLimit
>> option. I think we did set it to be the same as the
>> number of schedulers (online).
>> 
>> BRs
>> Tomas
>> 
>>>> is there any news on this?
>>>> Our internal test server has a new additional configuration which
>>>> compiles
>>>> the
>>>> latest otp from github once per day and tests its performance with the
>>>> mentioned benchmarks.
>>>> The resulting graphs for R14B04, R15B and latest git checkout are
>>>> attached. As
>>>> you can see - nothing has changed since then so I suppose it is still
>>>> present
>>>> in R15B01 :(
>>>> 
>>>> I don't know how to debug this any further. If there is anything I could
>>>> do to
>>>> help you identify the issue, please tell me.
>>>> 
>>>> 
>>>> Regards
>>>> Nico Kruber
>>>> 
>>>> On Wednesday 14 December 2011 17:41:33 Nico Kruber wrote:
>>>>> ok, I did some profiling and made some optimisations but the differences
>>>>> between R15B (git checkout from master just now) and R14B04 are still
>>>>> the
>>>>> same:
>>>>> 
>>>>> First the results of the calls to bench:quorum_read(10, 50000) (best of
>>>>> three) ------------------------------------------------------------
>>>>> 
>>>>> |   R14B04   |   R15B     | time surplus
>>>>> 
>>>>> ------------------------------------------------------------
>>>>> timer:tc, smp8:8   |    22484619|    26641737| +18%
>>>>> timer:tc, smp4:4   |    34009785|    35963489| + 6%
>>>>> timer:tc, smp2:2   |    32863428|    34102038| + 4%
>>>>> timer:tc, smpoff   |    53805419|    54214795| + 1%
>>>>> ------------------------------------------------------------
>>>>> 
>>>>> It seems there is some lock-contention or some other problem with smp in
>>>>> our
>>>>> case which hasn't been there before.
>>>>> I also observed that the results using R15B vary quite much, especially
>>>>> with
>>>>> smp8:8 (see below for an example). R14B04 is more stable among different
>>>>> test runs.
>>>>> But still, even without smp, R15B is slower than R14B04 :(
>>>>> 
>>>>> Lukas suggested to run fprof to a bit deeper and maybe find the function
>>>>> taking longer in R15 than in R14B04.
>>>>> Although it was helpful in finding some hot spots, it did however not
>>>>> help
>>>>> with this problem. Following are the totals reported by fprof
>>>>> (if you like, I can send the complete logs - didn't want to send too
>>>>> much to
>>>>> the mailing list though).
>>>>> 
>>>>> ------------------------------------------------------------
>>>>> 
>>>>> |   R14B04   |   R15B     | time surplus
>>>>> 
>>>>> ------------------------------------------------------------
>>>>> fprof, CNT, smp8:8 |     7723796|     7722635| n/a
>>>>> fprof, ACC, smp8:8 |       98239|       85477| -13%
>>>>> fprof, OWN, smp8:8 |      300534|      264018| -12%
>>>>> ------------------------------------------------------------
>>>>> 
>>>>> As you can see, according to fprof, R15B should be faster (but it
>>>>> isn't!).
>>>>> I assume that some function calls are not reported by fprof
>>>>> (e.g. I did not see erlang:length/1).
>>>>> Is there some way to include them again in order to analyse further?
>>>>> Maybe suspend or garbage_collect are not in ACC or OWN?
>>>>> But even here, R15B wastes less time as you can see in the table below.
>>>>> ----------------------------------------------------------
>>>>> 
>>>>> |   R14B04   |   R15B     |
>>>>> 
>>>>> ----------------------------------------------------------
>>>>> fprof, suspend, smp8:8         |  9.8421e+06| 8.51855e+06|
>>>>> fprof, garbage_collect, smp8:8 |        1315|         794|
>>>>> ----------------------------------------------------------
>>>>> 
>>>>> We do rely quite much on the process dictionary (erlang:put and
>>>>> erlang:get)
>>>>> which you might see as an "non-typical usage pattern" for erlang.
>>>>> However, if I switch to ets instead, performance of
>>>>> bench:quorum_read(10,
>>>>> 50000) for R15B does not change and for R14B04 decreases by about 7%.
>>>>> 
>>>>> Everything else I would consider "typical use" - or at least I'm not
>>>>> aware
>>>>> of anything untypical.
>>>>> 
>>>>> I hope, someone can help to further analyse the problem. I did
>>>>> everything I
>>>>> could think of and are stuck now.
>>>>> Is other real-world software also affected? E.g. testing couchdb would
>>>>> be
>>>>> nice.
>>>>> 
>>>>> 
>>>>> Regards
>>>>> Nico Kruber
>>>>> 
>>>>> 
>>>>> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>>>>> how to re-produce the values
>>>>> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>>>>> 
>>>>> I'm using scalaris from http://code.google.com/p/scalaris/
>>>>> ./configure && make
>>>>> ================ R15B ===============
>>>>> --------------------------------
>>>>> ./bin/firstnode.sh
>>>>> Erlang R15B (erts-5.9) [source] [64-bit] [smp:8:8] [async-threads:0]
>>>>> [hipe]
>>>>> [kernel-poll:false]
>>>>> 
>>>>> timer:tc(bench, quorum_read, [10, 50000]).
>>>>> {33764932,ok} % 1st try
>>>>> {26641737,ok} % 2nd try
>>>>> {27575975,ok} % 3rd try
>>>>> --------------------------------
>>>>> ./bin/firstnode.sh "+S4:4"
>>>>> Erlang R15B (erts-5.9) [source] [64-bit] [smp:4:4] [async-threads:0]
>>>>> [hipe]
>>>>> [kernel-poll:false]
>>>>> 
>>>>> timer:tc(bench, quorum_read, [10, 50000]).
>>>>> {35963489,ok} % 1st try
>>>>> {36332156,ok} % 2nd try
>>>>> {37295819,ok} % 3rd try
>>>>> --------------------------------
>>>>> ./bin/firstnode.sh "+S2:2"
>>>>> Erlang R15B (erts-5.9) [source] [64-bit] [smp:2:2] [async-threads:0]
>>>>> [hipe]
>>>>> [kernel-poll:false]
>>>>> 
>>>>> timer:tc(bench, quorum_read, [10, 50000]).
>>>>> {34102038,ok} % 1st try
>>>>> {34379052,ok} % 2nd try
>>>>> {34416173,ok} % 3rd try
>>>>> --------------------------------
>>>>> ./bin/firstnode.sh "-smp disable"
>>>>> Erlang R15B (erts-5.9) [source] [64-bit] [async-threads:0] [hipe]
>>>>> [kernel-poll:false]
>>>>> 
>>>>> timer:tc(bench, quorum_read, [10, 50000]).
>>>>> {54214795,ok} % 1st try
>>>>> {55210947,ok} % 2nd try
>>>>> {54537194,ok} % 3rd try
>>>>> 
>>>>> ================ R14B04 ===============
>>>>> --------------------------------
>>>>> ./bin/firstnode.sh
>>>>> Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:8:8] [rq:8]
>>>>> [async-threads:0] [hipe] [kernel-poll:false]
>>>>> 
>>>>> timer:tc(bench, quorum_read, [10, 50000]).
>>>>> {22598819,ok} % 1st try
>>>>> {22706185,ok} % 2nd try
>>>>> {22484619,ok} % 3rd try
>>>>> --------------------------------
>>>>> ./bin/firstnode.sh "+S4:4"
>>>>> Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:4:4] [rq:4]
>>>>> [async-threads:0] [hipe] [kernel-poll:false]
>>>>> 
>>>>> timer:tc(bench, quorum_read, [10, 50000]).
>>>>> {35482368,ok} % 1st try
>>>>> {34009785,ok} % 2nd try
>>>>> {34527203,ok} % 3rd try
>>>>> --------------------------------
>>>>> ./bin/firstnode.sh "+S2:2"
>>>>> Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:2:2] [rq:2]
>>>>> [async-threads:0] [hipe] [kernel-poll:false]
>>>>> 
>>>>> timer:tc(bench, quorum_read, [10, 50000]).
>>>>> {33590356,ok} % 1st try
>>>>> {33192033,ok} % 2nd try
>>>>> {32863428,ok} % 3rd try
>>>>> --------------------------------
>>>>> ./bin/firstnode.sh "-smp disable"
>>>>> Erlang R14B04 (erts-5.8.5) [source] [64-bit] [rq:1] [async-threads:0]
>>>>> [hipe]
>>>>> [kernel-poll:false]
>>>>> 
>>>>> timer:tc(bench, quorum_read, [10, 50000]).
>>>>> {53805419,ok} % 1st try
>>>>> {54054538,ok} % 2nd try
>>>>> {54066521,ok} % 3rd try
>>>>> 
>>>>> 
>>>>> fprof runs:
>>>>> ./bin/firstnode.sh
>>>>> 
>>>>> fprof:trace([start, {procs, processes()}]).
>>>>> bench:quorum_read(10, 1000).
>>>>> fprof:trace(stop).
>>>>> fprof:profile().
>>>>> fprof:analyse([{dest, "test.R15B.fprof.log"}]). % or R14B04 respectively
>>>>> 
>>>>>> cat test.R14B04.fprof.log |grep suspend|cut -d',' -f3 | awk '{s+=$1}
>>>>> 
>>>>> END
>>>>> 
>>>>>> {print s}'
>>>>> 
>>>>> 9.8421e+06
>>>>> 
>>>>>> cat test.R15B.fprof.log |grep suspend|cut -d',' -f3 | awk '{s+=$1} END
>>>>>> {print s}'
>>>>> 
>>>>> 8.51855e+06
>>>>> 
>>>>>> cat test.R14B04.fprof.log |grep garbage_collect|cut -d',' -f3 | awk
>>>>>> '{s+=$1} END {print s}'
>>>>> 
>>>>> 1315.33
>>>>> 
>>>>>> cat test.R15B.fprof.log |grep garbage_collect|cut -d',' -f3 | awk
>>>>> 
>>>>> '{s+=$1}
>>>>> 
>>>>>> END {print s}'
>>>>> 
>>>>> 794.956
>>>>> 
>>>>> On Friday 09 December 2011 12:09:56 Kenneth Lundin wrote:
>>>>>> On 12/5/11, Nico Kruber <kruber@REDACTED> wrote:
>>>>>>> Is there any news on this?
>>>>>>> 
>>>>>>> The performance drop of the recent R15 checkout compared to R14B04
>>>>> 
>>>>> in
>>>>> 
>>>>>>> our
>>>>>>> application is now at _30%_ compared to a drop of "only" 15% in
>>>>> 
>>>>> R15A!
>>>>> 
>>>>>> When working with R15B we are of course performing a number of
>>>>>> bencmarks to keep track of the performance.
>>>>>> 
>>>>>> Whe always strive to improve general performance in each release AND
>>>>>> OUR BENCHMARKS INDICATE THAT R15B IS FASTER IN GENERAL THAN R14B04.
>>>>>> 
>>>>>> It is however always possible that some specific operations or
>>>>>> combinations of operations get slower while others are optimized.
>>>>>> 
>>>>>> Regarding the findings for this particular application we think the
>>>>>> reason for slowdown is a very frequent but non typical usage pattern
>>>>>> that now becomes visible as a performance bottleneck.
>>>>>> 
>>>>>> 
>>>>>> / Kenneth, Erlang/OTP
>>>>> 
>>>>> Ericsson_______________________________________________
> <bench.cpuinfo>_______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions