[erlang-questions] performance drop in R15A?

Tomas Abrahamsson tomas.abrahamsson@REDACTED
Thu Apr 12 16:44:48 CEST 2012




> Hi Nico,
> According to erts documetation, since OTP R15B, the virtual machine
> doesn't bind the schedulers to logical processors.
> I don't know if it's the reason of your performance drop but it might
> worst looking around the +sbt option.

There's also the +scl true|false flag, to "[e]nable or
disable scheduler compaction of load", that's new in
R15, iirc, but the +sbt option may have more impact.

With the particular workload we previously had, when we
continued to measure more, it turned out that R15B was
actually faster than R14B04 on machines with more CPUs:

We had one machine with 2 Xeon X5660 CPUs (24
hyperthreads in total) where R15B was faster.  We had
another machine with 2 Xeon E5540 CPUs (16 HTs in
total) where I think R15B and R14B04 were roughly on
par with each other.

We also did experiment a bit with the +rg ReaderGroupsLimit
option. I think we did set it to be the same as the
number of schedulers (online).

BRs
Tomas

>> is there any news on this?
>> Our internal test server has a new additional configuration which compiles
>> the
>> latest otp from github once per day and tests its performance with the
>> mentioned benchmarks.
>> The resulting graphs for R14B04, R15B and latest git checkout are
>> attached. As
>> you can see - nothing has changed since then so I suppose it is still
>> present
>> in R15B01 :(
>> 
>> I don't know how to debug this any further. If there is anything I could
>> do to
>> help you identify the issue, please tell me.
>> 
>> 
>> Regards
>> Nico Kruber
>> 
>> On Wednesday 14 December 2011 17:41:33 Nico Kruber wrote:
>>> ok, I did some profiling and made some optimisations but the differences
>>> between R15B (git checkout from master just now) and R14B04 are still
>>> the
>>> same:
>>> 
>>> First the results of the calls to bench:quorum_read(10, 50000) (best of
>>> three) ------------------------------------------------------------
>>> 
>>> |   R14B04   |   R15B     | time surplus
>>> 
>>> ------------------------------------------------------------
>>> timer:tc, smp8:8   |    22484619|    26641737| +18%
>>> timer:tc, smp4:4   |    34009785|    35963489| + 6%
>>> timer:tc, smp2:2   |    32863428|    34102038| + 4%
>>> timer:tc, smpoff   |    53805419|    54214795| + 1%
>>> ------------------------------------------------------------
>>> 
>>> It seems there is some lock-contention or some other problem with smp in
>>> our
>>> case which hasn't been there before.
>>> I also observed that the results using R15B vary quite much, especially
>>> with
>>> smp8:8 (see below for an example). R14B04 is more stable among different
>>> test runs.
>>> But still, even without smp, R15B is slower than R14B04 :(
>>> 
>>> Lukas suggested to run fprof to a bit deeper and maybe find the function
>>> taking longer in R15 than in R14B04.
>>> Although it was helpful in finding some hot spots, it did however not
>>> help
>>> with this problem. Following are the totals reported by fprof
>>> (if you like, I can send the complete logs - didn't want to send too
>>> much to
>>> the mailing list though).
>>> 
>>> ------------------------------------------------------------
>>> 
>>> |   R14B04   |   R15B     | time surplus
>>> 
>>> ------------------------------------------------------------
>>> fprof, CNT, smp8:8 |     7723796|     7722635| n/a
>>> fprof, ACC, smp8:8 |       98239|       85477| -13%
>>> fprof, OWN, smp8:8 |      300534|      264018| -12%
>>> ------------------------------------------------------------
>>> 
>>> As you can see, according to fprof, R15B should be faster (but it
>>> isn't!).
>>> I assume that some function calls are not reported by fprof
>>> (e.g. I did not see erlang:length/1).
>>> Is there some way to include them again in order to analyse further?
>>> Maybe suspend or garbage_collect are not in ACC or OWN?
>>> But even here, R15B wastes less time as you can see in the table below.
>>> ----------------------------------------------------------
>>> 
>>> |   R14B04   |   R15B     |
>>> 
>>> ----------------------------------------------------------
>>> fprof, suspend, smp8:8         |  9.8421e+06| 8.51855e+06|
>>> fprof, garbage_collect, smp8:8 |        1315|         794|
>>> ----------------------------------------------------------
>>> 
>>> We do rely quite much on the process dictionary (erlang:put and
>>> erlang:get)
>>> which you might see as an "non-typical usage pattern" for erlang.
>>> However, if I switch to ets instead, performance of
>>> bench:quorum_read(10,
>>> 50000) for R15B does not change and for R14B04 decreases by about 7%.
>>> 
>>> Everything else I would consider "typical use" - or at least I'm not
>>> aware
>>> of anything untypical.
>>> 
>>> I hope, someone can help to further analyse the problem. I did
>>> everything I
>>> could think of and are stuck now.
>>> Is other real-world software also affected? E.g. testing couchdb would
>>> be
>>> nice.
>>> 
>>> 
>>> Regards
>>> Nico Kruber
>>> 
>>> 
>>> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>>> how to re-produce the values
>>> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>>> 
>>> I'm using scalaris from http://code.google.com/p/scalaris/
>>> ./configure && make
>>> ================ R15B ===============
>>> --------------------------------
>>> ./bin/firstnode.sh
>>> Erlang R15B (erts-5.9) [source] [64-bit] [smp:8:8] [async-threads:0]
>>> [hipe]
>>> [kernel-poll:false]
>>> 
>>> timer:tc(bench, quorum_read, [10, 50000]).
>>> {33764932,ok} % 1st try
>>> {26641737,ok} % 2nd try
>>> {27575975,ok} % 3rd try
>>> --------------------------------
>>> ./bin/firstnode.sh "+S4:4"
>>> Erlang R15B (erts-5.9) [source] [64-bit] [smp:4:4] [async-threads:0]
>>> [hipe]
>>> [kernel-poll:false]
>>> 
>>> timer:tc(bench, quorum_read, [10, 50000]).
>>> {35963489,ok} % 1st try
>>> {36332156,ok} % 2nd try
>>> {37295819,ok} % 3rd try
>>> --------------------------------
>>> ./bin/firstnode.sh "+S2:2"
>>> Erlang R15B (erts-5.9) [source] [64-bit] [smp:2:2] [async-threads:0]
>>> [hipe]
>>> [kernel-poll:false]
>>> 
>>> timer:tc(bench, quorum_read, [10, 50000]).
>>> {34102038,ok} % 1st try
>>> {34379052,ok} % 2nd try
>>> {34416173,ok} % 3rd try
>>> --------------------------------
>>> ./bin/firstnode.sh "-smp disable"
>>> Erlang R15B (erts-5.9) [source] [64-bit] [async-threads:0] [hipe]
>>> [kernel-poll:false]
>>> 
>>> timer:tc(bench, quorum_read, [10, 50000]).
>>> {54214795,ok} % 1st try
>>> {55210947,ok} % 2nd try
>>> {54537194,ok} % 3rd try
>>> 
>>> ================ R14B04 ===============
>>> --------------------------------
>>> ./bin/firstnode.sh
>>> Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:8:8] [rq:8]
>>> [async-threads:0] [hipe] [kernel-poll:false]
>>> 
>>> timer:tc(bench, quorum_read, [10, 50000]).
>>> {22598819,ok} % 1st try
>>> {22706185,ok} % 2nd try
>>> {22484619,ok} % 3rd try
>>> --------------------------------
>>> ./bin/firstnode.sh "+S4:4"
>>> Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:4:4] [rq:4]
>>> [async-threads:0] [hipe] [kernel-poll:false]
>>> 
>>> timer:tc(bench, quorum_read, [10, 50000]).
>>> {35482368,ok} % 1st try
>>> {34009785,ok} % 2nd try
>>> {34527203,ok} % 3rd try
>>> --------------------------------
>>> ./bin/firstnode.sh "+S2:2"
>>> Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:2:2] [rq:2]
>>> [async-threads:0] [hipe] [kernel-poll:false]
>>> 
>>> timer:tc(bench, quorum_read, [10, 50000]).
>>> {33590356,ok} % 1st try
>>> {33192033,ok} % 2nd try
>>> {32863428,ok} % 3rd try
>>> --------------------------------
>>> ./bin/firstnode.sh "-smp disable"
>>> Erlang R14B04 (erts-5.8.5) [source] [64-bit] [rq:1] [async-threads:0]
>>> [hipe]
>>> [kernel-poll:false]
>>> 
>>> timer:tc(bench, quorum_read, [10, 50000]).
>>> {53805419,ok} % 1st try
>>> {54054538,ok} % 2nd try
>>> {54066521,ok} % 3rd try
>>> 
>>> 
>>> fprof runs:
>>> ./bin/firstnode.sh
>>> 
>>> fprof:trace([start, {procs, processes()}]).
>>> bench:quorum_read(10, 1000).
>>> fprof:trace(stop).
>>> fprof:profile().
>>> fprof:analyse([{dest, "test.R15B.fprof.log"}]). % or R14B04 respectively
>>> 
>>> > cat test.R14B04.fprof.log |grep suspend|cut -d',' -f3 | awk '{s+=$1}
>>> END
>>> > {print s}'
>>> 9.8421e+06
>>> 
>>> > cat test.R15B.fprof.log |grep suspend|cut -d',' -f3 | awk '{s+=$1} END
>>> > {print s}'
>>> 8.51855e+06
>>> 
>>> > cat test.R14B04.fprof.log |grep garbage_collect|cut -d',' -f3 | awk
>>> > '{s+=$1} END {print s}'
>>> 1315.33
>>> 
>>> > cat test.R15B.fprof.log |grep garbage_collect|cut -d',' -f3 | awk
>>> '{s+=$1}
>>> > END {print s}'
>>> 794.956
>>> 
>>> On Friday 09 December 2011 12:09:56 Kenneth Lundin wrote:
>>> > On 12/5/11, Nico Kruber <kruber@REDACTED> wrote:
>>> > > Is there any news on this?
>>> > >
>>> > > The performance drop of the recent R15 checkout compared to R14B04
>>> in
>>> > > our
>>> > > application is now at _30%_ compared to a drop of "only" 15% in
>>> R15A!
>>> >
>>> > When working with R15B we are of course performing a number of
>>> > bencmarks to keep track of the performance.
>>> >
>>> > Whe always strive to improve general performance in each release AND
>>> > OUR BENCHMARKS INDICATE THAT R15B IS FASTER IN GENERAL THAN R14B04.
>>> >
>>> > It is however always possible that some specific operations or
>>> > combinations of operations get slower while others are optimized.
>>> >
>>> > Regarding the findings for this particular application we think the
>>> > reason for slowdown is a very frequent but non typical usage pattern
>>> > that now becomes visible as a performance bottleneck.
>>> >
>>> >
>>> > / Kenneth, Erlang/OTP
>>> Ericsson_______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions
>> 

> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions




More information about the erlang-questions mailing list