[erlang-questions] performance drop in R15A?

Mon Apr 16 16:03:49 CEST 2012

Thank you for the tips, indeed, by using the +sbt option I was able to improve 
our benchmark results to a certain extend - not to the full R14B04 value 
though :(
I achieved the best results together with the +swt low and very_low option, so 
I would probably go with "low". I also experienced with +rg and +scl but these 
didn't influence the results too much.

I also attached the cpuinfo from the test system (basically an 8-core Opteron 
2376 with 8GB RAM).

using Scalaris "make bench":
           wall clock throughput // latency
R14B04:                         999  //  40
 "+swt low"                    1097  //  36
 "+swt very_low"               1101  //  36
R15B: (same as "+sbt u")        679  //  59
 "+scl true"                    682  //  59
 "+sbt db"                      836  //  48
 "+sbt ns"                      828  //  48
 "+sbt ts"                      840  //  48
 "+sbt ps"                      816  //  49
 "+sbt s"                       822  //  49
 "+sbt nnts"                    825  //  48
 "+sbt nnps"                    834  //  48
 "+sbt tnnps"                   830  //  48
 "+scl true +sbt db"            827  //  48
 "+scl true +sbt db +swt low"  1029  //  39
 "+sbt db +swt low"            1019  //  39
 "+sbt db +swt very_low"       1038  //  39
 "+swt low"                     764  //  52
 "+sbt db +swt low +rg 4"      1008  //  40
 "+sbt db +swt low +rg 8"      1042  //  38
 "+sbt db +swt low +rg 16"     1011  //  40

Recall that this creates 4 erlang VMs with the same (given) start-up options.

On Thursday 12 April 2012 16:44:48 Tomas Abrahamsson wrote:
> > Hi Nico,
> > According to erts documetation, since OTP R15B, the virtual machine
> > doesn't bind the schedulers to logical processors.
> > I don't know if it's the reason of your performance drop but it might
> > worst looking around the +sbt option.
> 
> There's also the +scl true|false flag, to "[e]nable or
> disable scheduler compaction of load", that's new in
> R15, iirc, but the +sbt option may have more impact.
> 
> With the particular workload we previously had, when we
> continued to measure more, it turned out that R15B was
> actually faster than R14B04 on machines with more CPUs:
> 
> We had one machine with 2 Xeon X5660 CPUs (24
> hyperthreads in total) where R15B was faster.  We had
> another machine with 2 Xeon E5540 CPUs (16 HTs in
> total) where I think R15B and R14B04 were roughly on
> par with each other.
> 
> We also did experiment a bit with the +rg ReaderGroupsLimit
> option. I think we did set it to be the same as the
> number of schedulers (online).
> 
> BRs
> Tomas
> 
> >> is there any news on this?
> >> Our internal test server has a new additional configuration which
> >> compiles
> >> the
> >> latest otp from github once per day and tests its performance with the
> >> mentioned benchmarks.
> >> The resulting graphs for R14B04, R15B and latest git checkout are
> >> attached. As
> >> you can see - nothing has changed since then so I suppose it is still
> >> present
> >> in R15B01 :(
> >> 
> >> I don't know how to debug this any further. If there is anything I could
> >> do to
> >> help you identify the issue, please tell me.
> >> 
> >> 
> >> Regards
> >> Nico Kruber
> >> 
> >> On Wednesday 14 December 2011 17:41:33 Nico Kruber wrote:
> >>> ok, I did some profiling and made some optimisations but the differences
> >>> between R15B (git checkout from master just now) and R14B04 are still
> >>> the
> >>> same:
> >>> 
> >>> First the results of the calls to bench:quorum_read(10, 50000) (best of
> >>> three) ------------------------------------------------------------
> >>> 
> >>> |   R14B04   |   R15B     | time surplus
> >>> 
> >>> ------------------------------------------------------------
> >>> timer:tc, smp8:8   |    22484619|    26641737| +18%
> >>> timer:tc, smp4:4   |    34009785|    35963489| + 6%
> >>> timer:tc, smp2:2   |    32863428|    34102038| + 4%
> >>> timer:tc, smpoff   |    53805419|    54214795| + 1%
> >>> ------------------------------------------------------------
> >>> 
> >>> It seems there is some lock-contention or some other problem with smp in
> >>> our
> >>> case which hasn't been there before.
> >>> I also observed that the results using R15B vary quite much, especially
> >>> with
> >>> smp8:8 (see below for an example). R14B04 is more stable among different
> >>> test runs.
> >>> But still, even without smp, R15B is slower than R14B04 :(
> >>> 
> >>> Lukas suggested to run fprof to a bit deeper and maybe find the function
> >>> taking longer in R15 than in R14B04.
> >>> Although it was helpful in finding some hot spots, it did however not
> >>> help
> >>> with this problem. Following are the totals reported by fprof
> >>> (if you like, I can send the complete logs - didn't want to send too
> >>> much to
> >>> the mailing list though).
> >>> 
> >>> ------------------------------------------------------------
> >>> 
> >>> |   R14B04   |   R15B     | time surplus
> >>> 
> >>> ------------------------------------------------------------
> >>> fprof, CNT, smp8:8 |     7723796|     7722635| n/a
> >>> fprof, ACC, smp8:8 |       98239|       85477| -13%
> >>> fprof, OWN, smp8:8 |      300534|      264018| -12%
> >>> ------------------------------------------------------------
> >>> 
> >>> As you can see, according to fprof, R15B should be faster (but it
> >>> isn't!).
> >>> I assume that some function calls are not reported by fprof
> >>> (e.g. I did not see erlang:length/1).
> >>> Is there some way to include them again in order to analyse further?
> >>> Maybe suspend or garbage_collect are not in ACC or OWN?
> >>> But even here, R15B wastes less time as you can see in the table below.
> >>> ----------------------------------------------------------
> >>> 
> >>> |   R14B04   |   R15B     |
> >>> 
> >>> ----------------------------------------------------------
> >>> fprof, suspend, smp8:8         |  9.8421e+06| 8.51855e+06|
> >>> fprof, garbage_collect, smp8:8 |        1315|         794|
> >>> ----------------------------------------------------------
> >>> 
> >>> We do rely quite much on the process dictionary (erlang:put and
> >>> erlang:get)
> >>> which you might see as an "non-typical usage pattern" for erlang.
> >>> However, if I switch to ets instead, performance of
> >>> bench:quorum_read(10,
> >>> 50000) for R15B does not change and for R14B04 decreases by about 7%.
> >>> 
> >>> Everything else I would consider "typical use" - or at least I'm not
> >>> aware
> >>> of anything untypical.
> >>> 
> >>> I hope, someone can help to further analyse the problem. I did
> >>> everything I
> >>> could think of and are stuck now.
> >>> Is other real-world software also affected? E.g. testing couchdb would
> >>> be
> >>> nice.
> >>> 
> >>> 
> >>> Regards
> >>> Nico Kruber
> >>> 
> >>> 
> >>> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> >>> how to re-produce the values
> >>> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> >>> 
> >>> I'm using scalaris from http://code.google.com/p/scalaris/
> >>> ./configure && make
> >>> ================ R15B ===============
> >>> --------------------------------
> >>> ./bin/firstnode.sh
> >>> Erlang R15B (erts-5.9) [source] [64-bit] [smp:8:8] [async-threads:0]
> >>> [hipe]
> >>> [kernel-poll:false]
> >>> 
> >>> timer:tc(bench, quorum_read, [10, 50000]).
> >>> {33764932,ok} % 1st try
> >>> {26641737,ok} % 2nd try
> >>> {27575975,ok} % 3rd try
> >>> --------------------------------
> >>> ./bin/firstnode.sh "+S4:4"
> >>> Erlang R15B (erts-5.9) [source] [64-bit] [smp:4:4] [async-threads:0]
> >>> [hipe]
> >>> [kernel-poll:false]
> >>> 
> >>> timer:tc(bench, quorum_read, [10, 50000]).
> >>> {35963489,ok} % 1st try
> >>> {36332156,ok} % 2nd try
> >>> {37295819,ok} % 3rd try
> >>> --------------------------------
> >>> ./bin/firstnode.sh "+S2:2"
> >>> Erlang R15B (erts-5.9) [source] [64-bit] [smp:2:2] [async-threads:0]
> >>> [hipe]
> >>> [kernel-poll:false]
> >>> 
> >>> timer:tc(bench, quorum_read, [10, 50000]).
> >>> {34102038,ok} % 1st try
> >>> {34379052,ok} % 2nd try
> >>> {34416173,ok} % 3rd try
> >>> --------------------------------
> >>> ./bin/firstnode.sh "-smp disable"
> >>> Erlang R15B (erts-5.9) [source] [64-bit] [async-threads:0] [hipe]
> >>> [kernel-poll:false]
> >>> 
> >>> timer:tc(bench, quorum_read, [10, 50000]).
> >>> {54214795,ok} % 1st try
> >>> {55210947,ok} % 2nd try
> >>> {54537194,ok} % 3rd try
> >>> 
> >>> ================ R14B04 ===============
> >>> --------------------------------
> >>> ./bin/firstnode.sh
> >>> Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:8:8] [rq:8]
> >>> [async-threads:0] [hipe] [kernel-poll:false]
> >>> 
> >>> timer:tc(bench, quorum_read, [10, 50000]).
> >>> {22598819,ok} % 1st try
> >>> {22706185,ok} % 2nd try
> >>> {22484619,ok} % 3rd try
> >>> --------------------------------
> >>> ./bin/firstnode.sh "+S4:4"
> >>> Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:4:4] [rq:4]
> >>> [async-threads:0] [hipe] [kernel-poll:false]
> >>> 
> >>> timer:tc(bench, quorum_read, [10, 50000]).
> >>> {35482368,ok} % 1st try
> >>> {34009785,ok} % 2nd try
> >>> {34527203,ok} % 3rd try
> >>> --------------------------------
> >>> ./bin/firstnode.sh "+S2:2"
> >>> Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:2:2] [rq:2]
> >>> [async-threads:0] [hipe] [kernel-poll:false]
> >>> 
> >>> timer:tc(bench, quorum_read, [10, 50000]).
> >>> {33590356,ok} % 1st try
> >>> {33192033,ok} % 2nd try
> >>> {32863428,ok} % 3rd try
> >>> --------------------------------
> >>> ./bin/firstnode.sh "-smp disable"
> >>> Erlang R14B04 (erts-5.8.5) [source] [64-bit] [rq:1] [async-threads:0]
> >>> [hipe]
> >>> [kernel-poll:false]
> >>> 
> >>> timer:tc(bench, quorum_read, [10, 50000]).
> >>> {53805419,ok} % 1st try
> >>> {54054538,ok} % 2nd try
> >>> {54066521,ok} % 3rd try
> >>> 
> >>> 
> >>> fprof runs:
> >>> ./bin/firstnode.sh
> >>> 
> >>> fprof:trace([start, {procs, processes()}]).
> >>> bench:quorum_read(10, 1000).
> >>> fprof:trace(stop).
> >>> fprof:profile().
> >>> fprof:analyse([{dest, "test.R15B.fprof.log"}]). % or R14B04 respectively
> >>> 
> >>> > cat test.R14B04.fprof.log |grep suspend|cut -d',' -f3 | awk '{s+=$1}
> >>> 
> >>> END
> >>> 
> >>> > {print s}'
> >>> 
> >>> 9.8421e+06
> >>> 
> >>> > cat test.R15B.fprof.log |grep suspend|cut -d',' -f3 | awk '{s+=$1} END
> >>> > {print s}'
> >>> 
> >>> 8.51855e+06
> >>> 
> >>> > cat test.R14B04.fprof.log |grep garbage_collect|cut -d',' -f3 | awk
> >>> > '{s+=$1} END {print s}'
> >>> 
> >>> 1315.33
> >>> 
> >>> > cat test.R15B.fprof.log |grep garbage_collect|cut -d',' -f3 | awk
> >>> 
> >>> '{s+=$1}
> >>> 
> >>> > END {print s}'
> >>> 
> >>> 794.956
> >>> 
> >>> On Friday 09 December 2011 12:09:56 Kenneth Lundin wrote:
> >>> > On 12/5/11, Nico Kruber <kruber@REDACTED> wrote:
> >>> > > Is there any news on this?
> >>> > > 
> >>> > > The performance drop of the recent R15 checkout compared to R14B04
> >>> 
> >>> in
> >>> 
> >>> > > our
> >>> > > application is now at _30%_ compared to a drop of "only" 15% in
> >>> 
> >>> R15A!
> >>> 
> >>> > When working with R15B we are of course performing a number of
> >>> > bencmarks to keep track of the performance.
> >>> > 
> >>> > Whe always strive to improve general performance in each release AND
> >>> > OUR BENCHMARKS INDICATE THAT R15B IS FASTER IN GENERAL THAN R14B04.
> >>> > 
> >>> > It is however always possible that some specific operations or
> >>> > combinations of operations get slower while others are optimized.
> >>> > 
> >>> > Regarding the findings for this particular application we think the
> >>> > reason for slowdown is a very frequent but non typical usage pattern
> >>> > that now becomes visible as a performance bottleneck.
> >>> > 
> >>> > 
> >>> > / Kenneth, Erlang/OTP
> >>> 
> >>> Ericsson_______________________________________________
-------------- next part --------------
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 16
model		: 4
model name	: Quad-Core AMD Opteron(tm) Processor 2376
stepping	: 2
cpu MHz		: 800.000
cache size	: 512 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt
bogomips	: 4599.91
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor	: 1
vendor_id	: AuthenticAMD
cpu family	: 16
model		: 4
model name	: Quad-Core AMD Opteron(tm) Processor 2376
stepping	: 2
cpu MHz		: 800.000
cache size	: 512 KB
physical id	: 0
siblings	: 4
core id		: 1
cpu cores	: 4
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt
bogomips	: 4599.99
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor	: 2
vendor_id	: AuthenticAMD
cpu family	: 16
model		: 4
model name	: Quad-Core AMD Opteron(tm) Processor 2376
stepping	: 2
cpu MHz		: 800.000
cache size	: 512 KB
physical id	: 0
siblings	: 4
core id		: 2
cpu cores	: 4
apicid		: 2
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt
bogomips	: 4600.00
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor	: 3
vendor_id	: AuthenticAMD
cpu family	: 16
model		: 4
model name	: Quad-Core AMD Opteron(tm) Processor 2376
stepping	: 2
cpu MHz		: 800.000
cache size	: 512 KB
physical id	: 0
siblings	: 4
core id		: 3
cpu cores	: 4
apicid		: 3
initial apicid	: 3
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt
bogomips	: 4599.98
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor	: 4
vendor_id	: AuthenticAMD
cpu family	: 16
model		: 4
model name	: Quad-Core AMD Opteron(tm) Processor 2376
stepping	: 2
cpu MHz		: 800.000
cache size	: 512 KB
physical id	: 1
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 4
initial apicid	: 4
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt
bogomips	: 4600.00
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor	: 5
vendor_id	: AuthenticAMD
cpu family	: 16
model		: 4
model name	: Quad-Core AMD Opteron(tm) Processor 2376
stepping	: 2
cpu MHz		: 800.000
cache size	: 512 KB
physical id	: 1
siblings	: 4
core id		: 1
cpu cores	: 4
apicid		: 5
initial apicid	: 5
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt
bogomips	: 4600.01
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor	: 6
vendor_id	: AuthenticAMD
cpu family	: 16
model		: 4
model name	: Quad-Core AMD Opteron(tm) Processor 2376
stepping	: 2
cpu MHz		: 800.000
cache size	: 512 KB
physical id	: 1
siblings	: 4
core id		: 2
cpu cores	: 4
apicid		: 6
initial apicid	: 6
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt
bogomips	: 4600.01
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor	: 7
vendor_id	: AuthenticAMD
cpu family	: 16
model		: 4
model name	: Quad-Core AMD Opteron(tm) Processor 2376
stepping	: 2
cpu MHz		: 800.000
cache size	: 512 KB
physical id	: 1
siblings	: 4
core id		: 3
cpu cores	: 4
apicid		: 7
initial apicid	: 7
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt
bogomips	: 4600.00
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate