[erlang-questions] Dirty CPU schedulers stuck at zero utilization

Jonas Falkevik jonas.falkevik@REDACTED
Wed Jan 9 09:33:03 CET 2019


Hi,
This sounds like the issue I bumped into.
After some time, many of the dirty schedulers are not performing any work anymore.
Which was related to a corrupted list of sleeping dirty schedulers.

I haven’t found what triggered it in my case (maybe a GC?). For me it took approx. 1 day before it happened.

Doing Multi Scheduler Block can trigger the problem with the list of sleeping dirty schedulers which gets corrupted.
The sleepers list can contain only one or some schedulers. That list is later used to wake up schedulers when there is work to do.
For each enqueued dirty-cpu or dirty-io job, it wakes the first scheduler in the sleepers list.
But if this list only point to a few schedulers only those are woken up to pick jobs from the queue.

Since fixing that specific problem I have not seen the collapse of dirty schedulers anymore.

It would be great if you can test and see if it fixes your problem as well, https://github.com/erlang/otp/pull/2027
Any feedback on the bug fix is appreciated.

/Jonas

> On 8 Jan 2019, at 17:13, Jesse Stimpson <jstimpson@REDACTED> wrote:
> 
> Hello,
> 
> We run a large scale WebRTC-based VOIP application, and currently we are evaluating whether or not dirty CPU schedulers can help us get more consistent performance with some critical NIFs. We've created a test that mimics our production load and have run it on AWS EC2 instances of various sizes. When dirty CPU schedulers are working as expected, we do see measurable improvement in performance, but unfortunately we have seen that they can drop to 0 utilization. They do not recover from this state.
> 
> The timings from 2 of our 12 NIFs are in the vicinity of the 1 msec rule of thumb, so we have experimented with them both on and off dirty CPU schedulers. We've left the other NIFs on the traditional schedulers. More detailed NIF timings can be found below.
> 
> At the end of this message is the concerning data (from recon:scheduler_usage/1).We have schedulers 1-36 as the traditional schedulers bound to the first hardware thread of each core, and schedulers 37-72 are dirty CPU schedulers. Based on mpstat it appears the dirty CPU schedulers are not bound to threads. After about 10 minutes of running the load test, schedulers 42, 43, 44, 61, 63, 64, 66, 68, 72 went to at 0 utilization and stayed there indefinitely. As time goes on, more dirty CPU schedulers enter this 0 utilization state. Schedulers 1-36 do not exhibit this behaviour. We have reproduced this consistently on other EC2 c5 instances of various sizes. 
> 
> We have tried various beam adjustments, but we have seen the same behaviour each time. Some examples of things we have tried:
> +sbwt very_long +sbwtdcpu very_long +swt very_low +swtdcpu very_low
> +sfwi 500
> 
> Is there anything else that we can try to avoid this 0-utilization issue? We are happy to provide more data if needed.
> 
> NIF: webrtc_acm_add_data_10ms 
> median: 606 usec
> 90 percentile: 1230 usec
> 99 percentile: 2285 usec
> 99.9 percentile: 3223 usec
> 
> NIF: webrtc_acm_playout_data_10ms
> median: 98 usec
> 90 percentile: 371 usec
> 99 percentile: 410 usec
> 99.9 percentile: 449 usec
> 
> EC2 instance: c5.18xlarge
> 
> OS: Linux 4.4.0-1073-aws #83-Ubuntu SMP Sat Nov 17 00:26:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> 
> beam flags: +P 4194304 +sbt db +SP 50:50 +sub true +A 128 +K true 
> 
> Erlang banner: Erlang/OTP 21 [erts-10.2] [source] [64-bit] [smp:36:36] [ds:36:36:10] [async-threads:128] [hipe]
> 
> cpu_topology:
> [{node,[{processor,[{core,[{thread,{logical,0}},
>                            {thread,{logical,36}}]},
>                     {core,[{thread,{logical,1}},{thread,{logical,37}}]},
>                     {core,[{thread,{logical,2}},{thread,{logical,38}}]},
>                     {core,[{thread,{logical,3}},{thread,{logical,39}}]},
>                     {core,[{thread,{logical,4}},{thread,{logical,40}}]},
>                     {core,[{thread,{logical,5}},{thread,{logical,41}}]},
>                     {core,[{thread,{logical,6}},{thread,{logical,42}}]},
>                     {core,[{thread,{logical,7}},{thread,{logical,43}}]},
>                     {core,[{thread,{logical,8}},{thread,{logical,44}}]},
>                     {core,[{thread,{logical,9}},{thread,{logical,45}}]},
>                     {core,[{thread,{logical,10}},{thread,{logical,46}}]},
>                     {core,[{thread,{logical,11}},{thread,{logical,47}}]},
>                     {core,[{thread,{logical,12}},{thread,{logical,48}}]},
>                     {core,[{thread,{logical,13}},{thread,{logical,49}}]},
>                     {core,[{thread,{logical,14}},{thread,{logical,50}}]},
>                     {core,[{thread,{logical,15}},{thread,{logical,51}}]},
>                     {core,[{thread,{logical,16}},{thread,{logical,52}}]},
>                     {core,[{thread,{logical,17}},{thread,{logical,53}}]}]}]},
>  {node,[{processor,[{core,[{thread,{logical,18}},
>                            {thread,{logical,54}}]},
>                     {core,[{thread,{logical,19}},{thread,{logical,55}}]},
>                     {core,[{thread,{logical,20}},{thread,{logical,56}}]},
>                     {core,[{thread,{logical,21}},{thread,{logical,57}}]},
>                     {core,[{thread,{logical,22}},{thread,{logical,58}}]},
>                     {core,[{thread,{logical,23}},{thread,{logical,59}}]},
>                     {core,[{thread,{logical,24}},{thread,{logical,60}}]},
>                     {core,[{thread,{logical,25}},{thread,{logical,61}}]},
>                     {core,[{thread,{logical,26}},{thread,{logical,62}}]},
>                     {core,[{thread,{logical,27}},{thread,{logical,63}}]},
>                     {core,[{thread,{logical,28}},{thread,{logical,64}}]},
>                     {core,[{thread,{logical,29}},{thread,{logical,65}}]},
>                     {core,[{thread,{logical,30}},{thread,{logical,66}}]},
>                     {core,[{thread,{logical,31}},{thread,{logical,67}}]},
>                     {core,[{thread,{logical,32}},{thread,{logical,68}}]},
>                     {core,[{thread,{logical,33}},{thread,{logical,69}}]},
>                     {core,[{thread,{logical,34}},{thread,{logical,70}}]},
>                     {core,[{thread,{logical,35}},{thread,{logical,71}}]}]}]}]
> 
> scheduler_bindings: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,
>  23,24,25,26,27,28,29,30,31,32,33,34,35}
> 
> CPU utilization (from mpstat) during our load test:
> Linux 4.4.0-1073-aws (isw-core-x-qa-awsoh02)    01/04/2019      _x86_64_        (72 CPU)
> 
> 08:55:15 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> 08:55:15 PM  all   51.58    0.00    1.64    0.01    0.00    0.42    0.00    0.00    0.00   46.36
> 08:55:15 PM    0   62.34    0.00    2.69    0.06    0.00    0.00    0.00    0.00    0.00   34.90
> 08:55:15 PM    1   62.84    0.00    2.58    0.02    0.00    0.00    0.00    0.00    0.00   34.55
> 08:55:15 PM    2   62.04    0.00    2.52    0.00    0.00    0.00    0.00    0.00    0.00   35.44
> 08:55:15 PM    3   62.17    0.00    2.62    0.00    0.00    0.00    0.00    0.00    0.00   35.21
> 08:55:15 PM    4   62.14    0.00    2.54    0.02    0.00    0.00    0.00    0.00    0.00   35.30
> 08:55:15 PM    5   62.06    0.00    2.51    0.00    0.00    0.00    0.00    0.00    0.00   35.43
> 08:55:15 PM    6   61.93    0.00    2.58    0.01    0.00    0.00    0.02    0.00    0.00   35.46
> 08:55:15 PM    7   61.99    0.00    2.54    0.00    0.00    0.00    0.00    0.00    0.00   35.48
> 08:55:15 PM    8   62.01    0.00    2.52    0.00    0.00    0.00    0.00    0.00    0.00   35.47
> 08:55:15 PM    9   61.88    0.00    2.49    0.00    0.00    0.00    0.00    0.00    0.00   35.63
> 08:55:15 PM   10   62.18    0.00    2.48    0.00    0.00    0.00    0.00    0.00    0.00   35.34
> 08:55:15 PM   11   61.89    0.00    2.47    0.00    0.00    0.00    0.00    0.00    0.00   35.64
> 08:55:15 PM   12   61.98    0.00    2.50    0.00    0.00    0.00    0.00    0.00    0.00   35.52
> 08:55:15 PM   13   61.80    0.00    2.54    0.00    0.00    0.00    0.00    0.00    0.00   35.66
> 08:55:15 PM   14   61.96    0.00    2.56    0.00    0.00    0.00    0.00    0.00    0.00   35.48
> 08:55:15 PM   15   62.11    0.00    2.57    0.00    0.00    0.00    0.00    0.00    0.00   35.31
> 08:55:15 PM   16   62.03    0.00    2.55    0.00    0.00    0.00    0.00    0.00    0.00   35.42
> 08:55:15 PM   17   61.90    0.00    2.53    0.00    0.00    0.00    0.00    0.00    0.00   35.57
> 08:55:15 PM   18   58.36    0.00    2.41    0.09    0.00    3.72    0.00    0.00    0.00   35.43
> 08:55:15 PM   19   58.29    0.00    2.45    0.02    0.00    4.00    0.00    0.00    0.00   35.24
> 08:55:15 PM   20   58.45    0.00    2.36    0.02    0.00    3.73    0.00    0.00    0.00   35.44
> 08:55:15 PM   21   58.56    0.00    2.34    0.04    0.00    3.54    0.00    0.00    0.00   35.51
> 08:55:15 PM   22   58.12    0.00    2.46    0.00    0.00    3.84    0.00    0.00    0.00   35.58
> 08:55:15 PM   23   58.24    0.00    2.36    0.01    0.00    3.82    0.00    0.00    0.00   35.57
> 08:55:15 PM   24   58.84    0.00    2.42    0.00    0.00    3.60    0.00    0.00    0.00   35.14
> 08:55:15 PM   25   58.35    0.00    2.43    0.00    0.00    3.72    0.00    0.00    0.00   35.51
> 08:55:15 PM   26   62.09    0.00    2.47    0.00    0.00    0.00    0.00    0.00    0.00   35.45
> 08:55:15 PM   27   62.09    0.00    2.46    0.00    0.00    0.00    0.00    0.00    0.00   35.45
> 08:55:15 PM   28   62.15    0.00    2.48    0.00    0.00    0.00    0.00    0.00    0.00   35.37
> 08:55:15 PM   29   62.09    0.00    2.46    0.00    0.00    0.00    0.00    0.00    0.00   35.45
> 08:55:15 PM   30   62.74    0.00    2.45    0.00    0.00    0.00    0.00    0.00    0.00   34.81
> 08:55:15 PM   31   62.14    0.00    2.47    0.00    0.00    0.00    0.00    0.00    0.00   35.39
> 08:55:15 PM   32   62.35    0.00    2.52    0.00    0.00    0.00    0.00    0.00    0.00   35.13
> 08:55:15 PM   33   62.13    0.00    2.42    0.00    0.00    0.00    0.00    0.00    0.00   35.46
> 08:55:15 PM   34   62.08    0.00    2.47    0.00    0.00    0.00    0.00    0.00    0.00   35.45
> 08:55:15 PM   35   62.15    0.00    2.49    0.00    0.00    0.00    0.00    0.00    0.00   35.35
> 08:55:15 PM   36   43.19    0.00    0.80    0.00    0.00    0.00    0.00    0.00    0.00   56.01
> 08:55:15 PM   37   43.09    0.00    0.83    0.01    0.00    0.00    0.00    0.00    0.00   56.06
> 08:55:15 PM   38   42.59    0.00    0.78    0.01    0.00    0.00    0.00    0.00    0.00   56.63
> 08:55:15 PM   39   42.60    0.00    0.82    0.00    0.00    0.00    0.00    0.00    0.00   56.58
> 08:55:15 PM   40   42.94    0.00    0.76    0.02    0.00    0.00    0.00    0.00    0.00   56.28
> 08:55:15 PM   41   42.11    0.00    0.76    0.01    0.00    0.00    0.00    0.00    0.00   57.13
> 08:55:15 PM   42   43.26    0.00    0.85    0.01    0.00    0.00    0.00    0.00    0.00   55.88
> 08:55:15 PM   43   42.23    0.00    0.77    0.01    0.00    0.00    0.00    0.00    0.00   56.99
> 08:55:15 PM   44   42.24    0.00    0.80    0.00    0.00    0.00    0.00    0.00    0.00   56.95
> 08:55:15 PM   45   41.65    0.00    0.79    0.01    0.00    0.00    0.00    0.00    0.00   57.55
> 08:55:15 PM   46   41.76    0.00    0.77    0.01    0.00    0.00    0.00    0.00    0.00   57.46
> 08:55:15 PM   47   42.66    0.00    0.82    0.00    0.00    0.00    0.00    0.00    0.00   56.52
> 08:55:15 PM   48   41.83    0.00    0.76    0.01    0.00    0.00    0.00    0.00    0.00   57.40
> 08:55:15 PM   49   42.04    0.00    0.82    0.00    0.00    0.00    0.00    0.00    0.00   57.13
> 08:55:15 PM   50   41.55    0.00    0.78    0.01    0.00    0.00    0.00    0.00    0.00   57.67
> 08:55:15 PM   51   42.37    0.00    0.78    0.00    0.00    0.00    0.00    0.00    0.00   56.84
> 08:55:15 PM   52   41.66    0.00    0.77    0.00    0.00    0.00    0.00    0.00    0.00   57.57
> 08:55:15 PM   53   41.77    0.00    0.79    0.00    0.00    0.00    0.00    0.00    0.00   57.45
> 08:55:15 PM   54   40.33    0.00    0.88    0.01    0.00    0.00    0.00    0.00    0.00   58.79
> 08:55:15 PM   55   41.62    0.00    0.78    0.01    0.00    0.00    0.00    0.00    0.00   57.59
> 08:55:15 PM   56   41.51    0.00    0.82    0.00    0.00    0.00    0.00    0.00    0.00   57.67
> 08:55:15 PM   57   40.35    0.00    0.76    0.00    0.00    0.00    0.00    0.00    0.00   58.89
> 08:55:15 PM   58   41.92    0.00    0.79    0.01    0.00    0.00    0.00    0.00    0.00   57.28
> 08:55:15 PM   59   41.48    0.00    0.80    0.00    0.00    0.00    0.00    0.00    0.00   57.72
> 08:55:15 PM   60   40.83    0.00    0.75    0.01    0.00    0.00    0.00    0.00    0.00   58.41
> 08:55:15 PM   61   40.95    0.00    0.78    0.01    0.00    0.00    0.00    0.00    0.00   58.26
> 08:55:15 PM   62   40.85    0.00    0.80    0.01    0.00    0.00    0.00    0.00    0.00   58.35
> 08:55:15 PM   63   42.11    0.00    0.77    0.01    0.00    0.00    0.00    0.00    0.00   57.12
> 08:55:15 PM   64   42.22    0.00    0.78    0.01    0.00    0.00    0.00    0.00    0.00   56.99
> 08:55:15 PM   65   41.55    0.00    0.78    0.00    0.00    0.00    0.00    0.00    0.00   57.67
> 08:55:15 PM   66   41.68    0.00    0.81    0.00    0.00    0.00    0.00    0.00    0.00   57.51
> 08:55:15 PM   67   41.06    0.00    0.87    0.01    0.00    0.00    0.00    0.00    0.00   58.07
> 08:55:15 PM   68   41.80    0.00    0.79    0.00    0.00    0.00    0.00    0.00    0.00   57.41
> 08:55:15 PM   69   41.38    0.00    0.75    0.00    0.00    0.00    0.00    0.00    0.00   57.87
> 08:55:15 PM   70   42.17    0.00    0.75    0.00    0.00    0.00    0.00    0.00    0.00   57.08
> 08:55:15 PM   71   41.26    0.00    0.81    0.00    0.00    0.00    0.01    0.00    0.00   57.92
> 
> 
> 1> rp(recon:scheduler_usage(1000)).
> [{1,0.8581586716045433},
>  {2,0.8607597847736659},
>  {3,0.8501543449289042},
>  {4,0.8619718426658023},
>  {5,0.8660226578705742},
>  {6,0.8555763933364806},
>  {7,0.8557195972097537},
>  {8,0.8668284625654983},
>  {9,0.858588675506532},
>  {10,0.8650501369640686},
>  {11,0.8615557214275474},
>  {12,0.8523392789683996},
>  {13,0.8469329348407398},
>  {14,0.8615637534991641},
>  {15,0.8591930739986557},
>  {16,0.8578596164201188},
>  {17,0.8549696086049403},
>  {18,0.8677098230685564},
>  {19,0.8556328340054443},
>  {20,0.8642542459479063},
>  {21,0.8670655179064826},
>  {22,0.8562638308551557},
>  {23,0.8511202058055239},
>  {24,0.8697339264798696},
>  {25,0.8594823462601853},
>  {26,0.8634550008001265},
>  {27,0.8568590130172198},
>  {28,0.864025757952233},
>  {29,0.8575614770622326},
>  {30,0.8611760239143281},
>  {31,0.8589789125669829},
>  {32,0.8537754468447621},
>  {33,0.8675538912040871},
>  {34,0.8607522803519515},
>  {35,0.8639389099445753},
>  {36,0.8743759133171144},
>  {37,0.9526507413773687},
>  {38,0.948925611917577},
>  {39,0.9511904612810083},
>  {40,0.9581015421004746},
>  {41,0.9534977201467516},
>  {42,0.0},
>  {43,0.0},
>  {44,0.0},
>  {45,0.9522518499948341},
>  {46,0.9453306755503674},
>  {47,0.9586374601140047},
>  {48,0.9574611845671873},
>  {49,0.950914304057791},
>  {50,0.9582124294398642},
>  {51,0.9448074463900029},
>  {52,0.9474771999511407},
>  {53,0.9450778098401229},
>  {54,0.0},
>  {55,0.9478167911711937},
>  {56,0.9612587860764551},
>  {57,0.943468696358506},
>  {58,0.9509498642900455},
>  {59,0.9603595134495414},
>  {60,0.9517397912369205},
>  {61,0.0},
>  {62,0.9587029756970658},
>  {63,0.0},
>  {64,0.0},
>  {65,0.9526167902065421},
>  {66,0.0},
>  {67,0.9475668680715805},
>  {68,0.0},
>  {69,0.9630301804839606},
>  {70,0.9475359592879946},
>  {71,0.9541718747681374},
>  {72,0.0}]
> 
> 
> Thanks for your time,
> 
> Jesse Stimpson
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions




More information about the erlang-questions mailing list