[erlang-questions] Dirty CPU schedulers stuck at zero utilization
Jesse Stimpson
jstimpson@REDACTED
Wed Jan 9 22:51:57 CET 2019
Rikard and Jonas,
Thank you for your response. With the patch at OTP PR-2027, we do observe
that the dirty CPU schedulers do not drop to zero utilization, but in
addition we see that the performance of the traditional schedulers
suffers. Specifically, when we apply our load test, scheduler usage
toggles in between the following two states. I've snipped the output of
similar values for readability.
[{1,0.8933020805456553},
...
{36,0.8882068476700309},
{37,0.4932569692588096},
...
{72,0.49842881346116297}]
and
[{1,1.0},
...
{36,1.0},
{37,0.42909590032946493},
...
{72,0.4641339106204525}]
When applying this same load to the un-patched Erlang 21.2, the traditional
schedulers are utilized at about 75%. So it appears that the 2027 patch may
put undesirable pressure on the schedulers as a trade-off for consistent
performance from the dirty CPU schedulers. Are you observing a similar
response on your system?
Thanks,
Jesse
On Wed, Jan 9, 2019 at 3:33 AM Jonas Falkevik <jonas.falkevik@REDACTED>
wrote:
> Hi,
> This sounds like the issue I bumped into.
> After some time, many of the dirty schedulers are not performing any work
> anymore.
> Which was related to a corrupted list of sleeping dirty schedulers.
>
> I haven’t found what triggered it in my case (maybe a GC?). For me it took
> approx. 1 day before it happened.
>
> Doing Multi Scheduler Block can trigger the problem with the list of
> sleeping dirty schedulers which gets corrupted.
> The sleepers list can contain only one or some schedulers. That list is
> later used to wake up schedulers when there is work to do.
> For each enqueued dirty-cpu or dirty-io job, it wakes the first scheduler
> in the sleepers list.
> But if this list only point to a few schedulers only those are woken up to
> pick jobs from the queue.
>
> Since fixing that specific problem I have not seen the collapse of dirty
> schedulers anymore.
>
> It would be great if you can test and see if it fixes your problem as
> well, https://github.com/erlang/otp/pull/2027
> Any feedback on the bug fix is appreciated.
>
> /Jonas
>
> > On 8 Jan 2019, at 17:13, Jesse Stimpson <jstimpson@REDACTED>
> wrote:
> >
> > Hello,
> >
> > We run a large scale WebRTC-based VOIP application, and currently we are
> evaluating whether or not dirty CPU schedulers can help us get more
> consistent performance with some critical NIFs. We've created a test that
> mimics our production load and have run it on AWS EC2 instances of various
> sizes. When dirty CPU schedulers are working as expected, we do see
> measurable improvement in performance, but unfortunately we have seen that
> they can drop to 0 utilization. They do not recover from this state.
> >
> > The timings from 2 of our 12 NIFs are in the vicinity of the 1 msec rule
> of thumb, so we have experimented with them both on and off dirty CPU
> schedulers. We've left the other NIFs on the traditional schedulers. More
> detailed NIF timings can be found below.
> >
> > At the end of this message is the concerning data (from
> recon:scheduler_usage/1).We have schedulers 1-36 as the traditional
> schedulers bound to the first hardware thread of each core, and schedulers
> 37-72 are dirty CPU schedulers. Based on mpstat it appears the dirty CPU
> schedulers are not bound to threads. After about 10 minutes of running the
> load test, schedulers 42, 43, 44, 61, 63, 64, 66, 68, 72 went to at 0
> utilization and stayed there indefinitely. As time goes on, more dirty CPU
> schedulers enter this 0 utilization state. Schedulers 1-36 do not exhibit
> this behaviour. We have reproduced this consistently on other EC2 c5
> instances of various sizes.
> >
> > We have tried various beam adjustments, but we have seen the same
> behaviour each time. Some examples of things we have tried:
> > +sbwt very_long +sbwtdcpu very_long +swt very_low +swtdcpu very_low
> > +sfwi 500
> >
> > Is there anything else that we can try to avoid this 0-utilization
> issue? We are happy to provide more data if needed.
> >
> > NIF: webrtc_acm_add_data_10ms
> > median: 606 usec
> > 90 percentile: 1230 usec
> > 99 percentile: 2285 usec
> > 99.9 percentile: 3223 usec
> >
> > NIF: webrtc_acm_playout_data_10ms
> > median: 98 usec
> > 90 percentile: 371 usec
> > 99 percentile: 410 usec
> > 99.9 percentile: 449 usec
> >
> > EC2 instance: c5.18xlarge
> >
> > OS: Linux 4.4.0-1073-aws #83-Ubuntu SMP Sat Nov 17 00:26:27 UTC 2018
> x86_64 x86_64 x86_64 GNU/Linux
> >
> > beam flags: +P 4194304 +sbt db +SP 50:50 +sub true +A 128 +K true
> >
> > Erlang banner: Erlang/OTP 21 [erts-10.2] [source] [64-bit] [smp:36:36]
> [ds:36:36:10] [async-threads:128] [hipe]
> >
> > cpu_topology:
> > [{node,[{processor,[{core,[{thread,{logical,0}},
> > {thread,{logical,36}}]},
> > {core,[{thread,{logical,1}},{thread,{logical,37}}]},
> > {core,[{thread,{logical,2}},{thread,{logical,38}}]},
> > {core,[{thread,{logical,3}},{thread,{logical,39}}]},
> > {core,[{thread,{logical,4}},{thread,{logical,40}}]},
> > {core,[{thread,{logical,5}},{thread,{logical,41}}]},
> > {core,[{thread,{logical,6}},{thread,{logical,42}}]},
> > {core,[{thread,{logical,7}},{thread,{logical,43}}]},
> > {core,[{thread,{logical,8}},{thread,{logical,44}}]},
> > {core,[{thread,{logical,9}},{thread,{logical,45}}]},
> > {core,[{thread,{logical,10}},{thread,{logical,46}}]},
> > {core,[{thread,{logical,11}},{thread,{logical,47}}]},
> > {core,[{thread,{logical,12}},{thread,{logical,48}}]},
> > {core,[{thread,{logical,13}},{thread,{logical,49}}]},
> > {core,[{thread,{logical,14}},{thread,{logical,50}}]},
> > {core,[{thread,{logical,15}},{thread,{logical,51}}]},
> > {core,[{thread,{logical,16}},{thread,{logical,52}}]},
> >
> {core,[{thread,{logical,17}},{thread,{logical,53}}]}]}]},
> > {node,[{processor,[{core,[{thread,{logical,18}},
> > {thread,{logical,54}}]},
> > {core,[{thread,{logical,19}},{thread,{logical,55}}]},
> > {core,[{thread,{logical,20}},{thread,{logical,56}}]},
> > {core,[{thread,{logical,21}},{thread,{logical,57}}]},
> > {core,[{thread,{logical,22}},{thread,{logical,58}}]},
> > {core,[{thread,{logical,23}},{thread,{logical,59}}]},
> > {core,[{thread,{logical,24}},{thread,{logical,60}}]},
> > {core,[{thread,{logical,25}},{thread,{logical,61}}]},
> > {core,[{thread,{logical,26}},{thread,{logical,62}}]},
> > {core,[{thread,{logical,27}},{thread,{logical,63}}]},
> > {core,[{thread,{logical,28}},{thread,{logical,64}}]},
> > {core,[{thread,{logical,29}},{thread,{logical,65}}]},
> > {core,[{thread,{logical,30}},{thread,{logical,66}}]},
> > {core,[{thread,{logical,31}},{thread,{logical,67}}]},
> > {core,[{thread,{logical,32}},{thread,{logical,68}}]},
> > {core,[{thread,{logical,33}},{thread,{logical,69}}]},
> > {core,[{thread,{logical,34}},{thread,{logical,70}}]},
> >
> {core,[{thread,{logical,35}},{thread,{logical,71}}]}]}]}]
> >
> > scheduler_bindings:
> {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,
> > 23,24,25,26,27,28,29,30,31,32,33,34,35}
> >
> > CPU utilization (from mpstat) during our load test:
> > Linux 4.4.0-1073-aws (isw-core-x-qa-awsoh02) 01/04/2019
> _x86_64_ (72 CPU)
> >
> > 08:55:15 PM CPU %usr %nice %sys %iowait %irq %soft
> %steal %guest %gnice %idle
> > 08:55:15 PM all 51.58 0.00 1.64 0.01 0.00 0.42
> 0.00 0.00 0.00 46.36
> > 08:55:15 PM 0 62.34 0.00 2.69 0.06 0.00 0.00
> 0.00 0.00 0.00 34.90
> > 08:55:15 PM 1 62.84 0.00 2.58 0.02 0.00 0.00
> 0.00 0.00 0.00 34.55
> > 08:55:15 PM 2 62.04 0.00 2.52 0.00 0.00 0.00
> 0.00 0.00 0.00 35.44
> > 08:55:15 PM 3 62.17 0.00 2.62 0.00 0.00 0.00
> 0.00 0.00 0.00 35.21
> > 08:55:15 PM 4 62.14 0.00 2.54 0.02 0.00 0.00
> 0.00 0.00 0.00 35.30
> > 08:55:15 PM 5 62.06 0.00 2.51 0.00 0.00 0.00
> 0.00 0.00 0.00 35.43
> > 08:55:15 PM 6 61.93 0.00 2.58 0.01 0.00 0.00
> 0.02 0.00 0.00 35.46
> > 08:55:15 PM 7 61.99 0.00 2.54 0.00 0.00 0.00
> 0.00 0.00 0.00 35.48
> > 08:55:15 PM 8 62.01 0.00 2.52 0.00 0.00 0.00
> 0.00 0.00 0.00 35.47
> > 08:55:15 PM 9 61.88 0.00 2.49 0.00 0.00 0.00
> 0.00 0.00 0.00 35.63
> > 08:55:15 PM 10 62.18 0.00 2.48 0.00 0.00 0.00
> 0.00 0.00 0.00 35.34
> > 08:55:15 PM 11 61.89 0.00 2.47 0.00 0.00 0.00
> 0.00 0.00 0.00 35.64
> > 08:55:15 PM 12 61.98 0.00 2.50 0.00 0.00 0.00
> 0.00 0.00 0.00 35.52
> > 08:55:15 PM 13 61.80 0.00 2.54 0.00 0.00 0.00
> 0.00 0.00 0.00 35.66
> > 08:55:15 PM 14 61.96 0.00 2.56 0.00 0.00 0.00
> 0.00 0.00 0.00 35.48
> > 08:55:15 PM 15 62.11 0.00 2.57 0.00 0.00 0.00
> 0.00 0.00 0.00 35.31
> > 08:55:15 PM 16 62.03 0.00 2.55 0.00 0.00 0.00
> 0.00 0.00 0.00 35.42
> > 08:55:15 PM 17 61.90 0.00 2.53 0.00 0.00 0.00
> 0.00 0.00 0.00 35.57
> > 08:55:15 PM 18 58.36 0.00 2.41 0.09 0.00 3.72
> 0.00 0.00 0.00 35.43
> > 08:55:15 PM 19 58.29 0.00 2.45 0.02 0.00 4.00
> 0.00 0.00 0.00 35.24
> > 08:55:15 PM 20 58.45 0.00 2.36 0.02 0.00 3.73
> 0.00 0.00 0.00 35.44
> > 08:55:15 PM 21 58.56 0.00 2.34 0.04 0.00 3.54
> 0.00 0.00 0.00 35.51
> > 08:55:15 PM 22 58.12 0.00 2.46 0.00 0.00 3.84
> 0.00 0.00 0.00 35.58
> > 08:55:15 PM 23 58.24 0.00 2.36 0.01 0.00 3.82
> 0.00 0.00 0.00 35.57
> > 08:55:15 PM 24 58.84 0.00 2.42 0.00 0.00 3.60
> 0.00 0.00 0.00 35.14
> > 08:55:15 PM 25 58.35 0.00 2.43 0.00 0.00 3.72
> 0.00 0.00 0.00 35.51
> > 08:55:15 PM 26 62.09 0.00 2.47 0.00 0.00 0.00
> 0.00 0.00 0.00 35.45
> > 08:55:15 PM 27 62.09 0.00 2.46 0.00 0.00 0.00
> 0.00 0.00 0.00 35.45
> > 08:55:15 PM 28 62.15 0.00 2.48 0.00 0.00 0.00
> 0.00 0.00 0.00 35.37
> > 08:55:15 PM 29 62.09 0.00 2.46 0.00 0.00 0.00
> 0.00 0.00 0.00 35.45
> > 08:55:15 PM 30 62.74 0.00 2.45 0.00 0.00 0.00
> 0.00 0.00 0.00 34.81
> > 08:55:15 PM 31 62.14 0.00 2.47 0.00 0.00 0.00
> 0.00 0.00 0.00 35.39
> > 08:55:15 PM 32 62.35 0.00 2.52 0.00 0.00 0.00
> 0.00 0.00 0.00 35.13
> > 08:55:15 PM 33 62.13 0.00 2.42 0.00 0.00 0.00
> 0.00 0.00 0.00 35.46
> > 08:55:15 PM 34 62.08 0.00 2.47 0.00 0.00 0.00
> 0.00 0.00 0.00 35.45
> > 08:55:15 PM 35 62.15 0.00 2.49 0.00 0.00 0.00
> 0.00 0.00 0.00 35.35
> > 08:55:15 PM 36 43.19 0.00 0.80 0.00 0.00 0.00
> 0.00 0.00 0.00 56.01
> > 08:55:15 PM 37 43.09 0.00 0.83 0.01 0.00 0.00
> 0.00 0.00 0.00 56.06
> > 08:55:15 PM 38 42.59 0.00 0.78 0.01 0.00 0.00
> 0.00 0.00 0.00 56.63
> > 08:55:15 PM 39 42.60 0.00 0.82 0.00 0.00 0.00
> 0.00 0.00 0.00 56.58
> > 08:55:15 PM 40 42.94 0.00 0.76 0.02 0.00 0.00
> 0.00 0.00 0.00 56.28
> > 08:55:15 PM 41 42.11 0.00 0.76 0.01 0.00 0.00
> 0.00 0.00 0.00 57.13
> > 08:55:15 PM 42 43.26 0.00 0.85 0.01 0.00 0.00
> 0.00 0.00 0.00 55.88
> > 08:55:15 PM 43 42.23 0.00 0.77 0.01 0.00 0.00
> 0.00 0.00 0.00 56.99
> > 08:55:15 PM 44 42.24 0.00 0.80 0.00 0.00 0.00
> 0.00 0.00 0.00 56.95
> > 08:55:15 PM 45 41.65 0.00 0.79 0.01 0.00 0.00
> 0.00 0.00 0.00 57.55
> > 08:55:15 PM 46 41.76 0.00 0.77 0.01 0.00 0.00
> 0.00 0.00 0.00 57.46
> > 08:55:15 PM 47 42.66 0.00 0.82 0.00 0.00 0.00
> 0.00 0.00 0.00 56.52
> > 08:55:15 PM 48 41.83 0.00 0.76 0.01 0.00 0.00
> 0.00 0.00 0.00 57.40
> > 08:55:15 PM 49 42.04 0.00 0.82 0.00 0.00 0.00
> 0.00 0.00 0.00 57.13
> > 08:55:15 PM 50 41.55 0.00 0.78 0.01 0.00 0.00
> 0.00 0.00 0.00 57.67
> > 08:55:15 PM 51 42.37 0.00 0.78 0.00 0.00 0.00
> 0.00 0.00 0.00 56.84
> > 08:55:15 PM 52 41.66 0.00 0.77 0.00 0.00 0.00
> 0.00 0.00 0.00 57.57
> > 08:55:15 PM 53 41.77 0.00 0.79 0.00 0.00 0.00
> 0.00 0.00 0.00 57.45
> > 08:55:15 PM 54 40.33 0.00 0.88 0.01 0.00 0.00
> 0.00 0.00 0.00 58.79
> > 08:55:15 PM 55 41.62 0.00 0.78 0.01 0.00 0.00
> 0.00 0.00 0.00 57.59
> > 08:55:15 PM 56 41.51 0.00 0.82 0.00 0.00 0.00
> 0.00 0.00 0.00 57.67
> > 08:55:15 PM 57 40.35 0.00 0.76 0.00 0.00 0.00
> 0.00 0.00 0.00 58.89
> > 08:55:15 PM 58 41.92 0.00 0.79 0.01 0.00 0.00
> 0.00 0.00 0.00 57.28
> > 08:55:15 PM 59 41.48 0.00 0.80 0.00 0.00 0.00
> 0.00 0.00 0.00 57.72
> > 08:55:15 PM 60 40.83 0.00 0.75 0.01 0.00 0.00
> 0.00 0.00 0.00 58.41
> > 08:55:15 PM 61 40.95 0.00 0.78 0.01 0.00 0.00
> 0.00 0.00 0.00 58.26
> > 08:55:15 PM 62 40.85 0.00 0.80 0.01 0.00 0.00
> 0.00 0.00 0.00 58.35
> > 08:55:15 PM 63 42.11 0.00 0.77 0.01 0.00 0.00
> 0.00 0.00 0.00 57.12
> > 08:55:15 PM 64 42.22 0.00 0.78 0.01 0.00 0.00
> 0.00 0.00 0.00 56.99
> > 08:55:15 PM 65 41.55 0.00 0.78 0.00 0.00 0.00
> 0.00 0.00 0.00 57.67
> > 08:55:15 PM 66 41.68 0.00 0.81 0.00 0.00 0.00
> 0.00 0.00 0.00 57.51
> > 08:55:15 PM 67 41.06 0.00 0.87 0.01 0.00 0.00
> 0.00 0.00 0.00 58.07
> > 08:55:15 PM 68 41.80 0.00 0.79 0.00 0.00 0.00
> 0.00 0.00 0.00 57.41
> > 08:55:15 PM 69 41.38 0.00 0.75 0.00 0.00 0.00
> 0.00 0.00 0.00 57.87
> > 08:55:15 PM 70 42.17 0.00 0.75 0.00 0.00 0.00
> 0.00 0.00 0.00 57.08
> > 08:55:15 PM 71 41.26 0.00 0.81 0.00 0.00 0.00
> 0.01 0.00 0.00 57.92
> >
> >
> > 1> rp(recon:scheduler_usage(1000)).
> > [{1,0.8581586716045433},
> > {2,0.8607597847736659},
> > {3,0.8501543449289042},
> > {4,0.8619718426658023},
> > {5,0.8660226578705742},
> > {6,0.8555763933364806},
> > {7,0.8557195972097537},
> > {8,0.8668284625654983},
> > {9,0.858588675506532},
> > {10,0.8650501369640686},
> > {11,0.8615557214275474},
> > {12,0.8523392789683996},
> > {13,0.8469329348407398},
> > {14,0.8615637534991641},
> > {15,0.8591930739986557},
> > {16,0.8578596164201188},
> > {17,0.8549696086049403},
> > {18,0.8677098230685564},
> > {19,0.8556328340054443},
> > {20,0.8642542459479063},
> > {21,0.8670655179064826},
> > {22,0.8562638308551557},
> > {23,0.8511202058055239},
> > {24,0.8697339264798696},
> > {25,0.8594823462601853},
> > {26,0.8634550008001265},
> > {27,0.8568590130172198},
> > {28,0.864025757952233},
> > {29,0.8575614770622326},
> > {30,0.8611760239143281},
> > {31,0.8589789125669829},
> > {32,0.8537754468447621},
> > {33,0.8675538912040871},
> > {34,0.8607522803519515},
> > {35,0.8639389099445753},
> > {36,0.8743759133171144},
> > {37,0.9526507413773687},
> > {38,0.948925611917577},
> > {39,0.9511904612810083},
> > {40,0.9581015421004746},
> > {41,0.9534977201467516},
> > {42,0.0},
> > {43,0.0},
> > {44,0.0},
> > {45,0.9522518499948341},
> > {46,0.9453306755503674},
> > {47,0.9586374601140047},
> > {48,0.9574611845671873},
> > {49,0.950914304057791},
> > {50,0.9582124294398642},
> > {51,0.9448074463900029},
> > {52,0.9474771999511407},
> > {53,0.9450778098401229},
> > {54,0.0},
> > {55,0.9478167911711937},
> > {56,0.9612587860764551},
> > {57,0.943468696358506},
> > {58,0.9509498642900455},
> > {59,0.9603595134495414},
> > {60,0.9517397912369205},
> > {61,0.0},
> > {62,0.9587029756970658},
> > {63,0.0},
> > {64,0.0},
> > {65,0.9526167902065421},
> > {66,0.0},
> > {67,0.9475668680715805},
> > {68,0.0},
> > {69,0.9630301804839606},
> > {70,0.9475359592879946},
> > {71,0.9541718747681374},
> > {72,0.0}]
> >
> >
> > Thanks for your time,
> >
> > Jesse Stimpson
> > _______________________________________________
> > erlang-questions mailing list
> > erlang-questions@REDACTED
> > http://erlang.org/mailman/listinfo/erlang-questions
>
>
--
<http://www.republicwireless.com/>
Jesse Stimpson
Site Reliability Engineering
m: 9199950424 <(919)%20995-0424>
RepublicWireless.com <https://republicwireless.com/>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20190109/b2d5fc9c/attachment.htm>
More information about the erlang-questions
mailing list