<div dir="ltr"><p class="MsoNormal"><span lang="EN-US">Hi,</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">We have found an unbalanced-scheduler problem in the Erlang VM
on our proxy servers (which process and forward requests from clients). </span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US"><b>Env:</b></span></p>
<p class="MsoNormal"><span lang="EN-US">Erlang Version: R16B03, SMP, 24 schedulers
online, +swt low</span></p>
<p class="MsoNormal"><span lang="EN-US">Mem: 64G, 1.6G occupied by beam.smp</span></p>
<p class="MsoNormal"><span lang="EN-US">CPU: 24 Xeon cores</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US"><b>Issue Description:</b></span></p>
<p class="MsoNormal"><span lang="EN-US">On the clients, we establish some TCP
keep-alive connections to the proxy servers; from time to time we also make short-lived
(non-keep-alive) TCP connections to them. The clients then send requests to the
proxy servers, which process them and forward them to backend servers.</span></p>
<p class="MsoNormal"><span lang="EN-US">During the test, we found that 11 schedulers
were at 100% usage while the other 13 were at 0% (idle), and the busy ones
had long run queues (rq->len around 100). Sometimes the split is 21 busy and 3 idle. This
state lasts anywhere from 30 minutes to 6 hours, unpredictably. Sometimes it
disappears and then, after an hour or a day, occasionally comes back.</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US"><b>Debug:</b></span></p>
<p class="MsoNormal"><span lang="EN-US">So I wrote a gdb script to capture the runq
flags and other struct members, to see what happens when the problem
occurs. collect.sh captures the runq state every 2 seconds; wakeix.sh
captures which runq (ix) is woken up while the schedulers are in the unbalanced
state.</span></p>
<p class="MsoNormal"><span lang="EN-US">//////////////////////////////////////////This
is a separator/////////////////////////////////////////</span></p>
<p class="MsoNormal"><span lang="EN-US">#cat collect.sh </span></p>
<p class="MsoNormal"><span lang="EN-US">#!/bin/sh</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">while [ 1 ];</span></p>
<p class="MsoNormal"><span lang="EN-US">do</span></p>
<p class="MsoNormal"><span lang="EN-US">date</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">#9075 is process id of beam.smp</span></p>
<p class="MsoNormal"><span lang="EN-US">gdb -p 9075 <<EOF</span></p>
<p class="MsoNormal"><span lang="EN-US">print_wakeup_reds</span></p>
<p class="MsoNormal"><span lang="EN-US">detach</span></p>
<p class="MsoNormal"><span lang="EN-US">quit</span></p>
<p class="MsoNormal"><span lang="EN-US">EOF</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">sleep 2</span></p>
<p class="MsoNormal"><span lang="EN-US">done</span></p>
<p class="MsoNormal"><span lang="EN-US">//////////////////////////////////////////This
is a separator/////////////////////////////////////////</span></p>
<p class="MsoNormal"><span lang="EN-US">#cat wakeix.sh </span></p>
<p class="MsoNormal"><span lang="EN-US">#!/bin/sh</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">gdb -p 9075 <<EOF</span></p>
<p class="MsoNormal"><span lang="EN-US">b wake_scheduler</span></p>
<p class="MsoNormal"><span lang="EN-US">c</span></p>
<p class="MsoNormal"><span lang="EN-US">print_rdi</span></p>
<p class="MsoNormal"><span lang="EN-US">detach</span></p>
<p class="MsoNormal"><span lang="EN-US">quit</span></p>
<p class="MsoNormal"><span lang="EN-US">EOF</span></p>
<p class="MsoNormal"><span lang="EN-US">//////////////////////////////////////////This
is a separator/////////////////////////////////////////</span></p>
<p class="MsoNormal"><span lang="EN-US">#cat .gdbinit </span></p>
<p class="MsoNormal"><span lang="EN-US">define print_rdi</span></p>
<p class="MsoNormal"><span lang="EN-US">
set $a = *(int*)$rdi</span></p>
<p class="MsoNormal"><span lang="EN-US">
printf "wake ix:%d\n", $a</span></p>
<p class="MsoNormal"><span lang="EN-US">end</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">define print_wakeup_reds</span></p>
<p class="MsoNormal"><span lang="EN-US">
set $balance_info = (long)&balance_info</span></p>
<p class="MsoNormal"><span lang="EN-US">
set $a = $balance_info + 664</span></p>
<p class="MsoNormal"><span lang="EN-US">
set $no_runqs = *(int*)$a</span></p>
<p class="MsoNormal"><span lang="EN-US">
set $addr = (long)&erts_aligned_run_queues</span></p>
<p class="MsoNormal"><span lang="EN-US">
set $start = *(long*)$addr</span></p>
<p class="MsoNormal"><span lang="EN-US">
set $i = 0</span></p>
<p class="MsoNormal"><span lang="EN-US">
while($i<24)</span></p>
<p class="MsoNormal"><span lang="EN-US"> set $a1 = $i*1024+$start+0x318</span></p>
<p class="MsoNormal"><span lang="EN-US"> set $a2 = $i*1024+$start+0x314</span></p>
<p class="MsoNormal"><span lang="EN-US"> set $a3 = $i*1024+$start+0x310</span></p>
<p class="MsoNormal"><span lang="EN-US"> set $a4 = $i*1024+$start+0x2e0</span></p>
<p class="MsoNormal"><span lang="EN-US"> set $a5 = $i*1024+$start+0x2dc</span></p>
<p class="MsoNormal"><span lang="EN-US"> set $a6 = $i*1024+$start+0x2d8</span></p>
<p class="MsoNormal"><span lang="EN-US"> set $a7 = $i*1024+$start</span></p>
<p class="MsoNormal"><span lang="EN-US"> set $wakeup_other_reds =
*(int*)$a1</span></p>
<p class="MsoNormal"><span lang="EN-US"> set $wakeup_other = *(int*)$a2</span></p>
<p class="MsoNormal"><span lang="EN-US"> set $len = *(int*)$a3</span></p>
<p class="MsoNormal"><span lang="EN-US"> set $flags = *(int*)$a4</span></p>
<p class="MsoNormal"><span lang="EN-US"> set $woken = *(int*)$a5</span></p>
<p class="MsoNormal"><span lang="EN-US"> set $waiting = *(int*)$a6</span></p>
<p class="MsoNormal"><span lang="EN-US"> set $ix = *(int*)$a7</span></p>
<p class="MsoNormal"><span lang="EN-US"> printf "ix:%d len:%d
wakeup_other:%d wakeup_other_reds:%d woken:%d waiting:%d flags:%x ", $ix,
$len, $wakeup_other, $wakeup_other_reds, $woken, $waiting, $flags</span></p>
<p class="MsoNormal"><span lang="EN-US"> parse_flags $flags</span></p>
<p class="MsoNormal"><span lang="EN-US"> printf "\n"</span></p>
<p class="MsoNormal"><span lang="EN-US"> set $i = $i+1</span></p>
<p class="MsoNormal"><span lang="EN-US">
end</span></p>
<p class="MsoNormal"><span lang="EN-US">
printf "no_runqs:%x run_q:%d empty_q:%d\n", $no_runqs,
erts_no_run_queues, no_empty_run_queues</span></p>
<p class="MsoNormal"><span lang="EN-US">
detach</span></p>
<p class="MsoNormal"><span lang="EN-US">
quit</span></p>
<p class="MsoNormal"><span lang="EN-US">end</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">define parse_flags</span></p>
<p class="MsoNormal"><span lang="EN-US">
set $f = $arg0</span></p>
<p class="MsoNormal"><span lang="EN-US">
if (($f&0x100000) != 0) </span></p>
<p class="MsoNormal"><span lang="EN-US"> printf "OUT_OF_WORK "</span></p>
<p class="MsoNormal"><span lang="EN-US">
end</span></p>
<p class="MsoNormal"><span lang="EN-US">
if (($f&0x200000) != 0) </span></p>
<p class="MsoNormal"><span lang="EN-US"> printf
"HALFTIME_OUT_OF_WORK "</span></p>
<p class="MsoNormal"><span lang="EN-US">
end</span></p>
<p class="MsoNormal"><span lang="EN-US">
if (($f&0x400000) != 0) </span></p>
<p class="MsoNormal"><span lang="EN-US"> printf "SUSPENDED "</span></p>
<p class="MsoNormal"><span lang="EN-US">
end</span></p>
<p class="MsoNormal"><span lang="EN-US">
if (($f&0x800000) != 0) </span></p>
<p class="MsoNormal"><span lang="EN-US"> printf "CHK_CPU_BIND
"</span></p>
<p class="MsoNormal"><span lang="EN-US">
end</span></p>
<p class="MsoNormal"><span lang="EN-US">
if (($f&0x1000000) != 0) </span></p>
<p class="MsoNormal"><span lang="EN-US"> printf "INACTIVE "</span></p>
<p class="MsoNormal"><span lang="EN-US">
end</span></p>
<p class="MsoNormal"><span lang="EN-US">
if (($f&0x2000000) != 0) </span></p>
<p class="MsoNormal"><span lang="EN-US"> printf "NONEMPTY "</span></p>
<p class="MsoNormal"><span lang="EN-US">
end</span></p>
<p class="MsoNormal"><span lang="EN-US">
if (($f&0x4000000) != 0) </span></p>
<p class="MsoNormal"><span lang="EN-US"> printf "PROTECTED"</span></p>
<p class="MsoNormal"><span lang="EN-US">
end</span></p>
<p class="MsoNormal"><span lang="EN-US">end</span></p>
<p class="MsoNormal"><span lang="EN-US">//////////////////////////////////////////This
is a separator/////////////////////////////////////////</span></p>
<p class="MsoNormal"><span lang="EN-US">I collected about 100M+ of data in 3 days. The
capture as a whole is not atomic, but the read of a single struct member (flags, for example)
can be regarded as atomic. Some odd data shows up together with the
unbalanced state, such as:</span></p>
<p class="MsoNormal"><span lang="EN-US">//////////////////////////////////////////This
is a separator/////////////////////////////////////////</span></p>
<p class="MsoNormal"><span lang="EN-US">ix:0 len:13 wakeup_other:0
wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK
HALFTIME_OUT_OF_WORK NONEMPTY</span></p>
<p class="MsoNormal"><span lang="EN-US">ix:1 len:5 wakeup_other:0
wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK
HALFTIME_OUT_OF_WORK NONEMPTY</span></p>
<p class="MsoNormal"><span lang="EN-US">ix:2 len:4 wakeup_other:0
wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK
HALFTIME_OUT_OF_WORK NONEMPTY</span></p>
<p class="MsoNormal"><span lang="EN-US">ix:3 len:0 wakeup_other:0
wakeup_other_reds:0 woken:0 waiting:0 flags:2300000 OUT_OF_WORK
HALFTIME_OUT_OF_WORK NONEMPTY</span></p>
<p class="MsoNormal"><span lang="EN-US">ix:4 len:0 wakeup_other:0 wakeup_other_reds:0
woken:0 waiting:0 flags:2300000 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY</span></p>
<p class="MsoNormal"><span lang="EN-US">ix:5 len:3 wakeup_other:0
wakeup_other_reds:0 woken:0 waiting:0 flags:2300010 OUT_OF_WORK
HALFTIME_OUT_OF_WORK NONEMPTY</span></p>
<p class="MsoNormal"><span lang="EN-US">ix:6 len:1 wakeup_other:0 wakeup_other_reds:0
woken:0 waiting:0 flags:2300004 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY</span></p>
<p class="MsoNormal"><span lang="EN-US">ix:7 len:1 wakeup_other:0
wakeup_other_reds:0 woken:0 waiting:0 flags:2300010 OUT_OF_WORK
HALFTIME_OUT_OF_WORK NONEMPTY</span></p>
<p class="MsoNormal"><span lang="EN-US">ix:8 len:0 wakeup_other:797
wakeup_other_reds:0 woken:0 waiting:0 flags:2300000 OUT_OF_WORK
HALFTIME_OUT_OF_WORK NONEMPTY</span></p>
<p class="MsoNormal"><span lang="EN-US">ix:9 len:4 wakeup_other:0
wakeup_other_reds:0 woken:0 waiting:0 flags:2300010 OUT_OF_WORK
HALFTIME_OUT_OF_WORK NONEMPTY</span></p>
<p class="MsoNormal"><span lang="EN-US">ix:10 len:12 wakeup_other:0
wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK
HALFTIME_OUT_OF_WORK NONEMPTY</span></p>
<p class="MsoNormal"><span lang="EN-US">ix:11 len:9 wakeup_other:0
wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK
HALFTIME_OUT_OF_WORK NONEMPTY</span></p>
<p class="MsoNormal"><span lang="EN-US">ix:12 len:17 wakeup_other:0
wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK
HALFTIME_OUT_OF_WORK NONEMPTY</span></p>
<p class="MsoNormal"><span lang="EN-US">ix:13 len:14 wakeup_other:720
wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK
HALFTIME_OUT_OF_WORK NONEMPTY</span></p>
<p class="MsoNormal"><span lang="EN-US">ix:14 len:14 wakeup_other:0
wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK
NONEMPTY</span></p>
<p class="MsoNormal"><span lang="EN-US">ix:15 len:3 wakeup_other:0
wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK
HALFTIME_OUT_OF_WORK NONEMPTY</span></p>
<p class="MsoNormal"><span lang="EN-US">ix:16 len:7 wakeup_other:0
wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK
HALFTIME_OUT_OF_WORK NONEMPTY</span></p>
<p class="MsoNormal"><span lang="EN-US">ix:17 len:11 wakeup_other:0
wakeup_other_reds:520 woken:0 waiting:0 flags:2300014 OUT_OF_WORK
HALFTIME_OUT_OF_WORK NONEMPTY</span></p>
<p class="MsoNormal"><span lang="EN-US">ix:18 len:14 wakeup_other:0
wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK
HALFTIME_OUT_OF_WORK NONEMPTY</span></p>
<p class="MsoNormal"><span lang="EN-US">ix:19 len:10 wakeup_other:0
wakeup_other_reds:0 woken:0 waiting:1 flags:2300014 OUT_OF_WORK
HALFTIME_OUT_OF_WORK NONEMPTY</span></p>
<p class="MsoNormal"><span lang="EN-US">ix:20 len:10 wakeup_other:0
wakeup_other_reds:0 woken:0 waiting:1 flags:2300014 OUT_OF_WORK
HALFTIME_OUT_OF_WORK NONEMPTY</span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:red">ix:21 len:0
wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:1 flags:3100000 OUT_OF_WORK
INACTIVE NONEMPTY</span></p>
<p class="MsoNormal"><span lang="EN-US">ix:22 len:0 wakeup_other:0
wakeup_other_reds:0 woken:0 waiting:1 flags:3100000 OUT_OF_WORK INACTIVE
NONEMPTY</span></p>
<p class="MsoNormal"><span lang="EN-US">ix:23 len:0 wakeup_other:0
wakeup_other_reds:0 woken:0 waiting:1 flags:1100000 OUT_OF_WORK INACTIVE</span></p>
<p class="MsoNormal"><span lang="EN-US">no_runqs:180015 run_q:24 empty_q:1</span></p>
<p class="MsoNormal"><span lang="EN-US">//////////////////////////////////////////This
is a separator/////////////////////////////////////////</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US"><b>Analysis:</b></span></p>
<p class="MsoNormal"><span lang="EN-US">no_runqs is printed in hex; 180015 means
24 online/balance runqs and 21 active runqs. The runq with ix=21 (the 22nd) has the
flags INACTIVE|NONEMPTY (the ERTS_RUNQ_FLG_ prefix is omitted for brevity). I read the
R16B03 wakeup code and found a problem in the function chk_wake_sched
when the runq to be woken has the flags INACTIVE|NONEMPTY.</span></p>
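<p class="MsoNormal"><span lang="EN-US">As a cross-check of the reading above, here is a minimal C sketch that decodes the no_runqs word and tests a flags word for the INACTIVE|NONEMPTY combination. The bit values are the ones used by parse_flags in the .gdbinit; the 16/16-bit split of no_runqs is an assumption that matches the 24/21 reading above, and the helper names decode_no_runqs and is_inactive_nonempty are made up for this example.</span></p>

```c
#include <assert.h>
#include <stdint.h>

/* Bit values as used by parse_flags in the .gdbinit (prefix ERTS_RUNQ_FLG_ omitted). */
#define FLG_OUT_OF_WORK 0x0100000u
#define FLG_INACTIVE    0x1000000u
#define FLG_NONEMPTY    0x2000000u

/* Assumed layout of the no_runqs word: low 16 bits = number of active runqs,
 * next 16 bits = number of runqs used for balancing. Hypothetical helper. */
static void decode_no_runqs(uint32_t word, int *active, int *used)
{
    assert(active && used);
    *active = (int)(word & 0xffffu);
    *used   = (int)((word >> 16) & 0xffffu);
}

/* Returns 1 when a flags word carries both INACTIVE and NONEMPTY. Hypothetical helper. */
static int is_inactive_nonempty(uint32_t flags)
{
    uint32_t both = FLG_INACTIVE | FLG_NONEMPTY;
    return (flags & both) == both;
}
```

<p class="MsoNormal"><span lang="EN-US">Under these assumptions, decode_no_runqs(0x180015, ...) gives active=21 and used=24, and is_inactive_nonempty is true for the flags of ix=21 and ix=22 (3100000) but false for ix=23 (1100000).</span></p>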
<p class="MsoNormal"><span lang="EN-US">//////////////////////////////////////////This
is a separator/////////////////////////////////////////</span></p>
<p class="MsoNormal"><span lang="EN-US">static void</span></p>
<p class="MsoNormal"><span lang="EN-US">wake_scheduler_on_empty_runq(ErtsRunQueue
*crq)</span></p>
<p class="MsoNormal"><span lang="EN-US">{</span></p>
<p class="MsoNormal"><span lang="EN-US">
int ix = crq->ix;</span></p>
<p class="MsoNormal"><span lang="EN-US">
int stop_ix = ix;</span></p>
<p class="MsoNormal"><span lang="EN-US">
int active_ix, balance_ix;</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">
get_no_runqs(&active_ix, &balance_ix);</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">
if (active_ix > balance_ix)</span></p>
<p class="MsoNormal"><span lang="EN-US"> active_ix
= balance_ix;</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">
if (ix >= active_ix)</span></p>
<p class="MsoNormal"><span lang="EN-US"> stop_ix
= ix = active_ix;</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">
/* Try to wake a scheduler on an active run queue */</span></p>
<p class="MsoNormal"><span lang="EN-US">
while (1) {</span></p>
<p class="MsoNormal"><span lang="EN-US"> <span style="color:red">A:</span>ix--;</span></p>
<p class="MsoNormal"><span lang="EN-US"> if
(ix < 0) {</span></p>
<p class="MsoNormal"><span lang="EN-US"> if (active_ix == stop_ix)</span></p>
<p class="MsoNormal"><span lang="EN-US"> <span style="color:red">B:</span> break;</span></p>
<p class="MsoNormal"><span lang="EN-US"> <span style="color:red">C:</span> ix =
active_ix - 1;</span></p>
<p class="MsoNormal"><span lang="EN-US"> }</span></p>
<p class="MsoNormal"><span lang="EN-US"> if
(ix == stop_ix)</span></p>
<p class="MsoNormal"><span lang="EN-US"> <span style="color:red">D:</span> break;</span></p>
<p class="MsoNormal"><span lang="EN-US"> <span style="color:red">E:</span> if (chk_wake_sched(crq, ix, 0))</span></p>
<p class="MsoNormal"><span lang="EN-US"> <span style="color:red">F:</span> return;</span></p>
<p class="MsoNormal"><span lang="EN-US"> }</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">
if (active_ix < balance_ix) {</span></p>
<p class="MsoNormal"><span lang="EN-US"> /*
Try to activate a new run queue and wake its scheduler */</span></p>
<p class="MsoNormal"><span lang="EN-US"> <span style="color:red">G:</span> (void) chk_wake_sched(crq, active_ix, 1);</span></p>
<p class="MsoNormal"><span lang="EN-US"> }</span></p>
<p class="MsoNormal"><span lang="EN-US">}</span></p>
<p class="MsoNormal"><span lang="EN-US">static ERTS_INLINE int</span></p>
<p class="MsoNormal"><span lang="EN-US">chk_wake_sched(ErtsRunQueue *crq, int ix,
int activate)</span></p>
<p class="MsoNormal"><span lang="EN-US">{</span></p>
<p class="MsoNormal"><span lang="EN-US">
Uint32 flags;</span></p>
<p class="MsoNormal"><span lang="EN-US">
ErtsRunQueue *wrq;</span></p>
<p class="MsoNormal"><span lang="EN-US">
if (crq->ix == ix)</span></p>
<p class="MsoNormal"><span lang="EN-US"> return
0;</span></p>
<p class="MsoNormal"><span lang="EN-US">
wrq = ERTS_RUNQ_IX(ix);</span></p>
<p class="MsoNormal"><span lang="EN-US">
flags = ERTS_RUNQ_FLGS_GET(wrq);</span></p>
<p class="MsoNormal"><span lang="EN-US"> <span style="color:red">H: </span>if (!(flags &
(ERTS_RUNQ_FLG_SUSPENDED|ERTS_RUNQ_FLG_NONEMPTY))) {</span></p>
<p class="MsoNormal" style="margin-left:21pt"><span lang="EN-US"> if (activate) {</span></p>
<p class="MsoNormal" style="margin-left:21pt"><span lang="EN-US"> if (try_inc_no_active_runqs(ix+1))</span></p>
<p class="MsoNormal" style="margin-left:21pt"><span lang="EN-US"> (void)
ERTS_RUNQ_FLGS_UNSET(wrq, ERTS_RUNQ_FLG_INACTIVE);</span></p>
<p class="MsoNormal" style="margin-left:21pt"><span lang="EN-US"> }</span></p>
<p class="MsoNormal" style="margin-left:21pt"><span lang="EN-US" style="color:red">I:</span><span lang="EN-US"> wake_scheduler(wrq, 0);</span></p>
<p class="MsoNormal" style="margin-left:21pt"><span lang="EN-US"> return 1;</span></p>
<p class="MsoNormal" style="margin-left:21pt"><span lang="EN-US"> }</span></p>
<p class="MsoNormal"><span lang="EN-US"> <span style="color:red"> J: </span>return 0;</span></p>
<p class="MsoNormal"><span lang="EN-US">}</span></p>
<p class="MsoNormal"><span lang="EN-US">//////////////////////////////////////////This
is a separator/////////////////////////////////////////</span></p><p class="MsoNormal"><span lang="EN-US"><br></span></p>
<p class="MsoNormal"><span lang="EN-US"><b>Root cause:</b></span></p>
<p class="MsoNormal"><span lang="EN-US">A possible execution path:</span></p>
<p class="MsoNormal"><span lang="EN-US">Step 1: The scheduler with runq ix=10 calls wakeup_other_check,
then falls through to wake_scheduler_on_empty_runq with active_ix=21 and stop_ix=10.</span></p>
<p class="MsoNormal"><span lang="EN-US">Step 2: A->E->H->J loops 10 times (ix=9 down to 0).</span></p>
<p class="MsoNormal"><span lang="EN-US">Step 3: A(ix=-1)->C(ix=20)->E->H->J</span></p>
<p class="MsoNormal"><span lang="EN-US">Step 4: A->E->H->J loops 9 times (ix=19 down to 11).</span></p>
<p class="MsoNormal"><span lang="EN-US">Step 5: A(ix=10)->D->G(active_ix=21)->H(ix=21)->J</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">The problem occurs in step 5 at H->J: the
runq with ix=21 (the 22nd) has the flags NONEMPTY|INACTIVE, so the condition at “H: if (!(flags
& (ERTS_RUNQ_FLG_SUSPENDED|ERTS_RUNQ_FLG_NONEMPTY)))” is false and execution jumps to “J: return
0;”. As a result, the 22nd runq (ix=21) is never woken up unless a later
check_balance call clears the NONEMPTY flag.</span></p>
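<p class="MsoNormal"><span lang="EN-US">The path in steps 1 to 5 can be replayed in a small standalone model. This is only a sketch: the flags array, the model_ functions and replay_dump are hypothetical stand-ins for the real run queues, and the model returns the index that would be woken instead of calling wake_scheduler. It assumes, as in the dump, that runqs 0 to 20 are active and NONEMPTY and that the caller is the scheduler with ix=10.</span></p>

```c
#include <assert.h>
#include <stdint.h>

#define NO_RUNQS      24
#define FLG_SUSPENDED 0x0400000u
#define FLG_INACTIVE  0x1000000u
#define FLG_NONEMPTY  0x2000000u

/* Model of the check at H: returns ix if that runq would be woken, else -1. */
static int model_chk_wake_sched(const uint32_t *flags, int crq_ix, int ix)
{
    if (crq_ix == ix)
        return -1;
    if (!(flags[ix] & (FLG_SUSPENDED | FLG_NONEMPTY)))
        return ix;
    return -1;
}

/* Model of the search loop (labels A..G): returns the woken index, or -1. */
static int model_wake_on_empty_runq(const uint32_t *flags, int crq_ix,
                                    int active_ix, int balance_ix)
{
    int ix = crq_ix, stop_ix = crq_ix, woken;

    assert(crq_ix >= 0 && crq_ix < NO_RUNQS && balance_ix <= NO_RUNQS);
    if (active_ix > balance_ix)
        active_ix = balance_ix;
    if (ix >= active_ix)
        stop_ix = ix = active_ix;

    while (1) {
        ix--;                                   /* A */
        if (ix < 0) {
            if (active_ix == stop_ix)
                break;                          /* B */
            ix = active_ix - 1;                 /* C */
        }
        if (ix == stop_ix)
            break;                              /* D */
        woken = model_chk_wake_sched(flags, crq_ix, ix);   /* E */
        if (woken >= 0)
            return woken;                       /* F */
    }
    if (active_ix < balance_ix)                 /* G: try to activate one more */
        return model_chk_wake_sched(flags, crq_ix, active_ix);
    return -1;
}

/* Rebuilds the dumped state with the given flags on runq ix=21. */
static int replay_dump(uint32_t flags_ix21)
{
    uint32_t flags[NO_RUNQS];
    int i;

    for (i = 0; i < 21; i++)
        flags[i] = FLG_NONEMPTY;
    flags[21] = flags_ix21;
    flags[22] = FLG_INACTIVE | FLG_NONEMPTY;
    flags[23] = FLG_INACTIVE;
    return model_wake_on_empty_runq(flags, 10, 21, 24);
}
```

<p class="MsoNormal"><span lang="EN-US">In this model, replay_dump(FLG_INACTIVE | FLG_NONEMPTY) returns -1 (nothing is woken), while replay_dump(FLG_INACTIVE) returns 21, which is the behaviour one would expect at G.</span></p>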
<p class="MsoNormal"><span lang="EN-US">But not every check_balance call
clears the NONEMPTY flag; whether it does depends on the history and the current workload. In
short, it is hard to know when the unbalanced state will end. In our
stress test, the unbalanced state lasted from 30 minutes to several hours,
depending on the test case.</span></p>
<p class="MsoNormal"><span lang="EN-US">The NONEMPTY|INACTIVE combination is also
dangerous: during testing, runq ix=10 (the 11th) once ended up with these flags, so only 11/24
of the CPUs could be used while the others sat idle, and the clients were stalled with long
latencies.</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US"><b>Where does NONEMPTY|INACTIVE flag come
from?</b></span></p>
<p class="MsoNormal"><span lang="EN-US">Two functions are involved:</span></p>
<p class="MsoNormal"><span lang="EN-US">static ERTS_INLINE void get_no_runqs(int
*active, int *used)</span></p>
<p class="MsoNormal"><span lang="EN-US">static ERTS_INLINE void set_no_active_runqs(int
active)</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">A possible path: </span></p>
<p class="MsoNormal"><span lang="EN-US">Time T1: Thread A runs the wakeup check,
calls get_no_runqs, gets active_ix=24 (all 24 runqs are active), and decides to
wake up runq ix=21 (the 22nd; its scheduler is called Thread W here).</span></p>
<p class="MsoNormal"><span lang="EN-US">Time T2: Thread B runs schedule, calls
check_balance, and sets the number of active runqs to 21 with set_no_active_runqs(21); runq ix=21
(the 22nd) now has the INACTIVE flag.</span></p>
<p class="MsoNormal"><span lang="EN-US">Time T3: Thread A calls wake_scheduler; Thread
W is woken from the kernel futex wait (scheduler_wait) and calls
non_empty_runq on runq ix=21, so the flags become INACTIVE|NONEMPTY.</span></p>
<p class="MsoNormal"><span lang="EN-US">Time T4: Thread W cannot steal processes from
another runq because of the INACTIVE flag, so it goes back to sleep on the futex with flags =
INACTIVE|NONEMPTY.</span></p>
<p class="MsoNormal"><span lang="EN-US">In the end, runq ix=21 is left with the flags
INACTIVE|NONEMPTY and is never woken up until some later, lucky check_balance call
gets it out of this state.</span></p>
<p class="MsoNormal"><span lang="EN-US"><font color="#ff0000" style="background-color:rgb(255,255,255)">The essence of the problem is that we are using
a value (active_ix) that is out of date (it has since been updated by another thread).</font></span></p>
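<p class="MsoNormal"><span lang="EN-US">The T1 to T4 interleaving can be written down as a deterministic, single-threaded replay. This is a sketch of the sequence described above, not of the real data structures: struct model, model_deactivate_above and replay_t1_to_t4 are hypothetical names, setting INACTIVE inside the same helper that lowers the active count is a simplification, and the woken runq is taken to be ix=21.</span></p>

```c
#include <assert.h>
#include <stdint.h>

#define NO_RUNQS     24
#define FLG_INACTIVE 0x1000000u
#define FLG_NONEMPTY 0x2000000u

/* Hypothetical stand-in for the shared scheduler state. */
struct model {
    int      no_active;         /* stands in for the active count in no_runqs */
    uint32_t flags[NO_RUNQS];   /* stands in for the per-runq flags */
};

/* Simplified: lower the active count and mark every runq above it INACTIVE. */
static void model_deactivate_above(struct model *m, int active)
{
    int ix;

    assert(active >= 0 && active <= NO_RUNQS);
    for (ix = active; ix < NO_RUNQS; ix++)
        m->flags[ix] |= FLG_INACTIVE;
    m->no_active = active;
}

/* Replays T1..T4 in a fixed order; returns the final flags of runq ix=21. */
static uint32_t replay_t1_to_t4(int *snapshot_was_stale)
{
    struct model m = { NO_RUNQS, {0} };
    const int target = 21;
    int snapshot;

    /* T1: thread A reads the active count (24) and picks runq 21 to wake. */
    snapshot = m.no_active;

    /* T2: thread B (check_balance) reduces the active count to 21. */
    model_deactivate_above(&m, 21);

    /* T3: A wakes W based on its old snapshot; W marks its runq non-empty. */
    *snapshot_was_stale = (snapshot != m.no_active);
    m.flags[target] |= FLG_NONEMPTY;

    /* T4: W may not steal work while INACTIVE and sleeps again;
     * nothing on this path clears either flag. */
    return m.flags[target];
}
```

<p class="MsoNormal"><span lang="EN-US">In this replay the snapshot taken at T1 no longer matches the active count at T3, and runq ix=21 ends with exactly INACTIVE|NONEMPTY set.</span></p>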
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US"><b>How to fix:</b></span></p>
<p class="MsoNormal"><span lang="EN-US">1. Use a mutex around every access to balance_info.no_runqs.
However, the critical region would be very long, a lot of code would have to be
modified, and performance may degrade.</span></p>
<p class="MsoNormal"><span lang="EN-US">2. Check for the INACTIVE|NONEMPTY flag combination in
chk_wake_sched. This means that we would sometimes wake up (activate) a NONEMPTY
runq.</span></p>
<p class="MsoNormal"><span lang="EN-US">3. Or is there another way?</span></p>
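<p class="MsoNormal"><span lang="EN-US">To make option 2 concrete, here is one possible reading of it as a standalone model. It is not a tested patch against erl_process.c: model_chk_wake_sched_opt2 is a hypothetical name, the flags live in a plain array, the try_inc_no_active_runqs step is left out, and the function only reports whether a wakeup would be issued.</span></p>

```c
#include <assert.h>
#include <stdint.h>

#define NO_RUNQS      24
#define FLG_SUSPENDED 0x0400000u
#define FLG_INACTIVE  0x1000000u
#define FLG_NONEMPTY  0x2000000u

/* Returns 1 if the scheduler of runq ix would be woken, 0 otherwise.
 * Change relative to the check at H: when activating, a runq that is
 * INACTIVE|NONEMPTY (and not SUSPENDED) is also accepted. */
static int model_chk_wake_sched_opt2(uint32_t *flags, int crq_ix, int ix,
                                     int activate)
{
    uint32_t f;
    uint32_t both = FLG_INACTIVE | FLG_NONEMPTY;
    int stuck;

    assert(ix >= 0 && ix < NO_RUNQS);
    if (crq_ix == ix)
        return 0;

    f = flags[ix];
    stuck = ((f & both) == both) && !(f & FLG_SUSPENDED);

    if (!(f & (FLG_SUSPENDED | FLG_NONEMPTY)) || (activate && stuck)) {
        if (activate)
            flags[ix] &= ~FLG_INACTIVE;   /* real code would first bump the active count */
        return 1;                         /* real code would call wake_scheduler here */
    }
    return 0;
}
```

<p class="MsoNormal"><span lang="EN-US">With flags[21] set to INACTIVE|NONEMPTY, a call with activate=1 from runq ix=10 returns 1 and leaves only NONEMPTY set, while a call with activate=0 still returns 0, so the ordinary search over active runqs is unchanged.</span></p>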
<p class="MsoNormal"><br></p>
<p class="MsoNormal"><span lang="EN-US">Best Regards,</span></p>
<p class="MsoNormal"><span lang="EN-US">Zijia</span></p></div>