[erlang-questions] Unbalanced Scheduler Problem caused by Using out of date active_no_runq
songlu cai
caisonglu@REDACTED
Tue Oct 28 08:13:57 CET 2014
Hi,
We found an Erlang unbalanced-scheduler problem on our proxy servers (which process and forward requests from clients).
*Env:*
Erlang Version: R16B03/SMP 24 schedulers online/-swt low
Mem: 64G, 1.6G occupied by beam.smp
CPU: 24 Xeon cores
*Issue Description:*
On the clients, we establish some TCP keep-alive connections to the proxy servers; from time to time, we also make some short (non-keep-alive) TCP connections. The clients then send requests to the proxy servers, and the proxies process and forward the requests to backend servers.
During the test, we found that 11 schedulers' usage was 100% while the other 13 schedulers were idle (0%), and the busy ones had long run queues (rq->len around 100). Sometimes it was 21 busy and 3 idle. This state lasts anywhere from 30 minutes to 6 hours in an unpredictable way. Sometimes it disappears and then, after an hour or a day, occasionally comes back again.
*Debug:*
So I wrote a gdb script to capture the runq flags and other struct members, to see what happens when the problem comes up. collect.sh captures the runq state every 2 seconds; wakeix.sh captures which runq (ix) is woken up while the schedulers are in the unbalanced state.
//////////////////////////////////////////This is a separator/////////////////////////////////////////
#cat collect.sh
#!/bin/sh
#9075 is the process id of beam.smp
while [ 1 ];
do
    date
    gdb -p 9075 <<EOF
print_wakeup_reds
detach
quit
EOF
    sleep 2
done
//////////////////////////////////////////This is a separator/////////////////////////////////////////
#cat wakeix.sh
#!/bin/sh
gdb -p 9075 <<EOF
b wake_scheduler
c
print_rdi
detach
quit
EOF
//////////////////////////////////////////This is a separator/////////////////////////////////////////
#cat .gdbinit
define print_rdi
    set $a = *(int*)$rdi
    printf "wake ix:%d\n", $a
end
define print_wakeup_reds
    set $balance_info = (long)&balance_info
    set $a = $balance_info + 664
    set $no_runqs = *(int*)$a
    set $addr = (long)&erts_aligned_run_queues
    set $start = *(long*)$addr
    set $i = 0
    while ($i < 24)
        set $a1 = $i*1024+$start+0x318
        set $a2 = $i*1024+$start+0x314
        set $a3 = $i*1024+$start+0x310
        set $a4 = $i*1024+$start+0x2e0
        set $a5 = $i*1024+$start+0x2dc
        set $a6 = $i*1024+$start+0x2d8
        set $a7 = $i*1024+$start
        set $wakeup_other_reds = *(int*)$a1
        set $wakeup_other = *(int*)$a2
        set $len = *(int*)$a3
        set $flags = *(int*)$a4
        set $woken = *(int*)$a5
        set $waiting = *(int*)$a6
        set $ix = *(int*)$a7
        printf "ix:%d len:%d wakeup_other:%d wakeup_other_reds:%d woken:%d waiting:%d flags:%x ", $ix, $len, $wakeup_other, $wakeup_other_reds, $woken, $waiting, $flags
        parse_flags $flags
        printf "\n"
        set $i = $i+1
    end
    printf "no_runqs:%x run_q:%d empty_q:%d\n", $no_runqs, erts_no_run_queues, no_empty_run_queues
    detach
    quit
end
define parse_flags
    set $f = $arg0
    if (($f&0x100000) != 0)
        printf "OUT_OF_WORK "
    end
    if (($f&0x200000) != 0)
        printf "HALFTIME_OUT_OF_WORK "
    end
    if (($f&0x400000) != 0)
        printf "SUSPENDED "
    end
    if (($f&0x800000) != 0)
        printf "CHK_CPU_BIND "
    end
    if (($f&0x1000000) != 0)
        printf "INACTIVE "
    end
    if (($f&0x2000000) != 0)
        printf "NONEMPTY "
    end
    if (($f&0x4000000) != 0)
        printf "PROTECTED"
    end
end
//////////////////////////////////////////This is a separator/////////////////////////////////////////
I captured 100M+ of data over 3 days. The capture is not atomic, but any single struct member (flags, for example) can be regarded as atomic. Some weird data shows up along with the unbalanced state, such as:
//////////////////////////////////////////This is a separator/////////////////////////////////////////
ix:0 len:13 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY
ix:1 len:5 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY
ix:2 len:4 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY
ix:3 len:0 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300000 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY
ix:4 len:0 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300000 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY
ix:5 len:3 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300010 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY
ix:6 len:1 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300004 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY
ix:7 len:1 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300010 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY
ix:8 len:0 wakeup_other:797 wakeup_other_reds:0 woken:0 waiting:0 flags:2300000 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY
ix:9 len:4 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300010 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY
ix:10 len:12 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY
ix:11 len:9 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY
ix:12 len:17 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY
ix:13 len:14 wakeup_other:720 wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY
ix:14 len:14 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY
ix:15 len:3 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY
ix:16 len:7 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY
ix:17 len:11 wakeup_other:0 wakeup_other_reds:520 woken:0 waiting:0 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY
ix:18 len:14 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY
ix:19 len:10 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:1 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY
ix:20 len:10 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:1 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY
ix:21 len:0 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:1 flags:3100000 OUT_OF_WORK INACTIVE NONEMPTY
ix:22 len:0 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:1 flags:3100000 OUT_OF_WORK INACTIVE NONEMPTY
ix:23 len:0 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:1 flags:1100000 OUT_OF_WORK INACTIVE
no_runqs:180015 run_q:24 empty_q:1
//////////////////////////////////////////This is a separator/////////////////////////////////////////
*Analysis:*
no_runqs is printed in hex: 0x180015 means 24 (0x18) online/balance runqs and 21 (0x15) active runqs. The runq with ix=21 (the 22nd) has flags INACTIVE|NONEMPTY (the ERTS_RUNQ_FLG_ prefix is omitted for brevity). I read the R16B03 wakeup code and found that something goes wrong in chk_wake_sched when the runq to be woken has flags INACTIVE|NONEMPTY.
//////////////////////////////////////////This is a separator/////////////////////////////////////////
static void
wake_scheduler_on_empty_runq(ErtsRunQueue *crq)
{
    int ix = crq->ix;
    int stop_ix = ix;
    int active_ix, balance_ix;
    get_no_runqs(&active_ix, &balance_ix);
    if (active_ix > balance_ix)
        active_ix = balance_ix;
    if (ix >= active_ix)
        stop_ix = ix = active_ix;
    /* Try to wake a scheduler on an active run queue */
    while (1) {
A:      ix--;
        if (ix < 0) {
            if (active_ix == stop_ix)
B:              break;
C:          ix = active_ix - 1;
        }
        if (ix == stop_ix)
D:          break;
E:      if (chk_wake_sched(crq, ix, 0))
F:          return;
    }
    if (active_ix < balance_ix) {
        /* Try to activate a new run queue and wake its scheduler */
G:      (void) chk_wake_sched(crq, active_ix, 1);
    }
}

static ERTS_INLINE int
chk_wake_sched(ErtsRunQueue *crq, int ix, int activate)
{
    Uint32 flags;
    ErtsRunQueue *wrq;
    if (crq->ix == ix)
        return 0;
    wrq = ERTS_RUNQ_IX(ix);
    flags = ERTS_RUNQ_FLGS_GET(wrq);
H:  if (!(flags & (ERTS_RUNQ_FLG_SUSPENDED|ERTS_RUNQ_FLG_NONEMPTY))) {
        if (activate) {
            if (try_inc_no_active_runqs(ix+1))
                (void) ERTS_RUNQ_FLGS_UNSET(wrq, ERTS_RUNQ_FLG_INACTIVE);
        }
I:      wake_scheduler(wrq, 0);
        return 1;
    }
J:  return 0;
}
//////////////////////////////////////////This is a separator/////////////////////////////////////////
*Root cause:*
A possible execution path:
Step 1: the scheduler with runq ix=10 calls wakeup_other_check and falls through to wake_scheduler_on_empty_runq, with active_ix=21 and stop_ix=10.
Step 2: A->E->H->J loops 10 times (ix=9 down to 0).
Step 3: A (ix=-1) -> C (ix=20) -> E -> H -> J.
Step 4: A->E->H->J loops 9 more times (ix=19 down to 11).
Step 5: A (ix=10) -> D -> G (active_ix=21) -> H (ix=21) -> J.
The problem is in step 5, H->J: the runq with ix=21 (the 22nd) has flags NONEMPTY|INACTIVE, so the test "H: if (!(flags & (ERTS_RUNQ_FLG_SUSPENDED|ERTS_RUNQ_FLG_NONEMPTY)))" fails and control jumps to "J: return 0;". As a result, the 22nd runq (ix=21) will never be woken up, unless some later check_balance call clears its NONEMPTY flag.
But not every check_balance call clears the NONEMPTY flag; it depends on the historical and current working set. In short, it is hard to know when we will get out of the unbalanced state. In our stress test it lasted from 30 minutes to several hours, varying with the test case.
The NONEMPTY|INACTIVE combination is also dangerous. During one test it was the runq with ix=10 (the 11th) that got this flag combination, so only 11 of the 24 CPUs could be used; the others were idle, and the clients were jammed with long latency.
*Where does the NONEMPTY|INACTIVE flag come from?*
Two functions are involved:
static ERTS_INLINE void get_no_runqs(int *active, int *used)
static ERTS_INLINE void set_no_active_runqs(int active)
A possible path:
Time T1: Thread A does wakeup_check and calls get_no_runqs, gets active_ix=24 (all 24 runqs active), and decides to wake up runq ix=20 (the 21st; its scheduler is called Thread W below).
Time T2: Thread B does schedule and calls check_balance, which sets the active runq count to 21 via set_no_active_runqs(21); the runq ix=21 (22nd) is flagged INACTIVE.
Time T3: Thread A does wake_scheduler; Thread W is woken from its kernel futex wait (scheduler_wait), and Thread W then calls non_empty_runq(runq ix=21), so the flag becomes INACTIVE|NONEMPTY.
Time T4: Thread W can't steal a process from another runq because of the INACTIVE flag, so it sleeps on the futex again, with flags = INACTIVE|NONEMPTY.
In the end, the runq ix=21 has flags INACTIVE|NONEMPTY and is never woken up until some lucky check_balance call gets it out of this situation.
The essence of the problem is that we are using a value (active_ix) that is out of date (it was updated by another thread).
*How to fix:*
1. Use a mutex wherever balance_info.no_runqs is involved; but the critical region would be very long, a lot of code would have to be modified, and performance may degrade.
2. Check for the INACTIVE|NONEMPTY flag combination in chk_wake_sched, which means sometimes waking up (activating) a NONEMPTY runq.
3. Or another way?
Best Regards,
Zijia