[erlang-questions] Unbalanced Scheduler Problem caused by Using out of date active_no_runq

songlu cai caisonglu@REDACTED
Tue Oct 28 08:13:57 CET 2014


Hi,



We have found an Erlang unbalanced-scheduler problem on our proxy servers
(which process and forward requests from clients).



*Env:*

Erlang Version: R16B03/SMP 24 schedulers online/-swt low

Mem: 64G, 1.6G occupied by beam.smp

CPU: 24 Xeon cores



*Issue Description:*

On the clients, we establish some TCP keep-alive connections to the proxy
servers; from time to time, we also make some short (non-keep-alive) TCP
connections to them. The clients then send requests to the proxy servers,
and the proxies process and forward the requests to backend servers.

During the test, we find that 11 schedulers run at 100% usage while the
other 13 are idle (0%), and the busy ones have long run queues (rq->len
around 100). Sometimes it is 21 busy and 3 idle. This state lasts anywhere
from 30 minutes to 6 hours, in an unpredictable way. Sometimes it
disappears and then, after an hour or a day, occasionally comes back again.



*Debug:*

So I wrote some gdb scripts to capture the runq flags and other struct
members; I wanted to know what happens when the problem comes up.
collect.sh captures the runq state every 2 seconds, and wakeix.sh captures
which runq (ix) is woken up while the schedulers are in the unbalanced
state.

//////////////////////////////////////////This is a separator/////////////////////////////////////////

#cat collect.sh
#!/bin/sh

#9075 is process id of beam.smp
while [ 1 ];
do
        date
        gdb -p 9075 <<EOF
print_wakeup_reds
detach
quit
EOF
        sleep 2
done

//////////////////////////////////////////This is a separator/////////////////////////////////////////

#cat wakeix.sh
#!/bin/sh

gdb -p 9075 <<EOF
b wake_scheduler
c
print_rdi
detach
quit
EOF

//////////////////////////////////////////This is a separator/////////////////////////////////////////

#cat .gdbinit
define print_rdi
        set $a = *(int*)$rdi
        printf "wake ix:%d\n", $a
end



define print_wakeup_reds
        set $balance_info = (long)&balance_info
        set $a = $balance_info + 664
        set $no_runqs = *(int*)$a
        set $addr = (long)&erts_aligned_run_queues
        set $start = *(long*)$addr
        set $i = 0
        while($i<24)
                set $a1 = $i*1024+$start+0x318
                set $a2 = $i*1024+$start+0x314
                set $a3 = $i*1024+$start+0x310
                set $a4 = $i*1024+$start+0x2e0
                set $a5 = $i*1024+$start+0x2dc
                set $a6 = $i*1024+$start+0x2d8
                set $a7 = $i*1024+$start
                set $wakeup_other_reds = *(int*)$a1
                set $wakeup_other = *(int*)$a2
                set $len = *(int*)$a3
                set $flags = *(int*)$a4
                set $woken = *(int*)$a5
                set $waiting = *(int*)$a6
                set $ix = *(int*)$a7
                printf "ix:%d len:%d wakeup_other:%d wakeup_other_reds:%d woken:%d waiting:%d flags:%x ", $ix, $len, $wakeup_other, $wakeup_other_reds, $woken, $waiting, $flags
                parse_flags $flags
                printf "\n"
                set $i = $i+1
        end
        printf "no_runqs:%x run_q:%d empty_q:%d\n", $no_runqs, erts_no_run_queues, no_empty_run_queues
        detach
        quit
end



define parse_flags
        set $f = $arg0
        if (($f&0x100000) != 0)
                printf "OUT_OF_WORK "
        end
        if (($f&0x200000) != 0)
                printf "HALFTIME_OUT_OF_WORK "
        end
        if (($f&0x400000) != 0)
                printf "SUSPENDED "
        end
        if (($f&0x800000) != 0)
                printf "CHK_CPU_BIND "
        end
        if (($f&0x1000000) != 0)
                printf "INACTIVE "
        end
        if (($f&0x2000000) != 0)
                printf "NONEMPTY "
        end
        if (($f&0x4000000) != 0)
                printf "PROTECTED"
        end
end

//////////////////////////////////////////This is a separator/////////////////////////////////////////

I captured 100 MB+ of data over 3 days. The capture is not atomic, but the
read of a single struct member (flags, for example) can be regarded as
atomic. Some weird data shows up along with the unbalanced state, such as:

//////////////////////////////////////////This is a separator/////////////////////////////////////////

ix:0 len:13 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0
flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:1 len:5 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0
flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:2 len:4 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0
flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:3 len:0 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0
flags:2300000 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:4 len:0 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0
flags:2300000 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:5 len:3 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0
flags:2300010 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:6 len:1 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0
flags:2300004 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:7 len:1 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0
flags:2300010 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:8 len:0 wakeup_other:797 wakeup_other_reds:0 woken:0 waiting:0
flags:2300000 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:9 len:4 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0
flags:2300010 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:10 len:12 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0
flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:11 len:9 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0
flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:12 len:17 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0
flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:13 len:14 wakeup_other:720 wakeup_other_reds:0 woken:0 waiting:0
flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:14 len:14 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0
flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:15 len:3 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0
flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:16 len:7 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0
flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:17 len:11 wakeup_other:0 wakeup_other_reds:520 woken:0 waiting:0
flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:18 len:14 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0
flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:19 len:10 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:1
flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:20 len:10 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:1
flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:21 len:0 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:1
flags:3100000 OUT_OF_WORK INACTIVE NONEMPTY

ix:22 len:0 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:1
flags:3100000 OUT_OF_WORK INACTIVE NONEMPTY

ix:23 len:0 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:1
flags:1100000 OUT_OF_WORK INACTIVE

no_runqs:180015 run_q:24 empty_q:1

//////////////////////////////////////////This is a separator/////////////////////////////////////////



*Analysis:*

no_runqs is printed in hex: 0x180015 means 24 online/balance run queues and
21 active run queues. The runq with ix=21 (the 22nd) has the flags
INACTIVE|NONEMPTY (the ERTS_RUNQ_FLG_ prefix is omitted for brevity). I
read the R16B03 wakeup code and found that something goes wrong in the
function chk_wake_sched when a to-be-woken runq has the flags
INACTIVE|NONEMPTY.

//////////////////////////////////////////This is a separator/////////////////////////////////////////

static void
wake_scheduler_on_empty_runq(ErtsRunQueue *crq)
{
    int ix = crq->ix;
    int stop_ix = ix;
    int active_ix, balance_ix;

    get_no_runqs(&active_ix, &balance_ix);

    if (active_ix > balance_ix)
        active_ix = balance_ix;

    if (ix >= active_ix)
        stop_ix = ix = active_ix;

    /* Try to wake a scheduler on an active run queue */
    while (1) {
        A:  ix--;
        if (ix < 0) {
            if (active_ix == stop_ix)
        B:      break;
        C:  ix = active_ix - 1;
        }
        if (ix == stop_ix)
        D:  break;
        E:  if (chk_wake_sched(crq, ix, 0))
        F:      return;
    }

    if (active_ix < balance_ix) {
        /* Try to activate a new run queue and wake its scheduler */
        G:  (void) chk_wake_sched(crq, active_ix, 1);
    }
}

static ERTS_INLINE int
chk_wake_sched(ErtsRunQueue *crq, int ix, int activate)
{
    Uint32 flags;
    ErtsRunQueue *wrq;
    if (crq->ix == ix)
        return 0;
    wrq = ERTS_RUNQ_IX(ix);
    flags = ERTS_RUNQ_FLGS_GET(wrq);
    H:  if (!(flags & (ERTS_RUNQ_FLG_SUSPENDED|ERTS_RUNQ_FLG_NONEMPTY))) {
        if (activate) {
            if (try_inc_no_active_runqs(ix+1))
                (void) ERTS_RUNQ_FLGS_UNSET(wrq, ERTS_RUNQ_FLG_INACTIVE);
        }
    I:  wake_scheduler(wrq, 0);
        return 1;
    }
    J:  return 0;
}

//////////////////////////////////////////This is a separator/////////////////////////////////////////


*Root cause:*

A possible execution path:

Step 1: the scheduler owning runq ix=10 calls wakeup_other_check, which
falls through to wake_scheduler_on_empty_runq; active_ix=21, stop_ix=10.

Step 2: A->E->H->J loops 10 times (ix=9..0).

Step 3: A (ix=-1) -> C (ix=20) -> E -> H -> J.

Step 4: A->E->H->J loops 9 times (ix=19..11).

Step 5: A (ix=10) -> D -> G (active_ix=21) -> H (ix=21) -> J.



The problem happens in step 5, H->J: the runq ix=21 (the 22nd) has the
flags NONEMPTY|INACTIVE, so the condition "H: if (!(flags &
(ERTS_RUNQ_FLG_SUSPENDED|ERTS_RUNQ_FLG_NONEMPTY)))" is false and control
jumps to "J: return 0;". As a result, the 22nd runq (ix=21) will never be
woken up unless another check_balance call clears the NONEMPTY flag.

But not every check_balance call wants to clear the NONEMPTY flag; that
depends on the workload history and the current working set. In short, it
is hard to know when we will get out of such an unbalanced state. In our
stress tests the unbalanced state lasted from 30 minutes to several hours,
depending on the test case.

The NONEMPTY|INACTIVE combination is also dangerous elsewhere: during one
test the runq ix=10 (the 11th) ended up with these flags, so only 11 of
the 24 CPUs could be used while the others were idle, and the clients were
jammed with long latencies.



*Where does NONEMPTY|INACTIVE flag come from?*

Two functions are involved:

static ERTS_INLINE void get_no_runqs(int *active, int *used)

static ERTS_INLINE void set_no_active_runqs(int active)



A possible path:

Time T1: Thread A does a wakeup check and calls get_no_runqs, gets
active_ix=24 (all 24 runqs are active), and decides to wake up runq ix=20
(the 21st; call its scheduler Thread W).

Time T2: Thread B schedules, calls check_balance, and sets the number of
active runqs to 21 via set_no_active_runqs(21); the runq ix=21 (the 22nd)
now has the flag INACTIVE.

Time T3: Thread A calls wake_scheduler, and Thread W is woken up from its
kernel futex wait (scheduler_wait); Thread W then calls
non_empty_runq(runq ix=21), so the flags become INACTIVE|NONEMPTY.

Time T4: Thread W can't steal processes from another runq because of the
INACTIVE flag, so it sleeps on the futex again, with flags =
INACTIVE|NONEMPTY.

In the end, the runq ix=21 has the flags INACTIVE|NONEMPTY and will never
be woken up until another, lucky, check_balance call helps it out of this
situation.

The essence of the problem is that we use a value (active_ix) that is out
of date (it has been updated by other threads in the meantime).



*How to fix:*

1. Use a mutex around everything involving balance_info.no_runqs; but the
critical region would be very long, a lot of code would have to be
modified, and performance might degrade.

2. Check for the INACTIVE|NONEMPTY flag combination in chk_wake_sched;
this means that sometimes we wake up (activate) a NONEMPTY runq.

3. Or another way?


Best Regards,

Zijia