[erlang-questions] Debugging scheduler not responding to erts_schedule_misc_aux_work

Maxim Fedorov <>
Thu Jul 12 23:48:15 CEST 2018


Confirming it's a different issue.

I'm getting one scheduler stuck (waiting infinitely):

Thread 80 (Thread 802010800 (LWP 100332/beam.smp)):
#0  0x000000080136e13c in pthread_cleanup_pop () from /lib/libthr.so.3
#1  0x00000000005f664e in wait__ (e=0x84a801520, spincount=<value optimized out>, timeout=<value optimized out>) at pthread/ethr_event.c:468
#2  0x00000000004db9b1 in erts_schedule (esdp=<value optimized out>, p=<value optimized out>, calls=<value optimized out>) at erl_threads.h:3406
#3  0x0000000000433cdc in process_main (x_reg_array=0x84d04edc0, f_reg_array=0x84d050e40) at beam/beam_emu.c:1349
#4  0x00000000004d4c4b in sched_thread_func (vesdp=0x84b413fc0) at beam/erl_process.c:8912
#5  0x00000000005f5fac in thr_wrapper (vtwd=0x7fffffffdb08) at pthread/ethread.c:118
#6  0x0000000801362855 in pthread_create () from /lib/libthr.so.3
#7  0x0000000000000000 in ?? ()

This happens when there are two outstanding requests:

erlang:statistics(garbage_collection)
erlang:statistics(scheduler_wall_time)

Both of these are stuck waiting for a response from scheduler #28.
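As a side note, these calls sit in a plain receive with no timeout once the misc aux work has been scheduled, so a single unresponsive scheduler hangs the caller forever. While reproducing this, one way to keep the shell usable is to issue the call from a helper process with a timeout; a minimal sketch (SafeStats is just an illustrative name, not an ERTS API):

```erlang
%% Illustrative sketch: run a statistics call in a helper process so the
%% caller can give up after a timeout instead of hanging forever when a
%% scheduler never answers the misc aux work request.
SafeStats = fun(Item, TimeoutMs) ->
    Caller = self(),
    Ref = make_ref(),
    Pid = spawn(fun() -> Caller ! {Ref, erlang:statistics(Item)} end),
    receive
        {Ref, Result} -> {ok, Result}
    after TimeoutMs ->
        exit(Pid, kill),   %% the helper may be stuck in gc_info
        timeout
    end
end.
```

On a healthy node SafeStats(garbage_collection, 5000) returns {ok, {NumGCs, WordsReclaimed, 0}}; when a scheduler is wedged it returns timeout and kills the stuck helper.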

--- Scheduler 28 ---
 IX: 27
 CPU Binding: 27
 Aux work Flags: delayed-dealloc-thr-prgr fix-alloc-dealloc fix-alloc-lower-lim async-ready GARBAGE
 Sleep Info Flags:
 Pointer: (ErtsSchedulerData *) 0x84b413fc0
 - Run Queue -
  Length: total=1, max=0, high=0, normal=1, low=0, port=0
  Misc Jobs: no
  Queue Mask: normal
  Misc Flags: out-of-work non-empty
  Pointer: (ErtsRunQueue *) 0x84aa4c280

Originally there was nothing in this scheduler's run queue, but I forcibly bound a process to it (with process_flag(scheduler, 28)) to see whether the scheduler is stuck on the run queue lock.
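A sketch of such a probe (illustrative only; scheduler ids are 1-based, and the 5-second timeout is arbitrary). If the bound process never runs, the scheduler's run queue is not being serviced at all:

```erlang
%% Illustrative probe: pin a process to a suspect scheduler and see
%% whether it ever runs there. A timeout suggests that scheduler's run
%% queue is not being serviced.
Probe = fun(SchedId) ->
    Parent = self(),
    spawn(fun() ->
        process_flag(scheduler, SchedId),
        erlang:yield(),   %% force a reschedule onto the bound scheduler
        Parent ! {probe, SchedId, erlang:system_info(scheduler_id)}
    end),
    receive
        {probe, SchedId, RunningOn} -> {ok, RunningOn}
    after 5000 ->
        timeout
    end
end.
```

Probe(28) returning timeout here is what suggests the scheduler itself, not just the aux work path, is stuck.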

So I guess this is a rare bug I eventually hit, and it is different from the message-sent-to-an-exiting-process issue.

On 7/11/18, 09:30, Maxim Fedorov wrote:

    Thanks, Anthony.
    
    This patch (with Erlang 20.3.8, ERTS 9.3.3.1) does not fix the problem: the same issue has been observed (2 servers out of 384 stopped responding to aux work). I believe it's a different issue.
    
    On 6/28/18, 18:14, Anthony wrote:
    
        It may or may not apply, but we had a similar problem with system-level
        work being scheduled on freed processes, and have been debugging it with
        Ericsson via
        
        https://bugs.erlang.org/browse/ERL-573
        
        for the last few months.  There's a branch of 20 in one of the later
        comments which might help, and might be worth a try.
        
        HTH,
        
        -Anthony
        
        
        > I’m trying to debug some weird condition where any misc system task hangs.
        > It seems to affect OTP 20 (but not 16) on FreeBSD 10.3 and 11.
        >
        > It is a rare problem happening after 5-7 days under some load (~40% CPU
        > average on a 48-core server).
        >
        > There is also a problem with erlang:statistics(runtime), affected by this
        > bug in FreeBSD kernel:
        > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=227689
        > (so erlang:statistics(runtime) always returns the same value), however I
        > doubt it can affect anything.
        >
        > What happens:  there are several calls, e.g.
        > erlang:statistics(garbage_collection), ets:all(),
        > erts_internal:system_check() and few more. All of them do
        > erts_schedule_misc_aux_work. A misc aux work item is put into every
        > scheduler queue, and it seems that all of them except one respond. VM is
        > still working, all other processes are fine, but the one that did the call
        > is waiting in erlang:gc_info/2 (or another corresponding function), with
        > its counter equal to 1. Since there is no timeout in the receive
        > statement, it waits forever.
        >
        > How do I debug this? Is there any way to find a scheduler that misbehaves?
        > It is one of the normal schedulers. I’m using gdb to attach to the BEAM VM.
        >
        > Unfortunately, I cannot run debug VM (it is not able to handle the load).
        > _______________________________________________
        > erlang-questions mailing list
        > 
        > http://erlang.org/mailman/listinfo/erlang-questions
        >
        
        
        
    
    _______________________________________________
    erlang-questions mailing list
    
    http://erlang.org/mailman/listinfo/erlang-questions
    



More information about the erlang-questions mailing list