[erlang-questions] Debugging scheduler not responding to erts_schedule_misc_aux_work

Maxim Fedorov <>
Wed Jul 11 18:29:41 CEST 2018


Thanks, Anthony.

This patch (and Erlang 20.3.8, ERTS 9.3.3.1) does not fix the problem. The same issue has been observed again (2 servers out of 384 stopped responding to aux work), so I believe it is a different issue.
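
For anyone who wants to detect the hang described in the original report quoted below without blocking an important process, here is a minimal sketch of a probe. It runs one of the affected calls, erlang:statistics(garbage_collection), in a throwaway process and gives up after a timeout. The module name aux_probe, the function probe/1 and the timeout value are my own illustrative choices, not anything from ERTS, and the sketch only reports that a call is stuck. It does not identify which scheduler failed to respond.

```erlang
-module(aux_probe).
-export([probe/1]).

%% Run a call that relies on misc aux work in a helper process, so the
%% caller is never the one left waiting forever.
%% Returns {ok, Result}, {error, Reason} or {timeout, ProcessInfo}.
probe(TimeoutMs) ->
    Parent = self(),
    Ref = make_ref(),
    {Pid, MRef} = spawn_monitor(fun() ->
        %% Assumption: this is one of the calls that hangs when a
        %% scheduler does not answer its misc aux work item.
        Parent ! {Ref, erlang:statistics(garbage_collection)}
    end),
    receive
        {Ref, Result} ->
            erlang:demonitor(MRef, [flush]),
            {ok, Result};
        {'DOWN', MRef, process, Pid, Reason} ->
            %% The helper died before it could send a result.
            {error, Reason}
    after TimeoutMs ->
        erlang:demonitor(MRef, [flush]),
        %% Record where the helper is stuck before killing it.
        Info = erlang:process_info(Pid, [current_function, status]),
        exit(Pid, kill),
        {timeout, Info}
    end.
```

On a healthy node aux_probe:probe(5000) should return {ok, {NumberOfGCs, WordsReclaimed, 0}}. On an affected node I would expect {timeout, Info}, with current_function pointing at erlang:gc_info/2.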

On 6/28/18, 18:14, "" <> wrote:

    It may or may not apply but we had a similar problem with system level
    work being scheduled on freed processes, and have been debugging it with
    Ericsson via
    
    https://bugs.erlang.org/browse/ERL-573
    
    for the last few months.  There's a branch of 20 in one of the later
    comments which might help, and might be worth a try.
    
    HTH,
    
    -Anthony
    
    
    > I’m trying to debug a weird condition in which any misc system task hangs.
    > It seems to affect OTP 20 (but not 16) on FreeBSD 10.3 and 11.
    >
    > It is a rare problem that appears after 5-7 days under some load (~40% CPU
    > average on a 48-core server).
    >
    > There is also a problem with erlang:statistics(runtime), which is affected by
    > this bug in the FreeBSD kernel:
    > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=227689 (so
    > erlang:statistics(runtime) always returns the same value); however, I doubt
    > it can affect anything.
    >
    > What happens: there are several calls, e.g.
    > erlang:statistics(garbage_collection), ets:all(),
    > erts_internal:system_check() and a few more. All of them call
    > erts_schedule_misc_aux_work. A misc aux work item is put into every
    > scheduler queue, and it seems that all of them except one respond. The VM is
    > still working and all other processes are fine, but the one that made the
    > call is waiting in erlang:gc_info/2 (or another corresponding function), with
    > the counter equal to 1. Since there is no timeout in the receive statement, it
    > waits forever.
    >
    > How do I debug this? Is there any way to find the scheduler that misbehaves?
    > It is one of the normal schedulers. I’m using gdb to attach to the BEAM VM.
    >
    > Unfortunately, I cannot run a debug VM (it is not able to handle the load).