[erlang-questions] Debugging scheduler not responding to erts_schedule_misc_aux_work
Thu Jun 28 22:35:30 CEST 2018
I’m trying to debug a strange condition in which a misc system task hangs and never completes.
It seems to affect OTP 20 (but not 16) on FreeBSD 10.3 and 11.
It is a rare problem that shows up after 5-7 days under moderate load (~40% average CPU on a 48-core server).
There is also a problem with erlang:statistics(runtime), which is affected by this bug in the FreeBSD kernel: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=227689 (so erlang:statistics(runtime) always returns the same value), but I doubt it is related.
What happens: several calls are affected, e.g. erlang:statistics(garbage_collection), ets:all(), erts_internal:system_check() and a few more. All of them use erts_schedule_misc_aux_work. A misc aux work item is put into every scheduler's queue, and it seems that all schedulers except one respond. The VM keeps working and all other processes are fine, but the process that made the call is stuck waiting in erlang:gc_info/2 (or the corresponding function for the other calls) with its counter equal to 1, i.e. one reply is still outstanding. Since the receive statement has no timeout, it waits forever.
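To make the failure mode concrete, here is a minimal sketch of the waiting pattern and of a probe that detects the hang without blocking the caller. The module name aux_probe and the functions probe/1,2 and model_collect/2 are hypothetical names made up for this illustration; model_collect/2 is only a simplified model of the reply-collection loop, not the actual OTP source.

```erlang
-module(aux_probe).
-export([probe/1, probe/2, model_collect/2]).

%% Simplified model (NOT the OTP source) of the reply-collection loop:
%% wait for N replies tagged with Ref. There is no 'after' clause, so a
%% single missing reply leaves the caller blocked with N =:= 1 forever.
model_collect(_Ref, 0) ->
    done;
model_collect(Ref, N) when N > 0 ->
    receive
        {Ref, _Reply} -> model_collect(Ref, N - 1)
    end.

%% Run Fun in a separate monitored process and give up after Timeout ms.
%% Returns {ok, Result}, {error, Reason} or {hung, Pid}.
probe(Fun) ->
    probe(Fun, 5000).

probe(Fun, Timeout) when is_function(Fun, 0) ->
    Parent = self(),
    Tag = make_ref(),
    {Pid, MRef} = spawn_monitor(fun() -> Parent ! {Tag, Fun()} end),
    receive
        {Tag, Result} ->
            erlang:demonitor(MRef, [flush]),
            {ok, Result};
        {'DOWN', MRef, process, Pid, Reason} ->
            {error, Reason}
    after Timeout ->
        %% Leave the worker alive so that it can be inspected later.
        erlang:demonitor(MRef, [flush]),
        {hung, Pid}
    end.
```

On a healthy node, aux_probe:probe(fun() -> erlang:statistics(garbage_collection) end) should return {ok, _}. When the condition described above occurs it returns {hung, Pid}, and erlang:process_info(Pid, current_stacktrace) shows where the worker is stuck. If the worker does eventually finish, its late reply is left in the caller's mailbox, which is acceptable for a diagnostic probe.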
How do I debug this? Is there any way to find the scheduler that is misbehaving? It is one of the normal schedulers. I’m using gdb to attach to the BEAM VM.
Unfortunately, I cannot run a debug VM (it is not able to handle the load).