[erlang-questions] monitor long_schedule and strange timeouts
Fri May 20 16:31:55 CEST 2016
Max Lapshin writes:
> Once per hour we get a strange situation on a customer server.
> The log fills up with messages like:
> 2016-05-20 02:06:52.204 <0.300.0> flu_sys_monitor:46 Monitor:
> 2016-05-20 02:06:52.210 <0.300.0> flu_sys_monitor:46 Monitor:
> (loop_setter is a very small function that just reads a message and sets
> a field in an ets table)
> and many processes in the system are getting stuck in different places like:
> The value of erlang:statistics(total_active_tasks) drops from an
> average of 800 to 200 when this situation happens.
> I have two questions:
> 2) Are there any hints on how to debug this situation? Is it something
> external or something internal? If the customer is running some task on
> this server, then why might long schedules appear?
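To line the drop in erlang:statistics(total_active_tasks) up against
OS-level data, it may help to log timestamped samples of it. A minimal
sketch, assuming a reasonably recent OTP release; the module name
task_sampler and the output format are made up for illustration:

```erlang
-module(task_sampler).
-export([sample/0, run/2]).

%% Return {UnixTimeMs, ActiveTasks} for the current moment.
%% (Hypothetical helper; only the two erlang: BIFs are real API.)
sample() ->
    {erlang:system_time(millisecond),
     erlang:statistics(total_active_tasks)}.

%% Print N samples, sleeping IntervalMs milliseconds between them.
run(0, _IntervalMs) ->
    ok;
run(N, IntervalMs) when is_integer(N), N > 0 ->
    {Ts, Active} = sample(),
    io:format("~p total_active_tasks=~p~n", [Ts, Active]),
    timer:sleep(IntervalMs),
    run(N - 1, IntervalMs).
```

For example, task_sampler:run(3600, 1000) would print one sample per
second for an hour; the timestamps can then be compared with the output
of the OS tools.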
A. The server is slowing down for whatever reason. Use normal OS-level
monitoring tools to look for CPU, memory, or I/O hogs around the times
your long schedules occur. A backup can slow down your file accesses.
B. You may be using BIFs that perform a lot of work without yielding to
the scheduler. Unless the context of the long_schedule events gives
enough hints as to which BIFs are involved, I'd either use gdb to dump
the scheduler thread stacks at frequent intervals, or run a
gprof-enabled VM and then analyze the gprof output. Other profiling
tools like perf should also work.
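
As a rough sketch of how the context of the long_schedule events can be
captured: erlang:system_monitor/2 and the {monitor, PidOrPort,
long_schedule, Info} message are the documented interface, while the
module name long_sched_watch and the log format below are only
assumptions for illustration.

```erlang
-module(long_sched_watch).
-export([start/1, format_event/1]).

%% Spawn a process that receives long_schedule events for anything that
%% runs uninterrupted for at least ThresholdMs milliseconds.
%% erlang:system_monitor/2 returns the previous settings, which this
%% sketch simply discards.
start(ThresholdMs) when is_integer(ThresholdMs), ThresholdMs > 0 ->
    Pid = spawn(fun loop/0),
    _Old = erlang:system_monitor(Pid, [{long_schedule, ThresholdMs}]),
    Pid.

loop() ->
    receive
        {monitor, PidOrPort, long_schedule, Info} ->
            io:format("~s~n", [format_event({PidOrPort, Info})]),
            loop();
        _Other ->
            loop()
    end.

%% For a process, Info carries {timeout, Ms}, {in, Location} and
%% {out, Location}; for a port it carries {timeout, Ms} and {port_op, Op}.
format_event({PidOrPort, Info}) ->
    Timeout = proplists:get_value(timeout, Info),
    case proplists:get_value(port_op, Info) of
        undefined ->
            In = proplists:get_value(in, Info),
            Out = proplists:get_value(out, Info),
            io_lib:format("long_schedule ~p: ~p ms, in=~p out=~p",
                          [PidOrPort, Timeout, In, Out]);
        Op ->
            io_lib:format("long_schedule ~p: ~p ms, port_op=~p",
                          [PidOrPort, Timeout, Op])
    end.
```

The in and out locations are either {Module, Function, Arity} or
undefined, so they show where the process was scheduled in and out, not
necessarily which BIF it spent the time in. That is why gdb or a
profiler may still be needed.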
We have hit a number of these long-running BIFs; some have been fixed,
but not all.
A special case is listing directories with huge numbers of files in them:
that can take a very long time but consumes very little CPU.
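
To check whether a particular directory is affected, the elapsed time of
a listing can be measured directly. A small sketch, where the module
name dir_timing and the return shape are assumptions for illustration;
timer:tc/3 and file:list_dir/1 are standard library calls:

```erlang
-module(dir_timing).
-export([time_list_dir/1]).

%% Measure the wall-clock time of file:list_dir/1 on Dir.
%% Returns {ElapsedMs, EntryCount, ok | error}.
time_list_dir(Dir) ->
    {Micros, Result} = timer:tc(file, list_dir, [Dir]),
    Count = case Result of
                {ok, Names} -> length(Names);
                {error, _Reason} -> 0
            end,
    {Micros div 1000, Count, element(1, Result)}.
```

A large elapsed time here while the OS reports little CPU use would be
consistent with the case described above.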