[erlang-questions] monitor long_schedule and strange timeouts

Fri May 20 16:31:55 CEST 2016

Max Lapshin writes:
 > Once per hour we get strange situation on customer server.
 > 
 > Log is getting full with messages like:
 > 
 > 
 > 2016-05-20 02:06:52.204 <0.300.0> flu_sys_monitor:46 Monitor:
 > {monitor,<0.27485.22>,long_schedule,[{timeout,2591},{in,{gen_server,loop,6}},{out,{gen_server,loop,6}}]}
 > 
 > 2016-05-20 02:06:52.210 <0.300.0> flu_sys_monitor:46 Monitor:
 > {monitor,<0.342.0>,long_schedule,[{timeout,2595},{in,{live_info_storage,loop_setter,1}},{out,{live_info_storage,loop_setter,1}}]}
 > 
 > (loop_setter is a very small function that just reads message and set field
 > in ets table)
 > 
 > and many processes in system are getting stuck in different places like:
 > 
 > 
 > {current_stacktrace,[{erts_internal,await_result,1,[]}
 > 
 > 
 > Amount of erlang:statistics(total_active_tasks) is getting down from
 > average 800 to 200 when such situation happens.
 > 
 > 
 > I have two questions:

[snip]

 > 2) are there any hints how to debug situation? Is it an external or
 > something internal? If customer is running some task on this server, then
 > why long schedules may appear?

A. The server is slowing down for whatever reason.  Use normal OS-level
   monitoring tools to look for CPU, memory, or I/O hogs around the times
   your long schedules occur.  A backup can slow down your file accesses,
   for instance.

B. You may be using BIFs that perform a lot of work without yielding to
   the scheduler.  Unless the context of the long_schedule events gives
   enough hints as to which BIFs are involved, I'd use either gdb to dump
   the scheduler thread stacks at frequent intervals, or run a
   gprof-enabled VM and then analyze the gprof output.  Other profiling
   tools like perf should also work.

   We have hit a number of these long-running BIFs; some have been fixed,
   but not all.

   A special case is listing directories with huge numbers of files in them:
   that can take a very long time but consumes very little CPU.
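
To get more context out of the long_schedule events mentioned in B, a
monitor process can grab the stack of the reported process as soon as the
event arrives.  The following is only a minimal sketch: the module name
long_schedule_probe and the threshold passed to start/1 are hypothetical
and not taken from the system discussed above.  Note that the settings
made with erlang:system_monitor/2 are node-global, so this would replace
an existing monitor such as flu_sys_monitor rather than run next to it.
The stack is also sampled after the event, so it may show where the
process is now, not where it spent the time.

```erlang
-module(long_schedule_probe).
-export([start/1]).

%% Start a monitor process and ask the VM to send it a message whenever a
%% process or port runs uninterrupted for at least ThresholdMs milliseconds.
start(ThresholdMs) when is_integer(ThresholdMs), ThresholdMs > 0 ->
    Pid = spawn(fun loop/0),
    %% Node-global setting: replaces any previously installed monitor.
    _Old = erlang:system_monitor(Pid, [{long_schedule, ThresholdMs}]),
    Pid.

loop() ->
    receive
        {monitor, PidOrPort, long_schedule, Info} ->
            %% For processes, sample some extra state.  process_info/2
            %% returns undefined if the process has already exited.
            Extra = case is_pid(PidOrPort) of
                        true ->
                            erlang:process_info(PidOrPort,
                                                [current_stacktrace,
                                                 message_queue_len,
                                                 registered_name]);
                        false ->
                            []
                    end,
            error_logger:info_msg("long_schedule ~p~n info: ~p~n extra: ~p~n",
                                  [PidOrPort, Info, Extra]),
            loop();
        _Other ->
            %% Ignore anything that is not a long_schedule event.
            loop()
    end.
```

Calling long_schedule_probe:start(1000) from a shell would log every
schedule of one second or longer.  If the in/out locations and the sampled
stack still do not point at a particular BIF, the gdb or perf approach
above remains the next step.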