Why would a <3K heap take 300+ milliseconds to GC?

Scott Lystig Fritchie <>
Mon May 17 08:34:41 CEST 2010


Hi, sorry this is more of a something-looks-too-weird-to-be-good thing
and not an honest bug.  We've witnessed a couple of processes on an
R13B04 node that started taking over 50 milliseconds for a single GC of
a less-than-3K heap ... then stretching to over 300 milliseconds for
(what should be) the same amount of garbage.

Things get weird the longer these two procs run.  They gradually start
triggering 'long_gc' events (minimum threshold: 50ms).  They reach a
plateau of roughly 1150-1350 reports/day/process for a few days, then
the # of reports/day/process goes exponential: 32K reports in about
half of one day, i.e. a GC performed for nearly every timer message
received.
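For context, these reports come from erlang:system_monitor/2 with a
{long_gc, 50} option.  A minimal sketch of such a watcher (the module
name here is my invention, not our actual monitoring code):

```erlang
%% Sketch of a long_gc watcher (module name is hypothetical).
%% erlang:system_monitor/2 makes the VM send this process a
%% {monitor, Pid, long_gc, Info} message whenever any process's
%% GC takes longer than 50 ms.
-module(gc_watcher).
-export([start/0, loop/0]).

start() ->
    Watcher = spawn(?MODULE, loop, []),
    erlang:system_monitor(Watcher, [{long_gc, 50}]),
    Watcher.

loop() ->
    receive
        {monitor, Pid, long_gc, Info} ->
            io:format("long_gc in ~p: ~p~n", [Pid, Info]),
            loop()
    end.
```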

The worst single GC time is 349ms.

The rest of the VM was not very busy: less than 20% CPU used on average
for an 8 core box.  On May 14th, the day that long_gc reports arrived
nearly once/sec, each lasting 150-270 milliseconds, average CPU
consumption increased only very slightly.

Does this ring a bell?  It's really, really strange behavior which (we
think) culminated in behavior unbecoming of a well-behaved virtual
machine.  If this was a legitimate signal that something was going
wrong, it's worth pursuing IMHO.  I can provide the code for the
gen_server if it'd be helpful.

-Scott

Distribution of long_gc times
=============================

Frequency     Range
---------     -----
     2543      50-100 milliseconds
    21003     100-199 milliseconds
    22704     200-299 milliseconds
      417     300-349 milliseconds

What the proc does
==================

The process receives a message once/second from an external timer that
tells it to net_adm:ping/1 a remote node and then rpc:call/4 to the
application controller to get the list of apps running on the remote
node.  The proc's #state record has roughly 6 fields, contains no
binaries, and only 1 field is ever updated (with a boolean) on each
iteration.  The application being monitored is never running, so the
same code path is taken (and thus, hopefully, the same amount of
garbage is generated) each time.
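A rough sketch of that once-per-second check (module and function names
here are my guesses, not the actual brick_clientmon code):

```erlang
%% Hypothetical sketch of the per-tick check: ping the remote node,
%% then rpc:call/4 the application controller for its running apps.
-module(app_poller).
-export([check_status/2]).

%% Returns true if App is currently running on Node, else false.
check_status(Node, App) ->
    case net_adm:ping(Node) of
        pong ->
            case rpc:call(Node, application, which_applications, []) of
                Apps when is_list(Apps) ->
                    %% which_applications/0 returns [{App, Desc, Vsn}]
                    lists:keymember(App, 1, Apps);
                {badrpc, _Reason} ->
                    false
            end;
        pang ->
            false
    end.
```

On each tick the real process only flips one boolean in its #state
record based on a result like this one, so the per-iteration garbage
should be small and nearly constant.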

Count of # of long_gc events per day, plus random sample of monitor message
===========================================================================

NOTE: There were two processes that were reporting these odd long_gc
events, <0.283.0> and <0.285.0>.  These are the counts for only one of
those two procs.

foreach i ( ?? )         ## directories per day, numbered 03-14
  echo -n $i " "
  cat $i/*/*/* | egrep long_gc | egrep '0.283.0' | wc -l
  cat $i/*/*/* | egrep long_gc | tail -1
  echo ""
end
03  0

04  0

05  7
{monitor,<0.283.0>,long_gc,[{timeout,57},{old_heap_block_size,2584},{heap_block_size,987},{mbuf_size,0},{stack_size,17},{old_heap_size,1387},{heap_size,26}]}

06  341
{monitor,<0.285.0>,long_gc,[{timeout,52},{old_heap_block_size,2584},{heap_block_size,2584},{mbuf_size,0},{stack_size,27},{old_heap_size,79},{heap_size,32}]}

07  1195
{monitor,<0.283.0>,long_gc,[{timeout,95},{old_heap_block_size,2584},{heap_block_size,1597},{mbuf_size,0},{stack_size,17},{old_heap_size,1387},{heap_size,26}]}

08  1152
{monitor,<0.285.0>,long_gc,[{timeout,106},{old_heap_block_size,2584},{heap_block_size,2584},{mbuf_size,0},{stack_size,27},{old_heap_size,79},{heap_size,35}]}

09  1238
{monitor,<0.285.0>,long_gc,[{timeout,162},{old_heap_block_size,2584},{heap_block_size,987},{mbuf_size,0},{stack_size,32},{old_heap_size,79},{heap_size,57}]}

10  1324
{monitor,<0.283.0>,long_gc,[{timeout,121},{old_heap_block_size,2584},{heap_block_size,987},{mbuf_size,0},{stack_size,34},{old_heap_size,1387},{heap_size,36}]}

11  1332
{monitor,<0.283.0>,long_gc,[{timeout,150},{old_heap_block_size,2584},{heap_block_size,1597},{mbuf_size,0},{stack_size,32},{old_heap_size,1387},{heap_size,32}]}

12  1340
{monitor,<0.283.0>,long_gc,[{timeout,208},{old_heap_block_size,2584},{heap_block_size,377},{mbuf_size,0},{stack_size,14},{old_heap_size,1387},{heap_size,34}]}

13  6198
{monitor,<0.283.0>,long_gc,[{timeout,174},{old_heap_block_size,2584},{heap_block_size,987},{mbuf_size,0},{stack_size,25},{old_heap_size,1387},{heap_size,608}]}

14  32540
{monitor,<0.283.0>,long_gc,[{timeout,185},{old_heap_block_size,2584},{heap_block_size,987},{mbuf_size,0},{stack_size,25},{old_heap_size,1387},{heap_size,608}]}


Stack backtrace of <0.283.0>
============================

=proc:<0.283.0>
State: Scheduled
Spawned as: proc_lib:init_p/5
Spawned by: <0.92.0>
Started: Fri Mar 26 13:26:55 2010
Message queue length: 1
Message queue: [check_status]
Number of heap fragments: 0
Heap fragment data: 0
Link list: []
Dictionary:
[{'$initial_call',{brick_clientmon,init,1}},{i_am_monitoring,{'',gdss}},{'$ancestors',[brick_mon_sup,brick_admin_sup,brick_sup,<0.88.0>]}]
Reductions: 472532540
Stack+heap: 987
OldHeap: 2584
Heap unused: 373
OldHeap unused: 2584
Stack dump:
Program counter: 0x00002aaaaab9f2c8 (gen_server:loop/6 + 288)
CP: 0x0000000000000000 (invalid)
arity = 0

0x00002aaac7d6c120 Return addr 0x00002aaaace58d70
(proc_lib:init_p_do_apply/3 + 56)
y(0)     []
y(1)     infinity
y(2)     brick_clientmon
y(3)
{state,'',gdss,#Fun<brick_admin.39.100067762>,#Fun<brick_admin.40.108326667>,#Ref<0.0.0.2895>,false}
y(4)     <0.283.0>
y(5)     <0.92.0>

0x00002aaac7d6c158 Return addr 0x0000000000867be8 (<terminate process
normally>)
y(0)     Catch 0x00002aaaace58d90 (proc_lib:init_p_do_apply/3 + 88)




More information about the erlang-bugs mailing list