supervisor process not responding to messages ('EXIT', which_children, etc)

Tue Apr 27 01:16:04 CEST 2010

The system is Windows 2k3 service pack 2, running as a VMWare guest.

I am running Erlang as a Windows service via erlsrv, so boot script and all.
Memory consumption is stable around 25MB, CPU usage seldom exceeds 10%.

The application seems to run fine for a while, then "strange" things start
happening. I have finally been able to catch it. I have a main application
supervisor, a couple of child supervisors, and many workers under each 2nd
level supervisor. All 2nd level supervisors are permanent. Restart strategy
is one_for_one.
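
A minimal sketch of this kind of tree, for illustration only. The module
name tree_sketch, the child ids, and the restart limits (3 restarts in 10
seconds) are hypothetical and not taken from the real application; the
2nd level supervisors are left without workers so the module stands alone.

```erlang
-module(tree_sketch).
-behaviour(supervisor).
-export([start_link/0, start_group/1, init/1]).

%% Start the top-level (application) supervisor.
start_link() ->
    supervisor:start_link(?MODULE, top).

%% Start one 2nd level supervisor; Name only labels it in this sketch.
start_group(Name) ->
    supervisor:start_link(?MODULE, {group, Name}).

init(top) ->
    %% one_for_one; at most 3 restarts within 10 seconds (hypothetical limits).
    Flags = {one_for_one, 3, 10},
    Groups = [group_spec(group_a), group_spec(group_b)],
    {ok, {Flags, Groups}};
init({group, _Name}) ->
    %% Worker child specs would be listed here; the list is left empty
    %% to keep the sketch self-contained.
    {ok, {{one_for_one, 3, 10}, []}}.

%% Child spec for a permanent 2nd level supervisor.
group_spec(Id) ->
    {Id, {?MODULE, start_group, [Id]}, permanent, infinity, supervisor, [?MODULE]}.
```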

In this case, 3 workers died within a few seconds of each other, causing the
2nd level supervisor to die. The application supervisor did not restart the
2nd level supervisor.

I started an interactive VM and connected to the service to investigate.
This is a copy of "Info" from appmon of the application supervisor:

-------------------------------------
Node: 'n1@REDACTED, Process: <0.97.0>
[{registered_name,ac_supervisor},
 {current_function,{proc_lib,sync_wait,2}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,3},
 {messages,[{'EXIT',<0.114.0>,shutdown},
            {'EXIT',<0.101.0>,shutdown},
            {'$gen_call',{<0.6225.0>,#Ref<0.0.167.237088>},which_children}]},
 {links,[<0.104.0>,<0.117.0>,<0.2213.0>,<0.98.0>,<0.95.0>]},
 {dictionary,[{'$ancestors',[<0.95.0>]},
              {'$initial_call',{supervisor,ac_supervisor,1}}]},
 {trap_exit,true},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.94.0>},
 {total_heap_size,754},
 {heap_size,377},
 {stack_size,21},
 {reductions,468},
 {garbage_collection,[{fullsweep_after,65535},{minor_gcs,4}]},
 {suspending,[]}]
-------------------------------------------
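
For reference, roughly the same fields can be read without appmon using
erlang:process_info/2. This is only a sketch: the module name
proc_snapshot and the choice of items are assumptions, and it has to run
on the node where the process lives, because process_info/2 does not
accept pids of processes on other nodes.

```erlang
-module(proc_snapshot).
-export([snapshot/1]).

%% Look up a locally registered name, then inspect the process.
snapshot(Name) when is_atom(Name) ->
    case whereis(Name) of
        undefined -> undefined;          % nothing registered under Name
        Pid -> snapshot(Pid)
    end;
%% Returns a list of {Item, Value} tuples, or undefined if Pid is dead.
snapshot(Pid) when is_pid(Pid) ->
    erlang:process_info(Pid, [registered_name, current_function, status,
                              message_queue_len, messages, links, trap_exit]).
```

With the module loaded on the service node, proc_snapshot:snapshot(ac_supervisor)
should return the queue contents and current_function shown above.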

The message queue concerns me. The last message, which_children, was me
running rpc:call(n1@REDACTED, supervisor, which_children, [ac_supervisor])
from my interactive VM. The rpc call did not return after many seconds, so I
killed the interactive VM. The first 2 messages appear to be the 2nd level
supervisors dying. According to the SASL log, these supervisors died over 2
days ago and have never been restarted. In testing, they have always been
restarted immediately up to the restart limit.
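
A sketch of the same query with a timeout, using rpc:call/5, so that the
interactive VM gets control back instead of blocking. The module name
rpc_probe and the wrapper function are hypothetical. The timeout only
frees the caller; the which_children request itself may well stay in the
supervisor's queue.

```erlang
-module(rpc_probe).
-export([children/3]).

%% Ask SupName on Node for its children, giving up after TimeoutMs.
children(Node, SupName, TimeoutMs) ->
    case rpc:call(Node, supervisor, which_children, [SupName], TimeoutMs) of
        {badrpc, Reason} -> {error, Reason};     % e.g. timeout or nodedown
        Children when is_list(Children) -> {ok, Children}
    end.
```

For example: rpc_probe:children(n1@REDACTED, ac_supervisor, 5000).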

For comparison, I got the same info from a "non-hung" supervisor. The
biggest difference, besides the empty message queue, was that
current_function was {gen_server,loop,6} instead of {proc_lib,sync_wait,2}.
I found the same on another stalled supervisor.
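
The comparison above could be automated with something like the sketch
below, which reports whether a process is sitting in gen_server:loop or
somewhere else, along with its queue length. The module name sup_state
and the result atoms are made up for illustration. The arity of
gen_server:loop is deliberately not matched, since it may differ between
OTP releases, and the result is only a snapshot of one instant. Like
process_info/2 itself, it must run on the node that owns the process.

```erlang
-module(sup_state).
-export([classify/1]).

classify(Name) when is_atom(Name) ->
    case whereis(Name) of
        undefined -> not_registered;
        Pid -> classify(Pid)
    end;
classify(Pid) when is_pid(Pid) ->
    %% process_info/2 returns the items in the order they were requested.
    case erlang:process_info(Pid, [current_function, message_queue_len]) of
        undefined ->
            dead;
        [{current_function, {gen_server, loop, _}}, {message_queue_len, Len}] ->
            {in_gen_server_loop, Len};
        [{current_function, MFA}, {message_queue_len, Len}] ->
            {elsewhere, MFA, Len}
    end.
```

On the hung supervisor this would be expected to give something like
{elsewhere,{proc_lib,sync_wait,2},3}, matching the dump above.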

I would love any help on what the problem is, what I can do to continue to
diagnose it, and what more data I can provide.

Thanks,
Garret Smith