supervisor process not responding to messages ('EXIT', which_children, etc)

Garret Smith <>
Tue Apr 27 01:16:04 CEST 2010

The system is Windows 2k3 service pack 2, running as a VMware guest.

I am running Erlang as a Windows service via erlsrv, so boot script and all.
Memory consumption is stable around 25MB, CPU usage seldom exceeds 10%.

The application seems to run fine for a while, then "strange" things start
happening.  I have finally been able to catch it.  I have a main application
supervisor, a couple of child supervisors, and many workers under each 2nd
level supervisor.  All 2nd level supervisors are permanent.  Restart
strategy is one_for_one.
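
For illustration, the tree is shaped roughly like the sketch below.  This is
not the real code: the module name, the child names, the stand-in worker and
the restart limits (3 restarts in 10 seconds) are all made up for the example.
Only the overall shape (one top supervisor, permanent 2nd level supervisors,
one_for_one) matches what I described.

```erlang
-module(tree_sketch).
-behaviour(supervisor).
-export([start_link/0, start_group/1, start_worker/0, init/1]).

%% Top-level (application) supervisor.
start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, top).

%% 2nd level supervisor; Name is a hypothetical registered name.
start_group(Name) ->
    supervisor:start_link({local, Name}, ?MODULE, {group, Name}).

%% Stand-in worker that simply waits for a stop message.
start_worker() ->
    {ok, spawn_link(fun() -> receive stop -> ok end end)}.

init(top) ->
    %% Both 2nd level supervisors are permanent children.
    Groups = [{Name, {?MODULE, start_group, [Name]},
               permanent, infinity, supervisor, [?MODULE]}
              || Name <- [group_a_sup, group_b_sup]],
    %% Assumed limit: at most 3 restarts within 10 seconds.
    {ok, {{one_for_one, 3, 10}, Groups}};
init({group, _Name}) ->
    Workers = [{{worker, N}, {?MODULE, start_worker, []},
                permanent, 2000, worker, [?MODULE]}
               || N <- lists:seq(1, 3)],
    {ok, {{one_for_one, 3, 10}, Workers}}.
```

With limits like these, more than 3 worker deaths inside the 10 second window
make the 2nd level supervisor give up and exit, and the top supervisor is then
expected to restart it.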

In this case, 3 workers died within a few seconds of each other, causing the
2nd level supervisor to die.  The application supervisor did not restart the
2nd level supervisor.

I started an interactive VM and connected to the service to investigate.
This is a copy of "Info" from appmon for the application supervisor:

Node: ', Process: <0.97.0>


The message queue concerns me.  The last message, which_children, was me
calling rpc:call(, supervisor, which_children, []) from my interactive VM.
The rpc call did not return after many seconds, so I killed the interactive
VM.  The first 2 messages appear to be the 2nd level supervisors dying.
According to the SASL log, these supervisors died over 2 days ago and have
never been restarted.  In testing, they have always been restarted
immediately, up to the restart limit.
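
A gentler way to probe a supervisor that might be hung is sketched below.  As
far as I know, supervisor:which_children/1 is a gen_server call with no
timeout, so it blocks for as long as the supervisor does not answer.
rpc:call/5 takes a timeout, and erlang:process_info/2 does not need the target
process to cooperate at all.  The module name, Node, SupName and the 5000 ms
timeouts are placeholders for the example, not values from my system.

```erlang
-module(probe_sketch).
-export([which_children/3, info/2]).

%% Ask a possibly hung supervisor for its children, giving up after
%% TimeoutMs milliseconds.  Returns {badrpc, timeout} if it does not answer.
which_children(Node, SupName, TimeoutMs) ->
    rpc:call(Node, supervisor, which_children, [SupName], TimeoutMs).

%% Inspect the registered process SupName on Node without sending it
%% any request, so this works even if it never reads its mailbox.
info(Node, SupName) ->
    case rpc:call(Node, erlang, whereis, [SupName], 5000) of
        Pid when is_pid(Pid) ->
            Items = [status, current_function, message_queue_len,
                     messages, links],
            rpc:call(Node, erlang, process_info, [Pid, Items], 5000);
        Other ->
            %% undefined (not registered) or {badrpc, Reason}
            {error, Other}
    end.
```

Note that when rpc:call/5 times out, only the caller gives up.  The request
still sits in the supervisor's message queue, just like my which_children did.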

For comparison, I got the same info from a "non-hung" supervisor.  The
biggest difference, besides the empty message queue, was that
current_function was {gen_server,loop,6} instead of {proc_lib,sync_wait,2}.
I found the same on another stalled supervisor.
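
One thing that may be worth checking: proc_lib:sync_wait/2 is, as far as I
can tell, where proc_lib:start_link waits for a newly spawned process to
acknowledge its initialization.  If that is what is happening here, the
supervisor could be waiting on a child whose init never finishes.  This is a
guess, not a confirmed diagnosis.  The sketch below, meant to be run on the
service node itself, lists where each process linked to a given pid is
currently executing and prints the pid's backtrace.  The module and function
names are made up for the example.

```erlang
-module(stall_sketch).
-export([linked_info/1, print_backtrace/1]).

%% For each local process linked to Pid, report where it is executing.
%% Ports and remote pids are skipped, since process_info/2 only accepts
%% local pids.
linked_info(Pid) ->
    case erlang:process_info(Pid, links) of
        {links, Links} ->
            [{L, erlang:process_info(L, [current_function, initial_call,
                                         message_queue_len])}
             || L <- Links, is_pid(L), node(L) =:= node()];
        undefined ->
            %% Pid is no longer alive.
            []
    end.

%% Print the stack backtrace of Pid, if it is still alive.
print_backtrace(Pid) ->
    case erlang:process_info(Pid, backtrace) of
        {backtrace, Bin} -> io:format("~s~n", [Bin]);
        undefined        -> io:format("process ~p is not alive~n", [Pid])
    end.
```

A linked process whose initial_call is a child's start function and whose
current_function never changes would be a candidate for the child that the
supervisor is waiting on.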

I would love any help on what the problem is, what I can do to continue to
diagnose, what more data I can provide.

Garret Smith
