supervisor process not responding to messages ('EXIT', which_children, etc)
Garret Smith
garret.smith@REDACTED
Tue Apr 27 01:16:04 CEST 2010
The system is Windows 2k3 service pack 2, running as a VMWare guest.
I am running Erlang as a Windows service via erlsrv, so boot script and all.
Memory consumption is stable around 25MB, CPU usage seldom exceeds 10%.
The application seems to run fine for a while, then "strange" things
start happening.
I have finally been able to catch it. I have a main application
supervisor, a couple of child supervisors, and many workers under each
2nd level supervisor. All 2nd level supervisors are permanent. The
restart strategy is one_for_one.
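For readers who want to reproduce the layout, here is a minimal sketch of
a tree with this shape. It is not the actual application code: the module
name tree_sketch, the registered names, the placeholder worker, the
worker restart type, and the restart limit of 3 restarts in 10 seconds
are all assumptions made for illustration.

```erlang
-module(tree_sketch).
-behaviour(supervisor).
-export([start_link/0, start_group/1, start_worker/0, init/1]).

%% Top-level (application) supervisor. The registered name is hypothetical.
start_link() ->
    supervisor:start_link({local, tree_sketch_top}, ?MODULE, top).

%% A 2nd level supervisor, registered under the given (hypothetical) name.
start_group(Name) ->
    supervisor:start_link({local, Name}, ?MODULE, group).

%% Placeholder worker: a linked process that idles until told to stop.
start_worker() ->
    {ok, spawn_link(fun() -> receive stop -> ok end end)}.

%% Top level: one_for_one over permanent 2nd level supervisors.
init(top) ->
    Groups = [{Name, {?MODULE, start_group, [Name]},
               permanent, infinity, supervisor, [?MODULE]}
              || Name <- [group_a_sup, group_b_sup]],
    %% Assumed limit: at most 3 restarts within 10 seconds.
    {ok, {{one_for_one, 3, 10}, Groups}};
%% 2nd level: one_for_one over the workers (restart type assumed).
init(group) ->
    Workers = [{{worker, N}, {?MODULE, start_worker, []},
                permanent, 2000, worker, [?MODULE]}
               || N <- lists:seq(1, 3)],
    {ok, {{one_for_one, 3, 10}, Workers}}.
```

With limits like these, a 2nd level supervisor gives up once its workers
exceed the restart limit, and the top-level supervisor is then expected
to restart it because its child spec is permanent.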
In this case, 3 workers died within a few seconds of each other, causing
the 2nd level supervisor to die. The application supervisor did not
restart the 2nd level supervisor.
I started an interactive VM and connected to the service to investigate.
This is a copy of "Info" from appmon of the application supervisor:
-------------------------------------
Node: 'n1@REDACTED, Process: <0.97.0>
[{registered_name,ac_supervisor},
 {current_function,{proc_lib,sync_wait,2}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,3},
 {messages,[{'EXIT',<0.114.0>,shutdown},
            {'EXIT',<0.101.0>,shutdown},
            {'$gen_call',{<0.6225.0>,#Ref<0.0.167.237088>},which_children}]},
 {links,[<0.104.0>,<0.117.0>,<0.2213.0>,<0.98.0>,<0.95.0>]},
 {dictionary,[{'$ancestors',[<0.95.0>]},
              {'$initial_call',{supervisor,ac_supervisor,1}}]},
 {trap_exit,true},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.94.0>},
 {total_heap_size,754},
 {heap_size,377},
 {stack_size,21},
 {reductions,468},
 {garbage_collection,[{fullsweep_after,65535},{minor_gcs,4}]},
 {suspending,[]}]
-------------------------------------------
The message queue concerns me. The last message, which_children, was me
running rpc:call(n1@REDACTED, supervisor, which_children, []) from my
interactive VM. The rpc call did not return after many seconds, so I
killed the interactive VM. The first 2 messages appear to be the 2nd
level supervisors dying. According to the SASL log, these supervisors
died over 2 days ago and have never been restarted. In testing, they
have always been restarted immediately, up to the restart limit.
For comparison, I got the same info from a "non-hung" supervisor. The
biggest difference, besides the empty message queue, was that
current_function was {gen_server,loop,6} instead of
{proc_lib,sync_wait,2}. I found the same on another stalled supervisor.
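The comparison above can also be made without appmon. The following is a
minimal sketch, not part of the original investigation: the module name
sup_probe and both function names are hypothetical, and the "stalled"
test is only a heuristic based on the two symptoms reported here (a
non-empty message queue while current_function is
{proc_lib,sync_wait,2}).

```erlang
-module(sup_probe).
-export([snapshot/1, looks_stalled/1]).

%% Collect the fields compared in this post for a locally registered
%% process. Returns {ok, InfoList} or {error, Reason}.
snapshot(Name) when is_atom(Name) ->
    case whereis(Name) of
        undefined ->
            {error, not_registered};
        Pid ->
            Items = [current_function, message_queue_len, status],
            case erlang:process_info(Pid, Items) of
                undefined -> {error, not_alive};
                Info -> {ok, Info}
            end
    end.

%% Heuristic only: messages are queued while the process sits in
%% proc_lib:sync_wait/2 instead of its normal receive loop.
looks_stalled(Info) ->
    {message_queue_len, Len} = lists:keyfind(message_queue_len, 1, Info),
    {current_function, MFA} = lists:keyfind(current_function, 1, Info),
    Len > 0 andalso MFA =:= {proc_lib, sync_wait, 2}.
```

erlang:process_info/2 only accepts local pids, so from a second node the
module would have to be loaded on the service node and called through
rpc, for example rpc:call(Node, sup_probe, snapshot, [ac_supervisor],
5000). The five-argument form of rpc:call takes a timeout in
milliseconds, so the shell gets {badrpc,timeout} back instead of
blocking. Because process_info does not send a message to the inspected
process, it still returns when that process is not servicing its queue.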
I would love any help on what the problem is, what I can do to continue to
diagnose, what more data I can provide.
Thanks,
Garret Smith