Finding hard-to-find bugs in production systems

Fri Jan 22 10:31:34 CET 2010

Hi

I have a rather large user of YXA who is experiencing problems with
beam consuming 100% CPU until they restart it. This happens about once
a month.

What are people doing to find this kind of bug? I seem to remember
someone writing that they dump the state of all processes periodically
in their production systems - does anyone have code to that effect to
share?

I suspect the bug they are experiencing is something similar to

http://old.nabble.com/100--CPU-usage-on-Mac-OS-X-Leopard-after-peer-closes-socket-td16731178.html

although it might of course be something in YXA. It is not that
particular problem, although there are similarities; they are running
R12 on FreeBSD.

I don't think my user is really capable of attaching to the running
nodes and doing much fault isolation when this happens, partly for
lack of Erlang wizard status, and partly because of the urgency to get
the node back up and running.

Any ideas (and especially code ;) ) will be greatly appreciated.

/Fredrik