[erlang-questions] finding hard to find bugs in production systems

Valentin Micic v@REDACTED
Sat Jan 23 06:58:55 CET 2010


In my experience, the behavior you're describing could be related to a process
message queue. Say you have a process that uses selective receive, matching
one pattern but ignoring the others. Such a process may accumulate a
substantial number of messages over time and cause increased CPU
utilization, because every selective receive has to scan past more and more
unmatched messages. Narrowing down the problem scope will certainly help your
debugging effort, so I suggest that you log in to the system and issue regs(),
which may indicate which process has more messages than it ought to.
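
Note that regs() only lists registered processes. As a rough sketch of the
same check extended to every process on the node, something like the module
below could be loaded and called from the shell. The module name mq_report and
its functions top/1 and print/1 are made up for illustration; only
erlang:processes/0 and erlang:process_info/2 are standard calls.

```erlang
-module(mq_report).
-export([top/1, print/1]).

%% Hypothetical helper, not part of OTP or YXA.
%% Returns up to N {QueueLen, Pid, Name} tuples, longest queue first.
top(N) when is_integer(N), N >= 0 ->
    Rows = [Row || Pid <- erlang:processes(),
                   Row <- [row(Pid)],
                   Row =/= undefined],
    lists:sublist(lists:reverse(lists:sort(Rows)), N).

%% process_info/2 returns undefined if the process has exited in the
%% meantime, so such processes are simply skipped.
row(Pid) ->
    case erlang:process_info(Pid, message_queue_len) of
        {message_queue_len, Len} -> {Len, Pid, name(Pid)};
        undefined -> undefined
    end.

%% Registered name if there is one, otherwise the atom 'unregistered'.
name(Pid) ->
    case erlang:process_info(Pid, registered_name) of
        {registered_name, Name} -> Name;
        _ -> unregistered
    end.

%% Print one line per process: queue length, pid, name.
print(N) ->
    lists:foreach(
      fun({Len, Pid, Name}) ->
              io:format("~8w  ~w  ~w~n", [Len, Pid, Name])
      end,
      top(N)).
```

Calling mq_report:print(10). from an attached shell should show the ten
processes with the longest message queues. A process whose queue keeps growing
between two calls is a good candidate for the selective receive problem
described above.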


-----Original Message-----
From: erlang-questions@REDACTED [mailto:erlang-questions@REDACTED] On
Behalf Of Fredrik Thulin
Sent: 22 January 2010 11:32 AM
To: erlang-questions@REDACTED
Subject: [erlang-questions] finding hard to find bugs in production systems


I have a rather large user of YXA that is experiencing problems with
beam consuming 100% CPU until they restart it, about once a month.

What are people doing to find this kind of bug? I seem to remember
someone writing that they dump the state of all processes periodically in
their production systems - does anyone have code to that effect to share?

I suspect the bug they are experiencing appears to be something similar


although it might of course be something in YXA... It is not that
particular problem although there are similarities, they are running R12
and on BSD (FreeBSD).

I don't think my user is really capable of attaching to the running
nodes and performing very much fault isolation when this happens, partly
because of lack of Erlang wizard status, and also because of urgency to
get the node back up running.

Any ideas (and especially code ;) ) will be greatly appreciated.


erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org
