[erlang-questions] How to diagnose stuck Erlang node
Mon Nov 28 15:54:21 CET 2011
How about using some dbg?
Note: be careful not to generate a huge amount of trace output on
your node, since that could itself kill the node. I don't know much about your
system or how heavy the load on the nodes is.
You can do something like:
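dbg:tracer(port, dbg:trace_port(file, "/tmp/usr_drv.dbg")).
dbg:p(user_drv, [m, c]).
%% The c flag only produces call traces once a trace pattern is set; tpl also covers local calls.
dbg:tpl(user_drv, []).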
This way you will have a file that contains a trace of all messages and
local function calls made in the user_drv process.
When the node gets stuck, we can see more details about what happened (note:
watch the file size too).
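
To keep the file size under control, here is a minimal sketch using wrap files and then reading the trace back. The "/tmp/usr_drv" prefix, the 10 MB limit and the count of 5 files are arbitrary values I picked for illustration, not something from your setup, and this assumes you can still get a shell on the node (or a remote shell to it):

```erlang
%% Run in an Erlang shell attached to the node under investigation.
%% Assumption: 5 wrap files of at most about 10 MB each keep enough history.
PortFun = dbg:trace_port(file, {"/tmp/usr_drv", wrap, ".dbg", 10000000, 5}).
{ok, _Tracer} = dbg:tracer(port, PortFun).

%% m = messages, c = calls. Call tracing only produces output for
%% functions matching a trace pattern; tpl covers local functions too.
{ok, _} = dbg:p(user_drv, [m, c]).
{ok, _} = dbg:tpl(user_drv, []).

%% Later, once the problem has been observed:
dbg:flush_trace_port().   %% push buffered trace data to disk
dbg:stop_clear().         %% stop tracing and clear the trace patterns

%% Print the collected trace from the wrap files (works from any node
%% that can read the files).
dbg:trace_client(file, {"/tmp/usr_drv", wrap, ".dbg"}).
```

With wrap files the oldest data gets overwritten, so the disk usage stays bounded at roughly size times count, at the cost of losing early history.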
On Mon, Nov 28, 2011 at 1:37 PM, Kirill Zaborsky <qrilka@REDACTED> wrote:
> Thanks, Dennis,
> I have created a crash dump on a test machine (using the halfword emulator)
> and got user_drv in the waiting state with
> Program counter - 0x0000000002845f00 (user_drv:server_loop/5 + 48)
> So it's at the same instruction (but not running).
> Disassembly shows:
> 0000000002845E80: i_func_info_IaaI 0 user_drv server_loop 5
> 0000000002845EA8: allocate_init_tIy 7 5 y(0)
> 0000000002845EC0: init_y y(1)
> 0000000002845ED0: move2_xyxy x(4) y(2) x(3) y(3)
> 0000000002845EE0: move2_xyxy x(2) y(4) x(1) y(5)
> 0000000002845EF0: move_ry x(0) y(6)
> 0000000002845F00: i_loop_rec_fr f(0000000002846C80) x(0)
> 0000000002845F10: i_select_tuple_arity2_rfAfAf x(0) f(0000000002846C40) 2 f(0000000002845F40) 3 f(0000000002846418)
> 0000000002845F40: i_get_tuple_element_rPx x(0) 0 x(1)
> 0000000002846C80: wait_f f(0000000002845F00)
> 0000000002846C90: badmatch_r x(0)
> So it's just a waiting loop. I don't see how the process could be running
> when the only output for some time was "ALIVE" messages every 15 minutes
> from run_erl.
> Looks like the only way to see what was going on is to get a complete crash
> dump, but it was truncated by heart :-\
> P.S. It's quite strange that the crash dump shows +48
> 2011/11/28 <dennis.novikov@REDACTED>
>> +48 does not point to an instruction start on a couple of 32-bit systems I
>> have access to, so I cannot assist you further.
>> To get an instruction dump named "user_drv.dis" in the beam process working
>> directory you can do
>> Happy bug-hunting.
>> On Mon, 28 Nov 2011 12:01:17 +0200, Kirill Zaborsky <qrilka@REDACTED>
>>> I'm using the halfword emulator on 64-bit Ubuntu Server.
>>> And the process state is not "waiting" but "running". Previous crash dumps
>>> show the same program counter value (and user_drv in running state).
>>> Kind regards,
>>> Kirill Zaborsky
>>> 2011/11/28 Dennis Novikov <dennis.novikov@REDACTED>
>>> On Mon, 28 Nov 2011 08:44:42 +0200, Kirill Zaborsky <qrilka@REDACTED>
>>>>> Trying to find any workaround to this "stuck node" scenario I've
>>>>> to R14B04 and turned on "heart".
>>>>> But recently the node once again stopped responding. And heart did not
>>>>> assume it to be stuck although I could not contact it.
>>>>> I've tried to get a crash dump with 'kill -USR1' but it appeared that
>>>>> once again the crash dump was truncated. Does heart kill "dead" erlang
>>>>> And the only thing that could be seen from the crash dump was that the only
>>>>> running process was user_drv (just like in previous times) with program
>>>>> counter equal to "user_drv:server_loop/5 + 48". Is it possible to find
>>>>> out what exactly it stands for?
>>>> Waiting on receive in that function. And you are observing this on a
>>>> 32-bit VM.
- Ahmed Omar