[erlang-questions] How to diagnose stuck Erlang node

Kirill Zaborsky qrilka@REDACTED
Mon Nov 28 13:37:49 CET 2011


Thanks, dennis,
I created a crash dump on a test machine (using the halfword emulator) and
got user_drv in the waiting state with
Program counter - 0x0000000002845f00 (user_drv:server_loop/5 + 48)
So it's on the same instruction (but not running)

Disassembly shows:
-----------------
0000000002845E80: i_func_info_IaaI 0 user_drv server_loop 5
0000000002845EA8: allocate_init_tIy 7 5 y(0)
0000000002845EC0: init_y y(1)
0000000002845ED0: move2_xyxy x(4) y(2) x(3) y(3)
0000000002845EE0: move2_xyxy x(2) y(4) x(1) y(5)
0000000002845EF0: move_ry x(0) y(6)
0000000002845F00: i_loop_rec_fr f(0000000002846C80) x(0)
0000000002845F10: i_select_tuple_arity2_rfAfAf x(0) f(0000000002846C40) 2 f(0000000002845F40) 3 f(0000000002846418)
0000000002845F40: i_get_tuple_element_rPx x(0) 0 x(1)
.....
0000000002846C80: wait_f f(0000000002845F00)
0000000002846C90: badmatch_r x(0)
-----------------
So it's just a waiting loop. I don't see how the process could be running
when the only output for some time was "ALIVE" messages every 15 minutes
from run_erl.
Looks like the only way to see what was going on is to get a complete crash
dump, but it was truncated by heart :-\
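
As a rough illustration of why the listing above reads as a waiting loop, here is a small Python sketch that follows the f(...) labels of the two relevant instructions. It assumes the usual reading of these opcodes (i_loop_rec jumps to its label when the mailbox is empty, wait suspends the process and resumes at its label); the helper names are made up for this example and nothing here is taken from the emulator source.

```python
import re

# Two lines copied from the disassembly above (addresses are hex).
LISTING = """\
0000000002845F00: i_loop_rec_fr f(0000000002846C80) x(0)
0000000002846C80: wait_f f(0000000002845F00)
"""

def parse(listing):
    """Map each instruction address to (opcode, list of f(...) targets)."""
    ops = {}
    for line in listing.splitlines():
        addr, rest = line.split(":", 1)
        opcode = rest.split()[0]
        targets = [int(t, 16) for t in re.findall(r"f\(([0-9A-Fa-f]+)\)", rest)]
        ops[int(addr, 16)] = (opcode, targets)
    return ops

def trace_empty_mailbox(ops, pc, steps=4):
    """Follow the first label of each instruction, i.e. the path assumed
    to be taken while no message is available."""
    path = []
    for _ in range(steps):
        opcode, targets = ops[pc]
        path.append((hex(pc), opcode))
        pc = targets[0]
    return path

ops = parse(LISTING)
path = trace_empty_mailbox(ops, 0x2845F00)
for addr, opcode in path:
    print(addr, opcode)
```

With an empty mailbox the trace just alternates between the two addresses, which matches a process parked in receive rather than one doing work.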

P.S. It's quite strange that crash dump shows +48
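
To put numbers on that: a minimal sketch of the byte distances in the listing, assuming the "+ 48" in the crash dump is meant as a decimal byte offset from the start of the function (an assumption, not checked against the emulator source). The constant names are only for this example.

```python
# Addresses copied from the crash dump and the disassembly in this message.
PC = 0x0000000002845F00           # program counter reported for user_drv
FUNC_INFO = 0x0000000002845E80    # the i_func_info_IaaI line
FIRST_INSTR = 0x0000000002845EA8  # allocate_init_tIy, first line after func_info
REPORTED = 48                     # the "+ 48" printed in the crash dump

from_func_info = PC - FUNC_INFO      # distance from the func_info header
from_first_instr = PC - FIRST_INSTR  # distance from the first real instruction

print(from_func_info, from_first_instr)  # prints: 128 88
print(REPORTED in (from_func_info, from_first_instr))  # prints: False
```

Neither base gives 48, so the printed offset does not correspond directly to the 64-bit addresses shown in the disassembly.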

2011/11/28 <dennis.novikov@REDACTED>

> +48 does not point to an instruction start on a couple of 32-bit systems I
> have access to, so I can not assist you further.
>
> To get an instruction dump named "user_drv.dis" in the beam process working
> directory you can do
>
> erts_debug:df(user_drv).
>
> Happy bug-hunting.
>
>
>
> On Mon, 28 Nov 2011 12:01:17 +0200, Kirill Zaborsky <qrilka@REDACTED>
> wrote:
>
>> I'm using the halfword emulator on 64-bit Ubuntu Server
>> And the process state is not "waiting" but "running". Previous crash dumps
>> show the same program counter value (and user_drv in running state)
>>
>> Kind regards,
>> Kirill Zaborsky
>>
>>
>> 2011/11/28 Dennis Novikov <dennis.novikov@REDACTED>
>>
>>> On Mon, 28 Nov 2011 08:44:42 +0200, Kirill Zaborsky <qrilka@REDACTED>
>>> wrote:
>>>
>>>> Trying to find a workaround for this "stuck node" scenario, I've upgraded
>>>> to R14B04 and turned on "heart".
>>>> But recently the node once again stopped responding, and heart did not
>>>> consider it stuck although I could not contact it.
>>>> I've tried to get a crash dump with 'kill -USR1', but once again the
>>>> crash dump was truncated. Does heart kill a "dead" Erlang node?
>>>> The only thing that could be seen from the crash dump is that the only
>>>> running process was user_drv (just like the previous times), with the
>>>> program counter equal to "user_drv:server_loop/5 + 48". Is it possible
>>>> to find out what exactly that stands for?
>>>>
>>>>
>>> Waiting on receive in that function. And you are observing this on a
>>> 32-bit VM.
>>>
>>> --
>>> WBR,
>>>  DN
>>>
>>>
>
> --
> WBR,
>  DN
>