[erlang-questions] How to diagnose stuck Erlang node

Ahmed Omar spawn.think@REDACTED
Mon Nov 28 15:54:21 CET 2011


Kirill,
How about using dbg?
Note: be careful not to generate so much trace output that it overloads and
actually kills your node. I don't know much about your system or how heavy
the load on its nodes is.

You can do something like:
dbg:tracer(port, dbg:trace_port(file, "/tmp/usr_drv.dbg")).
dbg:p(user_drv, [m, c]).
dbg:tpl(user_drv, x).

This way you will have a file containing a trace of all messages and
local function calls made in the user_drv process.
When the node gets stuck, we can see more details about what happened (note:
watch the file size too).
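
If the file size is a concern, a wrap-file trace port is one way to bound it.
The following is only a sketch: the "/tmp/usr_drv" prefix, the ".dbg" suffix,
the 10 MB size and the file count of 8 are assumptions to adjust, and the
timestamp flag is my addition so that events can be lined up with the time
of the hang.

```erlang
%% Sketch only: file name prefix, suffix and limits below are assumptions.
%% Trace into a bounded set of wrap files (prefix ++ counter ++ suffix);
%% once the limits are reached the oldest data is overwritten instead of
%% the disk filling up.
PortFun = dbg:trace_port(file, {"/tmp/usr_drv", wrap, ".dbg", 10*1024*1024, 8}).
dbg:tracer(port, PortFun).

%% Same flags as above (messages and calls), plus timestamps on each event.
dbg:p(user_drv, [m, c, timestamp]).
dbg:tpl(user_drv, x).

%% Later, to print the collected trace in the shell:
dbg:trace_client(file, {"/tmp/usr_drv", wrap, ".dbg"}).

%% When done, stop the tracer and clear all trace patterns.
dbg:stop_clear().
```

The wrap files are binary, so they need dbg:trace_client to be read rather
than a text viewer.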

On Mon, Nov 28, 2011 at 1:37 PM, Kirill Zaborsky <qrilka@REDACTED> wrote:

> Thanks, dennis,
> I have created a crash dump on a test machine (using halfword emulator)
> and received user_drv in waiting state with
> Program counter - 0x0000000002845f00 (user_drv:server_loop/5 + 48)
> So it's on the same instruction (but not running)
>
> Disassembly shows:
> -----------------
> 0000000002845E80: i_func_info_IaaI 0 user_drv server_loop 5
> 0000000002845EA8: allocate_init_tIy 7 5 y(0)
> 0000000002845EC0: init_y y(1)
> 0000000002845ED0: move2_xyxy x(4) y(2) x(3) y(3)
> 0000000002845EE0: move2_xyxy x(2) y(4) x(1) y(5)
> 0000000002845EF0: move_ry x(0) y(6)
> 0000000002845F00: i_loop_rec_fr f(0000000002846C80) x(0)
> 0000000002845F10: i_select_tuple_arity2_rfAfAf x(0) f(0000000002846C40) 2 f(0000000002845F40) 3 f(0000000002846418)
> 0000000002845F40: i_get_tuple_element_rPx x(0) 0 x(1)
> .....
> 0000000002846C80: wait_f f(0000000002845F00)
> 0000000002846C90: badmatch_r x(0)
> -----------------
> So it's just a waiting loop. I don't see how the process could be running
> when the only output for some time was "ALIVE" messages every 15 minutes
> from run_erl.
> It looks like the only way to see what was going on is to get a complete crash
> dump, but it was truncated by heart :-\
>
> P.S. It's quite strange that crash dump shows +48
>
> 2011/11/28 <dennis.novikov@REDACTED>
>
>> +48 does not point to an instruction start on a couple of 32-bit systems I
>> have access to, so I cannot assist you further.
>>
>> To get an instruction dump named "user_drv.dis" in the working directory
>> of the beam process, you can run
>>
>> erts_debug:df(user_drv).
>>
>> Happy bug-hunting.
>>
>>
>>
>> On Mon, 28 Nov 2011 12:01:17 +0200, Kirill Zaborsky <qrilka@REDACTED>
>> wrote:
>>
>>> I'm using the halfword emulator on 64-bit Ubuntu Server.
>>> And the process state is not "waiting" but "running". Previous crash dumps
>>> show the same program counter value (and user_drv in running state).
>>>
>>> Kind regards,
>>> Kirill Zaborsky
>>>
>>>
>>> 2011/11/28 Dennis Novikov <dennis.novikov@REDACTED>
>>>
>>>  On Mon, 28 Nov 2011 08:44:42 +0200, Kirill Zaborsky <qrilka@REDACTED>
>>>> wrote:
>>>>
>>>>> Trying to find a workaround for this "stuck node" scenario, I've upgraded
>>>>> to R14B04 and turned on "heart".
>>>>> But recently the node once again stopped responding, and heart did not
>>>>> consider it stuck although I could not contact it.
>>>>> I tried to get a crash dump with 'kill -USR1', but once again the crash
>>>>> dump was truncated. Does heart kill a "dead" Erlang node?
>>>>> The only thing that could be seen from the crash dump was that the only
>>>>> running process was user_drv (just like the previous times), with a program
>>>>> counter equal to "user_drv:server_loop/5 + 48". Is it possible to find out
>>>>> what exactly that stands for?
>>>>>
>>>>>
>>>> Waiting on receive in that function. And you are observing this on a
>>>> 32-bit VM.
>>>>
>>>> --
>>>> WBR,
>>>>  DN
>>>>
>>>>
>>
>> --
>> WBR,
>>  DN
>>
>
>


-- 
Best Regards,
- Ahmed Omar
http://nl.linkedin.com/in/adiaa
Follow me on twitter
@spawn_think <http://twitter.com/#!/spawn_think>