[erlang-questions] How to diagnose stuck Erlang node

Kirill Zaborsky qrilka@REDACTED
Mon Nov 28 16:38:54 CET 2011


Ahmed, a minor question: what does 'x' mean in dbg:tpl(user_drv, x)? From
the documentation I cannot tell what this match specification may mean.

Kind regards,
Kirill Zaborsky
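
Note on the question above: as far as the dbg documentation describes it, x in
this position is not an ordinary match specification but a built-in alias that
stands for a match specification with the exception_trace action (c and cx are
the related aliases for caller_trace and for both). A minimal sketch of the
same session with the alias spelled out, followed by one way to stop tracing
and read the file back. The file path is the one from the quoted example; the
stop and read-back steps are an assumption about how one might proceed, not
something stated in the thread, and nothing here was checked against R14B04
specifically.

```erlang
%% Run in the Erlang shell of the node under investigation.
%% Write binary trace data to a file instead of the shell.
dbg:tracer(port, dbg:trace_port(file, "/tmp/usr_drv.dbg")).

%% Trace messages (m) and function calls (c) for the registered process user_drv.
dbg:p(user_drv, [m, c]).

%% Explicit form of what the alias x is understood to mean: match any
%% arguments, no guards, and report return values and exceptions.
dbg:tpl(user_drv, [{'_', [], [{exception_trace}]}]).

%% Later (assumed follow-up): stop tracing and clear all trace patterns ...
dbg:stop_clear().

%% ... then format the collected trace file to the shell.
dbg:trace_client(file, "/tmp/usr_drv.dbg").
```

Because exception_trace also reports return values and exceptions, it produces
more output per call than a bare call trace, so the warning about file size in
the quoted message applies even more.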

2011/11/28 Ahmed Omar <spawn.think@REDACTED>

> Kirill,
> How about using some dbg?
> Note: you need to be careful not to generate a huge amount of debug output on
> your node, which could actually kill it. I don't know much about your
> system or how heavy the load on the nodes is.
>
> You can do something like :
> dbg:tracer(port, dbg:trace_port(file, "/tmp/usr_drv.dbg")).
> dbg:p(user_drv, [m, c]).
> dbg:tpl(user_drv, x).
>
> This way you will have a file that contains a trace of all messages and
> local function calls made in the user_drv process.
> When the node gets stuck, we can see more details about what happened (note:
> watch the file size too).
>
> On Mon, Nov 28, 2011 at 1:37 PM, Kirill Zaborsky <qrilka@REDACTED> wrote:
>
>> Thanks, Dennis,
>> I have created a crash dump on a test machine (using the halfword emulator)
>> and got user_drv in the waiting state with
>> Program counter - 0x0000000002845f00 (user_drv:server_loop/5 + 48)
>> So it's at the same instruction (but not running).
>>
>> Disassembly shows:
>> -----------------
>> 0000000002845E80: i_func_info_IaaI 0 user_drv server_loop 5
>> 0000000002845EA8: allocate_init_tIy 7 5 y(0)
>> 0000000002845EC0: init_y y(1)
>> 0000000002845ED0: move2_xyxy x(4) y(2) x(3) y(3)
>> 0000000002845EE0: move2_xyxy x(2) y(4) x(1) y(5)
>> 0000000002845EF0: move_ry x(0) y(6)
>> 0000000002845F00: i_loop_rec_fr f(0000000002846C80) x(0)
>> 0000000002845F10: i_select_tuple_arity2_rfAfAf x(0) f(0000000002846C40) 2 f(0000000002845F40) 3 f(0000000002846418)
>> 0000000002845F40: i_get_tuple_element_rPx x(0) 0 x(1)
>> .....
>> 0000000002846C80: wait_f f(0000000002845F00)
>> 0000000002846C90: badmatch_r x(0)
>> -----------------
>> So it's just a waiting loop. I don't see how the process could be running
>> when the only output for some time was "ALIVE" messages every 15 minutes
>> from run_erl.
>> It looks like the only way to see what was going on is to get a complete
>> crash dump, but it was truncated by heart :-\
>>
>> P.S. It's quite strange that the crash dump shows +48
>>
>> 2011/11/28 <dennis.novikov@REDACTED>
>>
>>> +48 does not point to an instruction start on a couple of 32-bit systems
>>> I have access to, so I cannot assist you further.
>>>
>>> To get an instruction dump named "user_drv.dis" in the beam process's
>>> working directory, you can run
>>>
>>> erts_debug:df(user_drv).
>>>
>>> Happy bug-hunting.
>>>
>>>
>>>
>>> On Mon, 28 Nov 2011 12:01:17 +0200, Kirill Zaborsky <qrilka@REDACTED>
>>> wrote:
>>>
>>>> I'm using the halfword emulator on 64-bit Ubuntu Server.
>>>> And the process state is not "waiting" but "running". Previous crash dumps
>>>> show the same program counter value (and user_drv in the running state).
>>>>
>>>> Kind regards,
>>>> Kirill Zaborsky
>>>>
>>>>
>>>> 2011/11/28 Dennis Novikov <dennis.novikov@REDACTED>
>>>>
>>>>  On Mon, 28 Nov 2011 08:44:42 +0200, Kirill Zaborsky <qrilka@REDACTED>
>>>>> wrote:
>>>>>
>>>>>> Trying to find a workaround for this "stuck node" scenario, I've upgraded
>>>>>> to R14B04 and turned on "heart".
>>>>>> But recently the node once again stopped responding, and heart did not
>>>>>> consider it stuck although I could not contact it.
>>>>>> I tried to get a crash dump with 'kill -USR1', but once again the
>>>>>> crash dump was truncated. Does heart kill a "dead" Erlang node?
>>>>>> The only thing that could be seen from the crash dump was that the only
>>>>>> running process was user_drv (just like the previous times), with program
>>>>>> counter equal to "user_drv:server_loop/5 + 48". Is it possible to find out
>>>>>> what exactly that stands for?
>>>>>>
>>>>>>
>>>>> Waiting on receive in that function. And you are observing this on a
>>>>> 32-bit VM.
>>>>>
>>>>> --
>>>>> WBR,
>>>>>  DN
>>>>>
>>>>>
>>>
>>> --
>>> WBR,
>>>  DN
>>>
>>
>>
>>
>>
>
>
> --
> Best Regards,
> - Ahmed Omar
> http://nl.linkedin.com/in/adiaa
> Follow me on twitter
> @spawn_think <http://twitter.com/#!/spawn_think>
>
>