[erlang-questions] How to diagnose stuck Erlang node

Mon Nov 28 17:13:17 CET 2011

It's kind of cheating, as it's not documented (AFAIK).
x is equal to
calling dbg:fun2ms(fun(_)-> exception_trace() end).
which generates the match spec:
[{'_',[],[{exception_trace}]}]
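
Note that dbg:fun2ms/1 only works as-is when typed into the shell. A minimal sketch of using it from a compiled module, assuming a hypothetical module name ms_demo (the include of ms_transform.hrl is what enables the parse transform):

```erlang
-module(ms_demo).
-export([exception_trace_ms/0]).

%% Enables the parse transform that rewrites dbg:fun2ms/1 calls
%% into literal match specs at compile time.
-include_lib("stdlib/include/ms_transform.hrl").

%% Returns the match spec that the shortcut `x` stands for.
exception_trace_ms() ->
    dbg:fun2ms(fun(_) -> exception_trace() end).
```

The result can then be passed to dbg:tpl/2 in place of x.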

From the documentation of match specs
http://www.erlang.org/doc/apps/erts/match_spec.html
*exception_trace:* Same as *return_trace*, plus; if the traced function
exits due to an exception, an exception_from trace message is generated,
whether the exception is caught or not.

example:
37> dbg:tracer().
{ok,<0.85.0>}
38> dbg:tpl(calendar, x).
{ok,[{matched,nonode@REDACTED,39},{saved,x}]}
39> dbg:p(all,c).
{ok,[{matched,nonode@REDACTED,26}]}
40> calendar:local_time_to_universal_time(33).
(<0.83.0>) call calendar:local_time_to_universal_time(33)
(<0.83.0>) exception_from {calendar,local_time_to_universal_time,1}
{error,badarg}
** exception error: bad argument
     in function  erlang:localtime_to_universaltime/2
        called as erlang:localtime_to_universaltime(33,undefined)
     in call from erlang:localtime_to_universaltime/1

I think it's better to use the documented form :)
dbg:tpl(user_drv, dbg:fun2ms(fun(_)-> exception_trace() end)).
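
Putting this together with the file tracer from my earlier mail (quoted below), a rough sketch could look like the module here. The module name user_drv_trace and its function names are made up for illustration, and I have not run this against a stuck node. It assumes user_drv is a registered process on the node, and it writes the match spec out literally so no parse transform is needed:

```erlang
-module(user_drv_trace).
-export([start/1, stop/0, replay/1, match_spec/0]).

%% The match spec from this thread, written out literally.
match_spec() ->
    [{'_', [], [{exception_trace}]}].

%% Trace messages and local calls of user_drv into a binary trace file.
start(File) ->
    {ok, _} = dbg:tracer(port, dbg:trace_port(file, File)),
    {ok, _} = dbg:tpl(user_drv, match_spec()),
    %% m = messages, c = calls
    {ok, _} = dbg:p(user_drv, [m, c]),
    ok.

%% Stop tracing and clear all trace patterns and flags.
stop() ->
    dbg:stop_clear().

%% Print the contents of a previously written trace file.
replay(File) ->
    dbg:trace_client(file, File).
```

Usage would be user_drv_trace:start("/tmp/usr_drv.dbg")., then user_drv_trace:stop(). once the problem has shown up, and user_drv_trace:replay("/tmp/usr_drv.dbg"). to read the file. On newer OTP releases dbg:stop/0 may be the preferred call instead of dbg:stop_clear/0.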

On Mon, Nov 28, 2011 at 4:38 PM, Kirill Zaborsky <qrilka@REDACTED> wrote:

> Ahmed, minor question: what does 'x' mean in dbg:tpl(user_drv, x)? From
> the documentation I do not see what this match specification may mean.
>
> Kind regards,
> Kirill Zaborsky
>
>
> 2011/11/28 Ahmed Omar <spawn.think@REDACTED>
>
>> Kirill,
>> How about using some dbg?
>> Note: be careful not to generate so much trace output on your node
>> that it actually kills the node. I don't know much about your
>> system and how heavy the load on the nodes is.
>>
>> You can do something like :
>> dbg:tracer(port, dbg:trace_port(file, "/tmp/usr_drv.dbg")).
>> dbg:p(user_drv, [m, c]).
>> dbg:tpl(user_drv, x).
>>
>> This way you will have a file that contains a trace of all messages and
>> local function calls in the user_drv process.
>> When the node gets stuck we can see more details about what happened (note:
>> watch the file size too).
>>
>> On Mon, Nov 28, 2011 at 1:37 PM, Kirill Zaborsky <qrilka@REDACTED> wrote:
>>
>>> Thanks, Dennis,
>>> I have created a crash dump on a test machine (using the halfword emulator)
>>> and found user_drv in waiting state with
>>> Program counter - 0x0000000002845f00 (user_drv:server_loop/5 + 48)
>>> So it's on the same instruction (but not running)
>>>
>>> Disassembly shows:
>>> -----------------
>>> 0000000002845E80: i_func_info_IaaI 0 user_drv server_loop 5
>>> 0000000002845EA8: allocate_init_tIy 7 5 y(0)
>>> 0000000002845EC0: init_y y(1)
>>> 0000000002845ED0: move2_xyxy x(4) y(2) x(3) y(3)
>>> 0000000002845EE0: move2_xyxy x(2) y(4) x(1) y(5)
>>> 0000000002845EF0: move_ry x(0) y(6)
>>> 0000000002845F00: i_loop_rec_fr f(0000000002846C80) x(0)
>>> 0000000002845F10: i_select_tuple_arity2_rfAfAf x(0) f(0000000002846C40) 2 f(0000000002845F40) 3 f(0000000002846418)
>>> 0000000002845F40: i_get_tuple_element_rPx x(0) 0 x(1)
>>> .....
>>> 0000000002846C80: wait_f f(0000000002845F00)
>>> 0000000002846C90: badmatch_r x(0)
>>> -----------------
>>> So it's just a waiting loop. I don't see how the process could be running
>>> when the only output for some time was "ALIVE" messages every 15 minutes
>>> from run_erl.
>>> Looks like the only way to see what was going on is to get a complete
>>> crash dump, but it was truncated by heart :-\
>>>
>>> P.S. It's quite strange that the crash dump shows +48
>>>
>>> 2011/11/28 <dennis.novikov@REDACTED>
>>>
>>>> +48 does not point to an instruction start on a couple of 32-bit systems
>>>> I have access to, so I can not assist you further.
>>>>
>>>> To get an instruction dump named "user_drv.dis" in the beam process
>>>> working directory you can do
>>>>
>>>> erts_debug:df(user_drv).
>>>>
>>>> Happy bug-hunting.
>>>>
>>>>
>>>>
>>>> On Mon, 28 Nov 2011 12:01:17 +0200, Kirill Zaborsky <qrilka@REDACTED>
>>>> wrote:
>>>>
>>>>> I'm using the halfword emulator on 64-bit Ubuntu Server.
>>>>> And the process state is not "waiting" but "running". Previous crash
>>>>> dumps show the same program counter value (and user_drv in running state)
>>>>>
>>>>> Kind regards,
>>>>> Kirill Zaborsky
>>>>>
>>>>>
>>>>> 2011/11/28 Dennis Novikov <dennis.novikov@REDACTED>
>>>>>
>>>>>> On Mon, 28 Nov 2011 08:44:42 +0200, Kirill Zaborsky <qrilka@REDACTED>
>>>>>> wrote:
>>>>>>
>>>>>>> Trying to find any workaround to this "stuck node" scenario I've
>>>>>>> upgraded to R14B04 and turned on "heart".
>>>>>>> But recently the node once again stopped responding. And heart did
>>>>>>> not assume it to be stuck although I could not contact it.
>>>>>>> I've tried to get a crash dump with 'kill -USR1' but it appeared
>>>>>>> that once again the crash dump was truncated. Does heart kill a
>>>>>>> "dead" erlang node?
>>>>>>> And the only thing that could be seen from the crash dump was that
>>>>>>> the only running process was user_drv (just like in previous times)
>>>>>>> with program counter equal to "user_drv:server_loop/5 + 48". Is it
>>>>>>> possible to find out what exactly it stands for?
>>>>>>>
>>>>>>>
>>>>>> Waiting on receive in that function. And you are observing this on a
>>>>>> 32-bit VM.
>>>>>>
>>>>>> --
>>>>>> WBR,
>>>>>>  DN
>>>>>>
>>>>>>
>>>>
>>>> --
>>>> WBR,
>>>>  DN
>>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> - Ahmed Omar
>> http://nl.linkedin.com/in/adiaa
>> Follow me on twitter
>> @spawn_think <http://twitter.com/#!/spawn_think>
>>
>>
>

-- 
Best Regards,
- Ahmed Omar
http://nl.linkedin.com/in/adiaa
Follow me on twitter
@spawn_think <http://twitter.com/#!/spawn_think>