[erlang-questions] How to diagnose stuck Erlang node

Chandru <>
Mon Nov 28 16:22:17 CET 2011


One option is to receive the trace messages in your own process, perhaps
running on another node, and keep only the last few (the definition of
"few" depending on the hardware available to you).

dbg:tracer(port, dbg:trace_port(ip, PortNumber)).

On the receiving side:
dbg:trace_client(ip, {HostName, PortNumber}, HandlerSpec).
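
For HandlerSpec, something along these lines might do. It is only a rough
sketch: the module name last_traces, the function names and the
{Max, Len, Queue} state are made up for illustration, and it assumes the
handler fun gets called with end_of_trace once the connection to the traced
node goes away.

```erlang
-module(last_traces).
-export([start/3, handler/2]).

%% Hypothetical helper: connect to the ip trace port opened on the traced
%% node and keep only the last Max trace messages in memory.
start(HostName, PortNumber, Max) when is_integer(Max), Max > 0 ->
    HandlerSpec = {fun ?MODULE:handler/2, {Max, 0, queue:new()}},
    dbg:trace_client(ip, {HostName, PortNumber}, HandlerSpec).

%% Called by the trace client for every trace message.
%% State is {Max, Len, Queue}: the limit, the current length, the messages.
handler(end_of_trace, {_, _, Q} = State) ->
    %% Tracing ended: print whatever was kept, oldest message first.
    lists:foreach(fun(Msg) -> io:format("~p~n", [Msg]) end, queue:to_list(Q)),
    State;
handler(Msg, {Max, Len, Q}) when Len < Max ->
    %% Still below the limit: just append.
    {Max, Len + 1, queue:in(Msg, Q)};
handler(Msg, {Max, Len, Q}) ->
    %% At the limit: drop the oldest message, then append the new one.
    {Max, Len, queue:in(Msg, queue:drop(Q))}.
```

Then on the receiving node you would call e.g.
last_traces:start(HostName, PortNumber, 1000). Memory use stays bounded by
Max no matter how long the traced node runs before it gets stuck.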

cheers
Chandru

On 28 November 2011 15:06, Kirill Zaborsky <> wrote:

> Thanks, Ahmed, for your advice, but the problem is that I do not know what
> triggers the problem or when it may happen.
> Putting all user_drv messages into a file may result in a very large file (at
> the moment I'm thinking about decreasing the amount of info going to stdout).
> But maybe it will bring some more details; thanks once again.
>
> Kind regards,
> Kirill Zaborsky
>
>
> 2011/11/28 Ahmed Omar <>
>
>> Kirill,
>> How about using some dbg?
>> Note: you need to be careful not to generate a huge amount of debugging
>> output on your node, which could actually kill it. I don't know much about
>> your system or how heavy the load on the nodes is.
>>
>> You can do something like:
>> dbg:tracer(port, dbg:trace_port(file, "/tmp/usr_drv.dbg")).
>> dbg:p(user_drv, [m, c]).
>> dbg:tpl(user_drv, x).
>>
>> This way you will have a file that contains a trace of all messages and
>> local function calls made in the user_drv process.
>> When the node gets stuck, we can see more details about what happened (note:
>> watch the file size too).
>>
>> On Mon, Nov 28, 2011 at 1:37 PM, Kirill Zaborsky <> wrote:
>>
>>> Thanks, Dennis,
>>> I have created a crash dump on a test machine (using the halfword emulator)
>>> and got user_drv in the waiting state with
>>> Program counter - 0x0000000002845f00 (user_drv:server_loop/5 + 48)
>>> So it's on the same instruction (but not running).
>>>
>>> Disassembly shows:
>>> -----------------
>>> 0000000002845E80: i_func_info_IaaI 0 user_drv server_loop 5
>>> 0000000002845EA8: allocate_init_tIy 7 5 y(0)
>>> 0000000002845EC0: init_y y(1)
>>> 0000000002845ED0: move2_xyxy x(4) y(2) x(3) y(3)
>>> 0000000002845EE0: move2_xyxy x(2) y(4) x(1) y(5)
>>> 0000000002845EF0: move_ry x(0) y(6)
>>> 0000000002845F00: i_loop_rec_fr f(0000000002846C80) x(0)
>>> 0000000002845F10: i_select_tuple_arity2_rfAfAf x(0) f(0000000002846C40) 2 f(0000000002845F40) 3 f(0000000002846418)
>>> 0000000002845F40: i_get_tuple_element_rPx x(0) 0 x(1)
>>> .....
>>> 0000000002846C80: wait_f f(0000000002845F00)
>>> 0000000002846C90: badmatch_r x(0)
>>> -----------------
>>> So it's just a waiting loop. I don't see how the process could be running
>>> when the only output for some time was "ALIVE" messages every 15 minutes
>>> from run_erl.
>>> Looks like the only way to see what was going on is to get a complete
>>> crash dump, but it was truncated by heart :-\
>>>
>>> P.S. It's quite strange that the crash dump shows +48
>>>
>>> 2011/11/28 <>
>>>
>>>> +48 does not point to an instruction start on a couple of 32-bit systems
>>>> I have access to, so I cannot assist you further.
>>>>
>>>> To get an instruction dump named "user_drv.dis" in the beam process's
>>>> working directory, you can do
>>>>
>>>> erts_debug:df(user_drv).
>>>>
>>>> Happy bug-hunting.
>>>>
>>>>
>>>>
>>>> On Mon, 28 Nov 2011 12:01:17 +0200, Kirill Zaborsky <>
>>>> wrote:
>>>>
>>>>> I'm using the halfword emulator on 64-bit Ubuntu Server.
>>>>> And the process state is not "waiting" but "running". Previous crash dumps
>>>>> show the same program counter value (and user_drv in running state).
>>>>>
>>>>> Kind regards,
>>>>> Kirill Zaborsky
>>>>>
>>>>>
>>>>> 2011/11/28 Dennis Novikov <>
>>>>>
>>>>>> On Mon, 28 Nov 2011 08:44:42 +0200, Kirill Zaborsky <> wrote:
>>>>>>
>>>>>>> Trying to find a workaround for this "stuck node" scenario, I've
>>>>>>> upgraded to R14B04 and turned on "heart".
>>>>>>> But recently the node once again stopped responding, and heart did not
>>>>>>> consider it stuck although I could not contact it.
>>>>>>> I tried to get a crash dump with 'kill -USR1', but once again the
>>>>>>> crash dump was truncated. Does heart kill a "dead" Erlang node?
>>>>>>> The only thing that could be seen from the crash dump was that the only
>>>>>>> running process was user_drv (just like the previous times), with the
>>>>>>> program counter equal to "user_drv:server_loop/5 + 48". Is it possible
>>>>>>> to find out what exactly that stands for?
>>>>>>>
>>>>>>>
>>>>>> Waiting on receive in that function. And you are observing this on a
>>>>>> 32-bit VM.
>>>>>>
>>>>>> --
>>>>>> WBR,
>>>>>>  DN
>>>>>>
>>>>>>
>>>>
>>>> --
>>>> WBR,
>>>>  DN
>>>>
>>>
>>>
>>> _______________________________________________
>>> erlang-questions mailing list
>>> 
>>> http://erlang.org/mailman/listinfo/erlang-questions
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> - Ahmed Omar
>> http://nl.linkedin.com/in/adiaa
>> Follow me on twitter
>> @spawn_think <http://twitter.com/#%21/spawn_think>
>>
>>
>