[erlang-questions] How to diagnose stuck Erlang node

Raimo Niskanen raimo+erlang-questions@REDACTED
Mon Nov 28 17:33:54 CET 2011


On Mon, Nov 28, 2011 at 05:13:17PM +0100, Ahmed Omar wrote:
> It's kind of cheating, as it's not documented (AFAIK).
> x is equal to
> calling dbg:fun2ms(fun(_)-> exception_trace() end).
> which generates MatchSpec :
> [{'_',[],[{exception_trace}]}]
> 
> from documentation of match specs
> http://www.erlang.org/doc/apps/erts/match_spec.html
> *exception_trace:* Same as *return_trace*, plus; if the traced function
> exits due to an exception, an exception_from trace message is generated,
> whether the exception is caught or not.

Well, if you try dbg:ltp() you will see all your saved match specs
along with the "predefined" match specs, and this is actually
documented in dbg.
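
A minimal sketch of the above, meant to be pasted into an Erlang shell.
It lists the saved and predefined match specs and checks that the
fun2ms form quoted earlier expands to the match spec shown there. The
variable name MS is just for illustration, and the printed format of
dbg:ltp() differs between releases, so no output is shown.

```erlang
%% Start the default tracer; if one is already running this returns
%% {error, already_started}, which is harmless here.
dbg:tracer(),
%% Print all saved match specs together with the predefined ones.
ok = dbg:ltp(),
%% In the shell, fun2ms translates the literal fun into a match spec.
MS = dbg:fun2ms(fun(_) -> exception_trace() end),
%% The same match spec as quoted in the mail above.
[{'_',[],[{exception_trace}]}] = MS,
%% Stop the tracer again.
dbg:stop().
```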

And, the predefined match specs will most probably not be removed
nor changed. Not without very good reason.

So it is not cheating.

In a recent system you will also find predefined match specs
with the short abbreviations 'c' and 'cx'. Have a look.
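
As a rough sketch, assuming a release recent enough to have 'cx'
(on older systems dbg:tpl/2 would return an error tuple instead), the
shell expressions below set a 'cx' trace on the calendar module and
provoke the same bad-argument call as in the quoted example. Result is
an illustrative variable name, and the exact trace lines printed depend
on the release.

```erlang
%% Start the default tracer (ignore {error, already_started}).
dbg:tracer(),
%% 'cx' is the short alias for caller_exception_trace.
{ok, _} = dbg:tpl(calendar, cx),
%% Trace function calls ('c' = call flag) in the shell process only.
{ok, _} = dbg:p(self(), c),
%% Provoke an exception; catch keeps the shell process alive.
Result = (catch calendar:local_time_to_universal_time(33)),
%% Clean up: remove the trace patterns and stop the tracer.
dbg:ctpl(calendar),
dbg:stop().
```

Limiting dbg:p/2 to self() instead of 'all' keeps the amount of trace
output small, which matters on a loaded node.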

> 
> example:
> 37> dbg:tracer().
> {ok,<0.85.0>}
> 38> dbg:tpl(calendar, x).
> {ok,[{matched,nonode@REDACTED,39},{saved,x}]}
> 39> dbg:p(all,c).
> {ok,[{matched,nonode@REDACTED,26}]}
> 40> calendar:local_time_to_universal_time(33).
> (<0.83.0>) call calendar:local_time_to_universal_time(33)
> (<0.83.0>) exception_from {calendar,local_time_to_universal_time,1}
> {error,badarg}
> ** exception error: bad argument
>      in function  erlang:localtime_to_universaltime/2
>         called as erlang:localtime_to_universaltime(33,undefined)
>      in call from erlang:localtime_to_universaltime/1
> 
> 
> I think it is better to use the documented form :)
> dbg:tpl(user_drv, dbg:fun2ms(fun(_)-> exception_trace() end)).
> 
> 
> On Mon, Nov 28, 2011 at 4:38 PM, Kirill Zaborsky <qrilka@REDACTED> wrote:
> 
> > Ahmed, minor question: what does 'x' mean in dbg:tpl(user_drv, x)? From
> > the documentation I do not see what this match specification may mean.
> >
> > Kind regards,
> > Kirill Zaborsky
> >
> >
> > 2011/11/28 Ahmed Omar <spawn.think@REDACTED>
> >
> >> Kirill,
> >> How about using some dbg?
> >> Note: you need to be careful not to generate a huge amount of trace output
> >> on your node, as that could actually kill it. I don't know much about your
> >> system and how heavy the load on the nodes is.
> >>
> >> You can do something like :
> >> dbg:tracer(port, dbg:trace_port(file, "/tmp/usr_drv.dbg")).
> >> dbg:p(user_drv, [m, c]).
> >> dbg:tpl(user_drv, x).
> >>
> >> This way you will have a file that contains a trace of all messages and
> >> local function calls in the user_drv process.
> >> When the node gets stuck we can see more details about what happened
> >> (note: watch the file size too).
> >>
> >> On Mon, Nov 28, 2011 at 1:37 PM, Kirill Zaborsky <qrilka@REDACTED> wrote:
> >>
> >>> Thanks, Dennis,
> >>> I have created a crash dump on a test machine (using halfword emulator)
> >>> and received user_drv in waiting state with
> >>> Program counter - 0x0000000002845f00 (user_drv:server_loop/5 + 48)
> >>> So it's on the same instruction (but not running)
> >>>
> >>> Disassembly shows:
> >>> -----------------
> >>> 0000000002845E80: i_func_info_IaaI 0 user_drv server_loop 5
> >>> 0000000002845EA8: allocate_init_tIy 7 5 y(0)
> >>> 0000000002845EC0: init_y y(1)
> >>> 0000000002845ED0: move2_xyxy x(4) y(2) x(3) y(3)
> >>> 0000000002845EE0: move2_xyxy x(2) y(4) x(1) y(5)
> >>> 0000000002845EF0: move_ry x(0) y(6)
> >>> 0000000002845F00: i_loop_rec_fr f(0000000002846C80) x(0)
> >>> 0000000002845F10: i_select_tuple_arity2_rfAfAf x(0) f(0000000002846C40) 2 f(0000000002845F40) 3 f(0000000002846418)
> >>> 0000000002845F40: i_get_tuple_element_rPx x(0) 0 x(1)
> >>> .....
> >>> 0000000002846C80: wait_f f(0000000002845F00)
> >>> 0000000002846C90: badmatch_r x(0)
> >>> -----------------
> >>> So it's just a waiting loop. I don't see how the process could be running
> >>> when the only output for some time was "ALIVE" messages every 15 minutes
> >>> from run_erl.
> >>> Looks like the only way to see what was going on is to get a complete
> >>> crash dump, but it was truncated by heart :-\
> >>>
> >>> P.S. It's quite strange that crash dump shows +48
> >>>
> >>> 2011/11/28 <dennis.novikov@REDACTED>
> >>>
> >>>> +48 does not point to an instruction start on a couple of 32-bit systems
> >>>> I have access to, so I can not assist you further.
> >>>>
> >>>> To get instructions dump named "user_drv.dis" in the beam process
> >>>> working directory you can do
> >>>>
> >>>> erts_debug:df(user_drv).
> >>>>
> >>>> Happy bug-hunting.
> >>>>
> >>>>
> >>>>
> >>>> On Mon, 28 Nov 2011 12:01:17 +0200, Kirill Zaborsky <qrilka@REDACTED>
> >>>> wrote:
> >>>>
> >>>>> I'm using the halfword emulator on 64-bit Ubuntu Server.
> >>>>> And the process state is not "waiting" but "running". Previous crash
> >>>>> dumps show the same program counter value (and user_drv in running
> >>>>> state)
> >>>>>
> >>>>> Kind regards,
> >>>>> Kirill Zaborsky
> >>>>>
> >>>>>
> >>>>> 2011/11/28 Dennis Novikov <dennis.novikov@REDACTED>
> >>>>>
> >>>>>> On Mon, 28 Nov 2011 08:44:42 +0200, Kirill Zaborsky <qrilka@REDACTED>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Trying to find any workaround to this "stuck node" scenario I've
> >>>>>>> upgraded to R14B04 and turned on "heart".
> >>>>>>> But recently the node once again stopped responding. And heart did
> >>>>>>> not assume it to be stuck although I could not contact it.
> >>>>>>> I've tried to get a crash dump with 'kill -USR1' but it appeared
> >>>>>>> that once again the crash dump was truncated. Does heart kill a
> >>>>>>> "dead" Erlang node?
> >>>>>>> And the only thing that could be seen from the crash dump was that
> >>>>>>> the only running process was user_drv (just like the previous times)
> >>>>>>> with program counter equal to "user_drv:server_loop/5 + 48". Is it
> >>>>>>> possible to find out what exactly it stands for?
> >>>>>>>
> >>>>>>>
> >>>>>> Waiting on receive in that function. And you are observing this on a
> >>>>>> 32-bit VM.
> >>>>>>
> >>>>>> --
> >>>>>> WBR,
> >>>>>>  DN
> >>>>>>
> >>>>>>
> >>>>
> >>>> --
> >>>> WBR,
> >>>>  DN
> >>>>
> >>>
> >>>
> >>> _______________________________________________
> >>> erlang-questions mailing list
> >>> erlang-questions@REDACTED
> >>> http://erlang.org/mailman/listinfo/erlang-questions
> >>>
> >>>
> >>
> >>
> >> --
> >> Best Regards,
> >> - Ahmed Omar
> >> http://nl.linkedin.com/in/adiaa
> >> Follow me on twitter
> >> @spawn_think <http://twitter.com/#!/spawn_think>
> >>
> >>
> >
> 
> 
> -- 
> Best Regards,
> - Ahmed Omar
> http://nl.linkedin.com/in/adiaa
> Follow me on twitter
> @spawn_think <http://twitter.com/#!/spawn_think>



-- 

/ Raimo Niskanen, Erlang/OTP, Ericsson AB


