[erlang-questions] How to diagnose stuck Erlang node

Ahmed Omar spawn.think@REDACTED
Mon Nov 28 19:54:14 CET 2011


That's nice to know, thanks Raimo :)

On Mon, Nov 28, 2011 at 5:33 PM, Raimo Niskanen <
raimo+erlang-questions@REDACTED> wrote:

> On Mon, Nov 28, 2011 at 05:13:17PM +0100, Ahmed Omar wrote:
> > It's kind of cheating, as it's not documented (AFAIK).
> > x is equal to calling
> > dbg:fun2ms(fun(_)-> exception_trace() end).
> > which generates the match spec:
> > [{'_',[],[{exception_trace}]}]
> >
> > from documentation of match specs
> > http://www.erlang.org/doc/apps/erts/match_spec.html
> > *exception_trace:* Same as *return_trace*, plus; if the traced function
> > exits due to an exception, an exception_from trace message is generated,
> > whether the exception is caught or not.
>
> Well, if you try dbg:ltp() you will see all your saved match specs
> along with the "predefined" match specs, and this is actually
> documented in dbg.
>
> And, the predefined match specs will most probably not be removed
> nor changed. Not without very good reason.
>
> So it is not cheating.
>
> In a recent system you will also find predefined match specs
> with the short abbreviations 'c' and 'cx'. Have a look.
>
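
To look at the predefined match specs Raimo mentions, something like the
following can be pasted into the Erlang shell. It is only a sketch: it assumes
runtime_tools is available and that the release is recent enough to have the
'cx' alias, and calendar:local_time/0 is used purely as a harmless call to
trace.

```erlang
%% Print every saved match spec, including the predefined aliases.
dbg:ltp().

%% Trace local calls in calendar using the predefined 'cx' alias.
dbg:tracer().
dbg:tpl(calendar, cx).
%% Trace only the shell process to keep the output small.
dbg:p(self(), c).
calendar:local_time().

%% Stop tracing and clear all trace patterns when done.
%% (dbg:stop_clear/0 matches the OTP releases discussed in this thread;
%% newer releases may prefer dbg:stop/0.)
dbg:stop_clear().
```

Tracing only self() rather than all keeps the amount of trace output small,
which matters on a loaded node.
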
> >
> > example:
> > 37> dbg:tracer().
> > {ok,<0.85.0>}
> > 38> dbg:tpl(calendar, x).
> > {ok,[{matched,nonode@REDACTED,39},{saved,x}]}
> > 39> dbg:p(all,c).
> > {ok,[{matched,nonode@REDACTED,26}]}
> > 40> calendar:local_time_to_universal_time(33).
> > (<0.83.0>) call calendar:local_time_to_universal_time(33)
> > (<0.83.0>) exception_from {calendar,local_time_to_universal_time,1}
> > {error,badarg}
> > ** exception error: bad argument
> >      in function  erlang:localtime_to_universaltime/2
> >         called as erlang:localtime_to_universaltime(33,undefined)
> >      in call from erlang:localtime_to_universaltime/1
> >
> >
> > I think it's better to use the documented form :)
> > dbg:tpl(user_drv, dbg:fun2ms(fun(_)-> exception_trace() end)).
> >
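
Note that dbg:fun2ms/1 works as written only in the shell. In a compiled
module the call has to be expanded by the ms_transform parse transform. A
minimal sketch, where the module and function names are made up for
illustration:

```erlang
%% Hypothetical helper module; the names are not from this thread.
-module(trace_ms_example).
%% Needed so that dbg:fun2ms/1 is expanded at compile time.
-include_lib("stdlib/include/ms_transform.hrl").
-export([exception_ms/0]).

%% Returns the same match spec as the predefined 'x' alias.
exception_ms() ->
    dbg:fun2ms(fun(_) -> exception_trace() end).
```

It could then be used as dbg:tpl(user_drv, trace_ms_example:exception_ms()).
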
> >
> > On Mon, Nov 28, 2011 at 4:38 PM, Kirill Zaborsky <qrilka@REDACTED> wrote:
> >
> > > Ahmed, minor question: what does 'x' mean in dbg:tpl(user_drv, x)? From
> > > the documentation I do not see what this match specification may mean.
> > >
> > > Kind regards,
> > > Kirill Zaborsky
> > >
> > >
> > > 2011/11/28 Ahmed Omar <spawn.think@REDACTED>
> > >
> > >> Kirill,
> > >> How about using some dbg?
> > >> Note: you need to be careful not to generate a huge amount of debugging
> > >> output on your node, which could actually kill it. I don't know much
> > >> about your system or how heavy the load on the nodes is.
> > >>
> > >> You can do something like :
> > >> dbg:tracer(port, dbg:trace_port(file, "/tmp/usr_drv.dbg")).
> > >> dbg:p(user_drv, [m, c]).
> > >> dbg:tpl(user_drv, x).
> > >>
> > >> This way you will have a file that contains a trace of all messages and
> > >> local function calls in the user_drv process.
> > >> When the node gets stuck we can see more details about what happened
> > >> (note: watch the file size too).
> > >>
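
One way to keep the file size bounded is a wrapping set of trace files, which
can be replayed afterwards with dbg:trace_client/2. This is a sketch only: the
base name, the size limit and the file count below are arbitrary assumptions,
not values suggested anywhere in this thread.

```erlang
%% Assumed limits: at most 5 files of roughly 10 MB each.
%% Files are named "/tmp/usr_drv" ++ Index ++ ".dbg".
PortFun = dbg:trace_port(file, {"/tmp/usr_drv", wrap, ".dbg", 10000000, 5}).
dbg:tracer(port, PortFun).
dbg:p(user_drv, [m, c]).
dbg:tpl(user_drv, x).

%% Later, once enough has been captured: stop tracing and clear patterns.
%% (Newer OTP releases may prefer dbg:stop/0.)
dbg:stop_clear().

%% Replay the wrapped trace files as formatted text in the shell.
dbg:trace_client(file, {"/tmp/usr_drv", wrap, ".dbg"}).
```

With wrap files the oldest data is overwritten, so the trace should be read
soon after the node gets stuck.
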
> > >> On Mon, Nov 28, 2011 at 1:37 PM, Kirill Zaborsky <qrilka@REDACTED> wrote:
> > >>
> > >>> Thanks, Dennis,
> > >>> I have created a crash dump on a test machine (using the halfword
> > >>> emulator) and got user_drv in the waiting state with
> > >>> Program counter - 0x0000000002845f00 (user_drv:server_loop/5 + 48)
> > >>> So it's on the same instruction (but not running)
> > >>>
> > >>> Disassembly shows:
> > >>> -----------------
> > >>> 0000000002845E80: i_func_info_IaaI 0 user_drv server_loop 5
> > >>> 0000000002845EA8: allocate_init_tIy 7 5 y(0)
> > >>> 0000000002845EC0: init_y y(1)
> > >>> 0000000002845ED0: move2_xyxy x(4) y(2) x(3) y(3)
> > >>> 0000000002845EE0: move2_xyxy x(2) y(4) x(1) y(5)
> > >>> 0000000002845EF0: move_ry x(0) y(6)
> > >>> 0000000002845F00: i_loop_rec_fr f(0000000002846C80) x(0)
> > >>> 0000000002845F10: i_select_tuple_arity2_rfAfAf x(0) f(0000000002846C40) 2 f(0000000002845F40) 3 f(0000000002846418)
> > >>> 0000000002845F40: i_get_tuple_element_rPx x(0) 0 x(1)
> > >>> .....
> > >>> 0000000002846C80: wait_f f(0000000002845F00)
> > >>> 0000000002846C90: badmatch_r x(0)
> > >>> -----------------
> > >>> So it's just a waiting loop. I don't see how the process could be
> > >>> running when the only output for some time was "ALIVE" messages every
> > >>> 15 minutes from run_erl.
> > >>> Looks like the only way to see what was going on is to get a complete
> > >>> crash dump, but it was truncated by heart :-\
> > >>>
> > >>> P.S. It's quite strange that crash dump shows +48
> > >>>
> > >>> 2011/11/28 <dennis.novikov@REDACTED>
> > >>>
> > >>>> +48 does not point to an instruction start on a couple of 32-bit
> > >>>> systems I have access to, so I can not assist you further.
> > >>>>
> > >>>> To get instructions dump named "user_drv.dis" in the beam process
> > >>>> working directory you can do
> > >>>>
> > >>>> erts_debug:df(user_drv).
> > >>>>
> > >>>> Happy bug-hunting.
> > >>>>
> > >>>>
> > >>>>
> > >>>> On Mon, 28 Nov 2011 12:01:17 +0200, Kirill Zaborsky <qrilka@REDACTED>
> > >>>> wrote:
> > >>>>
> > >>>>> I'm using the halfword emulator on 64-bit Ubuntu Server.
> > >>>>> And the process state is not "waiting" but "running". Previous crash
> > >>>>> dumps show the same program counter value (and user_drv in running
> > >>>>> state).
> > >>>>>
> > >>>>> Kind regards,
> > >>>>> Kirill Zaborsky
> > >>>>>
> > >>>>>
> > >>>>> 2011/11/28 Dennis Novikov <dennis.novikov@REDACTED>
> > >>>>>
> > >>>>>> On Mon, 28 Nov 2011 08:44:42 +0200, Kirill Zaborsky <qrilka@REDACTED>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Trying to find a workaround for this "stuck node" scenario, I've
> > >>>>>>> upgraded to R14B04 and turned on "heart".
> > >>>>>>> But recently the node once again stopped responding, and heart did
> > >>>>>>> not consider it stuck although I could not contact it.
> > >>>>>>> I've tried to get a crash dump with 'kill -USR1', but it appears that
> > >>>>>>> the crash dump was once again truncated. Does heart kill a "dead"
> > >>>>>>> Erlang node?
> > >>>>>>> The only thing that could be seen from the crash dump is that the
> > >>>>>>> only running process was user_drv (just like the previous times), with
> > >>>>>>> the program counter equal to "user_drv:server_loop/5 + 48". Is it
> > >>>>>>> possible to find out what exactly that stands for?
> > >>>>>>>
> > >>>>>>>
> > >>>>>> Waiting on receive in that function. And you are observing this on a
> > >>>>>> 32-bit VM.
> > >>>>>>
> > >>>>>> --
> > >>>>>> WBR,
> > >>>>>>  DN
> > >>>>>>
> > >>>>>>
> > >>>>
> > >>>> --
> > >>>> WBR,
> > >>>>  DN
> > >>>>
> > >>>
> > >>>
> > >>> _______________________________________________
> > >>> erlang-questions mailing list
> > >>> erlang-questions@REDACTED
> > >>> http://erlang.org/mailman/listinfo/erlang-questions
> > >>>
> > >>>
> > >>
> > >>
> > >> --
> > >> Best Regards,
> > >> - Ahmed Omar
> > >> http://nl.linkedin.com/in/adiaa
> > >> Follow me on twitter
> > >> @spawn_think <http://twitter.com/#!/spawn_think>
> > >>
> > >>
> > >
> >
> >
>
>
>
> --
>
> / Raimo Niskanen, Erlang/OTP, Ericsson AB
>



-- 
Best Regards,
- Ahmed Omar
http://nl.linkedin.com/in/adiaa
Follow me on twitter
@spawn_think <http://twitter.com/#!/spawn_think>