Thanks, Ahmed, for your advice, but the problem is that I do not know what triggers the problem or when it may happen.<div>Putting all user_drv messages into a file may result in a very large file (at the moment I'm thinking about decreasing the amount of info going to stdout). But maybe it will bring some more details. Thanks once again.</div>
<div><br></div><div>Kind regards,</div><div>Kirill Zaborsky<br><div><br><div class="gmail_quote">2011/11/28 Ahmed Omar <span dir="ltr"><<a href="mailto:spawn.think@gmail.com">spawn.think@gmail.com</a>></span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
Kirill, <div>How about using dbg? </div><div>Note: you need to be careful not to generate a huge amount of trace output on your node, as that could actually kill it. I don't know much about your system or how heavy the load on the nodes is.</div>
<div><br></div><div>You can do something like : </div><div>dbg:tracer(port, dbg:trace_port(file, "/tmp/usr_drv.dbg")).</div><div>dbg:p(user_drv, [m, c]).</div><div>dbg:tpl(user_drv, x).</div><div><br></div><div>
This way you will have a file that contains a trace of all messages and local function calls in the user_drv process.</div><div>When the node gets stuck we can see more details about what happened (note: watch the file size too)<br>
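One possible way to keep the trace file from growing without bound is dbg's wrap-file mode, which rotates over a fixed number of size-limited files. This is only a sketch: the 10 MB size, the count of 5 files and the ".trc" suffix are illustrative assumptions, not values tuned for this system.<br>

```erlang
%% Sketch only: wrap size (10 MB), file count (5) and the ".trc" suffix
%% are illustrative assumptions, not recommendations for this node.
%% Write the trace to a rotating set of files instead of one unbounded file.
PortFun = dbg:trace_port(file, {"/tmp/usr_drv.dbg", wrap, ".trc", 10*1024*1024, 5}),
{ok, _Tracer} = dbg:tracer(port, PortFun),

%% Same trace setup as in the commands above: messages and calls for the
%% user_drv process, exception tracing on all functions in module user_drv.
dbg:p(user_drv, [m, c]),
dbg:tpl(user_drv, x).
```

The wrapped files should be readable later with dbg:trace_client(file, {"/tmp/usr_drv.dbg", wrap, ".trc"}), and dbg:stop_clear() turns the tracing off again.<br>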
<br><div class="gmail_quote"><div><div class="h5">On Mon, Nov 28, 2011 at 1:37 PM, Kirill Zaborsky <span dir="ltr"><<a href="mailto:qrilka@gmail.com" target="_blank">qrilka@gmail.com</a>></span> wrote:<br></div></div>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5">
Thanks, Dennis,<div>I have created a crash dump on a test machine (using the halfword emulator) and found user_drv in the waiting state with</div><div>Program counter - 0x0000000002845f00 (user_drv:server_loop/5 + 48)</div><div>
So it's on the same instruction (but not running)</div><div><br></div><div>Disassembly shows:</div><div>-----------------</div><div><div>0000000002845E80: i_func_info_IaaI 0 user_drv server_loop 5 </div><div>0000000002845EA8: allocate_init_tIy 7 5 y(0) </div>
<div>0000000002845EC0: init_y y(1) </div><div>0000000002845ED0: move2_xyxy x(4) y(2) x(3) y(3) </div><div>0000000002845EE0: move2_xyxy x(2) y(4) x(1) y(5) </div><div>0000000002845EF0: move_ry x(0) y(6) </div><div>0000000002845F00: i_loop_rec_fr f(0000000002846C80) x(0) </div>
<div>0000000002845F10: i_select_tuple_arity2_rfAfAf x(0) f(0000000002846C40) 2 f(0000000002845F40) 3 f(0000000002846418) </div><div>0000000002845F40: i_get_tuple_element_rPx x(0) 0 x(1) </div><div>.....</div><div><div>0000000002846C80: wait_f f(0000000002845F00) </div>
<div>0000000002846C90: badmatch_r x(0) </div></div><div>-----------------</div><div>So it's just a waiting loop. I don't see how the process could be running when the only output for some time was "ALIVE" messages every 15 minutes from run_erl.</div>
<div>Looks like the only way to see what was going on is to get a complete crash dump, but it was truncated by heart :-\</div><div><br></div><div>P.S. It's quite strange that the crash dump shows +48</div><br><div class="gmail_quote">
2011/11/28 <span dir="ltr"><<a href="mailto:dennis.novikov@gmail.com" target="_blank">dennis.novikov@gmail.com</a>></span><div><div><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
+48 does not point to an instruction start on a couple of 32-bit systems I have access to, so I can not assist you further.<br>
<br>
To get an instruction dump named "user_drv.dis" in the beam process working directory you can do<br>
<br>
erts_debug:df(user_drv).<br>
<br>
Happy bug-hunting.<div><div><br>
<br>
<br>
On Mon, 28 Nov 2011 12:01:17 +0200, Kirill Zaborsky <<a href="mailto:qrilka@gmail.com" target="_blank">qrilka@gmail.com</a>> wrote:<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I'm using the halfword emulator on 64-bit Ubuntu Server<br>
And the process state is not "waiting" but "running". Previous crash dumps<br>
show the same program counter value (and user_drv in running state)<br>
<br>
Kind regards,<br>
Kirill Zaborsky<br>
<br>
<br>
2011/11/28 Dennis Novikov <<a href="mailto:dennis.novikov@gmail.com" target="_blank">dennis.novikov@gmail.com</a>><br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
On Mon, 28 Nov 2011 08:44:42 +0200, Kirill Zaborsky <<a href="mailto:qrilka@gmail.com" target="_blank">qrilka@gmail.com</a>><br>
wrote:<br>
<br>
Trying to find any workaround for this "stuck node" scenario I've upgraded<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
to R14B04 and turned on "heart".<br>
But recently the node once again stopped responding. And heart did not<br>
assume it to be stuck although I could not contact it.<br>
I've tried to get a crash dump with 'kill -USR1' but it appeared that<br>
once again the crash dump was truncated. Does heart kill a "dead" Erlang node?<br>
And the only thing that could be seen from the crash dump was that the only<br>
running process was user_drv (just like the previous times) with program<br>
counter equal to "user_drv:server_loop/5 + 48". Is it possible to find out<br>
what exactly that stands for?<br>
<br>
</blockquote>
<br>
Waiting on receive in that function. And you are observing this on a<br>
32-bit VM.<br>
<br>
--<br>
WBR,<br>
DN<br>
<br>
</blockquote></blockquote>
<br>
<br></div></div><span><font color="#888888">
-- <br>
WBR,<br>
DN<br>
</font></span></blockquote></div></div></div><br></div>
<br></div></div><div class="im">
<br></div></blockquote></div><div class="im"><br><br clear="all"><div><br></div>-- <br>Best Regards,<br>- Ahmed Omar<div><a href="http://nl.linkedin.com/in/adiaa" target="_blank">http://nl.linkedin.com/in/adiaa</a></div>
</div></div>
</blockquote></div><br></div></div>