<div dir="ltr"><div>Edwin,</div>
<div> </div>
<div>Yes, it's built from source. I also tried running "make clean" first. I enabled SMP, threads, etc. The compiler is gcc. The node is started with SMP enabled, 196 OS threads and HiPE. I tried with kernel poll both enabled and disabled, and the issue occurs either way. The platform is 32-bit.</div>
<div> </div>
<div>The latest firmware for the machine and the latest OS patches are installed. There are 3 identically configured machines, and every one of them shows the same Erlang issue. As I said, not all the Erlang nodes hit the issue at the same time. I wonder if it's an Erlang scheduler blockage or something similar. Also, there is no real-time process running that I can think of that could be causing this. The CPUs are almost 98% idle.</div>
<div> </div>
<div>As I said earlier, the Erlang node is not totally stuck. Single RPC commands work, but recursive functions get stuck. Running processes halt and their "Reductions" counts don't increase. The Erlang prompt does not respond when attaching with "to_erl".<br>
</div>
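<div>For anyone trying to reproduce this, the kind of test loop described here can be sketched roughly as below. This is only an illustrative sketch, not the exact code that was run: the module name tick_test and its function names are made up for the example.</div>
<pre>
-module(tick_test).
-export([start/0, loop/0, reductions/1]).

%% Spawn the loop in its own process and return its pid.
start() -&gt;
    spawn(?MODULE, loop, []).

%% Print the local time, sleep for one second, then recurse.
loop() -&gt;
    {H, M, S} = erlang:time(),
    io:format("~2..0w:~2..0w:~2..0w~n", [H, M, S]),
    timer:sleep(1000),
    loop().

%% Read the reduction count of the looping process.
%% Returns {reductions, N}, or undefined if the process has died.
reductions(Pid) -&gt;
    erlang:process_info(Pid, reductions).
</pre>
<div>Calling tick_test:reductions(Pid) twice a few seconds apart, either from the shell or over rpc:call/4 from another node, shows whether the count is still moving.</div>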
<div>If you or anybody else has run or tested Erlang with 8 or more Intel cores, please let me know. I first want to narrow down whether this is a problem only in my installation or an Erlang SMP problem. If it's the former, I'll be much relieved.</div>
<div> </div>
<div>Thanks,</div>
<div>- Eranga</div>
<div> </div>
<div><br> </div>
<div class="gmail_quote">On Mon, Sep 1, 2008 at 1:38 AM, Edwin Fine <span dir="ltr">&lt;<a href="mailto:erlang-questions_efine@usa.net">erlang-questions_efine@usa.net</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">
<div dir="ltr">Eranga,<br><br>You didn't say whether or not you had built the Erlang releases from source. If not, it might be a good idea to try that on the target system (./configure; make clean; make). If so, which compiler are you using? <br>
<br>I won't ask all the obvious non-Erlang questions like, do you have the latest firmware for the machine, latest patches for RH 5.1, have you run a full hardware/memory test suite (maybe the machine is flaky), have you stopped all other non-critical processes and applications (maybe there's a rogue real-time process?) etc.<br>
<br>You also didn't say if you were running 32-bit or 64-bit Linux.<br><br>
<div class="gmail_quote">2008/8/31 Eranga Udesh <span dir="ltr">&lt;<a href="mailto:eranga.erl@gmail.com" target="_blank">eranga.erl@gmail.com</a>&gt;</span><br>
<blockquote class="gmail_quote" style="PADDING-LEFT: 1ex; MARGIN: 0pt 0pt 0pt 0.8ex; BORDER-LEFT: rgb(204,204,204) 1px solid">
<div>
<div></div>
<div class="Wj3C7c">
<div dir="ltr">
<div>Hi,</div>
<div> </div>
<div>I recently installed a new Erlang release, 12B-3, on an 8-core (2 x 4-core Intel Xeon) HP DL 580 G5 machine running RH Linux 5.1. However, this gives mysterious errors in process handling. My observations are below.</div>
<div> </div>
<div>1. The Erlang node is occasionally killed by heart. The reason for termination is "heart-beat time-out". Sometimes this happens after a couple of hours, sometimes within a few minutes. Please note this is without any workload on this Erlang node. All the server cores are 98% idle.</div>
<div> </div>
<div>2. I started the node without heart and ran 2 recursive functions, one spawned and one at the emulator prompt. Each function prints the time every second. When the issue occurs, the output stops, which means the recursive functions have stopped running. Since heart is not started, the Erlang node is not restarted.</div>
<div> </div>
<div>2.1. If I leave the Erlang node alone for some time, things sometimes return to normal (observed after 15-20 minutes; other times it never recovered). The output then starts again.</div>
<div> </div>
<div>2.2. I can do net:ping/1 from another Erlang node. If I make an RPC call from another Erlang node, it works. However, if I run a recursive function, it runs once and then gets stuck. The RPC call waits until it times out.</div>
<div> </div>
<div>2.3. When this occurs, connections from other Erlang nodes time out and are dropped. However, as said above, I can still do net:ping/1 and connect. If there is no activity, the connection times out again a little later.</div>
<div> </div>
<div>2.4. I used etop to check what's going on. I have an SCTP-based application running, and if an SCTP message arrives, it is handled without any problem. I can see the recursive functions running and waiting in the timer:sleep(1000) call, but there is no increase in the "Reductions" counter and no time output.</div>
<div> </div>
<div>3. I have multiple Erlang releases running on this server. All of them face this issue, but not all at once. The issue appears at different times, and I could not observe any pattern. Even pure Mnesia DB Erlang nodes face the same issue.</div>
<div> </div>
<div>4. I tried with SMP enabled/disabled and different +S &lt;values&gt;, but nothing works.</div>
<div> </div>
<div>5. I tried with RH Linux 5.0/5.1 and Erlang 12B-2/12B-3, but the result is still the same.</div>
<div> </div>
<div>6. The same Erlang release on a different HP 4-core machine with the same RH Linux works fine.</div>
<div> </div>
<div>Has anybody faced similar issues? What could be the cause? If there are any other debug dumps I should take, let me know. I have tried everything and spent about a week on this without any luck. I urgently need to find and fix the issue.</div>
<div> </div>
<div>Any advice is valuable.</div>
<div> </div>
<div>Thanks,</div>
<div>- Eranga</div></div><br></div></div>_______________________________________________<br>erlang-questions mailing list<br><a href="mailto:erlang-questions@erlang.org" target="_blank">erlang-questions@erlang.org</a><br>
<a href="http://www.erlang.org/mailman/listinfo/erlang-questions" target="_blank">http://www.erlang.org/mailman/listinfo/erlang-questions</a><br></blockquote></div><br><br clear="all"><br>-- <br>For every expert there is an equal and opposite expert - Arthur C. Clarke<br>
</div></blockquote></div><br></div>