[erlang-questions] Erl getting stuck in SMP with 8 cores

Edwin Fine erlang-questions_efine@REDACTED
Sun Aug 31 22:08:57 CEST 2008


You didn't say whether or not you had built the Erlang releases from source.
If not, it might be a good idea to try that on the target system
(./configure; make clean; make). If so, which compiler are you using?

I won't ask all the obvious non-Erlang questions like, do you have the
latest firmware for the machine, latest patches for RH 5.1, have you run a
full hardware/memory test suite (maybe the machine is flaky), have you
stopped all other non-critical processes and applications (maybe there's a
rogue real-time process?) etc.

You also didn't say if you were running 32-bit or 64-bit Linux.
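
A minimal sketch for collecting those details (emulator word size, compiler used for the build, SMP support, scheduler count) from inside the running node. The module name node_report is made up for illustration, and it assumes the listed erlang:system_info/1 keys are all available on your release. Note that wordsize describes the emulator, not the kernel, so a 32-bit emulator on 64-bit Linux would still report 4.

```erlang
-module(node_report).
-export([collect/0, print/0]).

%% Gather build and runtime facts that are useful in a bug report.
%% wordsize is 4 for a 32-bit emulator and 8 for a 64-bit one.
collect() ->
    [{Key, erlang:system_info(Key)}
     || Key <- [otp_release, version, system_architecture,
                wordsize, smp_support, schedulers, c_compiler_used]].

%% Print one "key: value" line per item.
print() ->
    lists:foreach(fun({Key, Value}) ->
                          io:format("~p: ~p~n", [Key, Value])
                  end,
                  collect()).
```

After compiling with c(node_report), calling node_report:print() in the shell of each affected node gives output that can be pasted straight into a reply.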

2008/8/31 Eranga Udesh <eranga.erl@REDACTED>

> Hi,
> I recently installed a new Erlang release, 12B-3, on an 8-core (2 x 4-core
> Intel Xeon) HP DL 580 G5 machine running RH Linux 5.1. However, this gives
> mysterious errors in process handling. My observations are below.
> 1. The Erlang node occasionally gets killed by heart. The reason for
> termination is "heart-beat time-out". Sometimes this happens after a couple
> of hours, but sometimes within a few minutes. Please note this is without
> any workload on this Erl node. All the server cores are 98% free.
> 2. I started without heart and ran 2 recursive functions, 1 spawned and 1 in
> the Emulator prompt. Each function outputs the time every 1 second. When this
> issue occurs, the output stops, which means the recursive functions stop
> working. Since heart is not started, the Erl node does not get restarted.
> 2.1. If I leave the Erl node for some time, things sometimes come back
> to normal (tested after 15-20 mins; other times it never recovered). The
> output starts again.
> 2.2. I can do net:ping/1 from another Erl node. If I do an RPC call from
> another Erl node, it works. However, if I run a recursive function, it runs
> once and gets stuck. The RPC call waits until timeout.
> 2.3. When this occurs, the connections from other Erl nodes get
> connection timeouts and are removed. However, as said above, I can
> still do net:ping/1 and connect. If there is no activity, the connection
> times out again a little later.
> 2.4. I used etop to check what's going on. I have an SCTP-based application
> running, and if an SCTP message comes, it gets handled without any
> problem. I can see the recursive functions running and waiting in the
> timer:sleep(1000) clause, but there's no increment in the "Reductions"
> counter and no time output.
> 3. I have multiple Erlang releases running on this server. All of them face
> this issue, but not all at once. The issue occurs at different times and I
> could observe no pattern. Even pure Mnesia DB Erl nodes face the same issue.
> 4. I tried with SMP enabled/disabled and different +S <values>, but nothing
> works.
> 5. I tried RH Linux 5.0/5.1 and Erlang 12B-2/12B-3, but it is still the same.
> 6. The same Erl release on a different HP 4-core machine with the same RH
> Linux works fine.
> Has anybody faced similar issues? What could be the cause of
> this? If there are any other debug dumps I should take, let me know. I tried
> everything and spent about a week, without any luck. I urgently need to find
> and fix the issue.
> Any advice is valuable.
> Thanks,
> - Eranga
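
The test described in point 2 of the quoted message (a recursive function that prints the time every second, run once spawned and once from the shell) could be sketched roughly as follows. This is a guess at the shape of that test, not the original poster's code; the module name ticker and the gap_seconds/2 helper are made up for illustration. Printing the gap since the previous tick makes a stall visible in the output once the process resumes.

```erlang
-module(ticker).
-export([start/0, loop/0, gap_seconds/2]).

%% Run the ticker in its own process. Calling ticker:loop() directly
%% runs the same loop in the shell process instead.
start() ->
    spawn(fun loop/0).

loop() ->
    loop(erlang:localtime()).

%% Sleep one second, then print the local time and the number of
%% seconds elapsed since the previous tick (normally 1).
loop(Prev) ->
    timer:sleep(1000),
    Now = erlang:localtime(),
    io:format("~p ~p gap=~ps~n", [self(), Now, gap_seconds(Prev, Now)]),
    loop(Now).

%% Whole seconds between two {{Y,M,D},{H,Mi,S}} local timestamps.
gap_seconds(Prev, Now) ->
    calendar:datetime_to_gregorian_seconds(Now) -
        calendar:datetime_to_gregorian_seconds(Prev).
```

Because the gap is computed from local wall-clock time, a clock adjustment during the run would also show up as an unusual gap, so an odd value is a hint to look closer rather than proof of a stall.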

For every expert there is an equal and opposite expert - Arthur C. Clarke