[erlang-questions] Erl getting stuck in SMP with 8 cores

Eranga Udesh eranga.erl@REDACTED
Sun Aug 31 21:22:45 CEST 2008


Hi,

I recently installed a new Erlang release, 12B-3, on an 8-core (2 x 4-core
Intel Xeon) HP DL 580 G5 machine running RH Linux 5.1. However, it shows
mysterious errors in process handling. My observations are below.

1. The Erlang node is occasionally killed by heart. The reason given for
termination is "heart-beat time-out". Sometimes this happens after a couple
of hours, sometimes within a few minutes. Please note this is without any
workload on this Erl node. All the server cores are 98% free.

2. I started the node without heart and ran 2 recursive functions, 1 spawned
and 1 at the Emulator prompt. Each function outputs the time every 1 second.
When the issue occurs, the output stops, which means the recursive functions
have stopped working. Since heart is not started, the Erl node does not get
restarted.

2.1. If I leave the Erl node alone for some time, things sometimes come back
to normal (seen after 15-20 mins; other times it never recovered). The
output then starts again.

2.2. I can do net:ping/1 from another Erl node. If I do an RPC call from
another Erl node, it works. However, if I run a recursive function, it runs
once and gets stuck. The RPC call waits until timeout.

2.3. When this occurs, connections from other Erl nodes time out and are
removed. However, as said above, I can still do net:ping/1 and connect. If
there is no activity, the connection times out again a little later.

2.4. I used etop to check what's going on. I have an SCTP-based application
running, and if an SCTP message comes in, it is handled without any
problem. I can see the recursive functions running and waiting in the
timer:sleep(1000) clause, but there is no increment in the "Reductions"
counter and no time output.

3. I have multiple Erlang releases running on this server. All of them face
this issue, but not all at once. The issue occurs at different times and I
could not observe any pattern. Even pure Mnesia DB Erl nodes face the same
issue.

4. I tried with SMP enabled/disabled and with different +S <values>, but
nothing helps.

5. I tried with RH Linux 5.0/5.1 and Erlang 12B-2/12B-3, but the result is
still the same.

6. The same Erl release on a different HP 4-core machine with the same RH
Linux works fine.
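
For reference, the test in point 2 can be approximated by the sketch below.
It is not the exact code that was run; the module name tick_test and the
functions start/0, loop/1 and format_time/1 are hypothetical and are used
only to illustrate a recursive function that prints the time once per second
and waits in timer:sleep(1000) between iterations.

```erlang
-module(tick_test).
-export([start/0, loop/1, format_time/1]).

%% Hypothetical helper: spawn one ticker process (the "spawned" case).
start() ->
    spawn(?MODULE, loop, [spawned]).

%% Print the local wall-clock time, sleep for one second, then recurse.
%% Tag only labels the output so the two tickers can be told apart.
loop(Tag) ->
    io:format("~p ~s~n", [Tag, format_time(erlang:time())]),
    timer:sleep(1000),
    loop(Tag).

%% Format an {Hour, Minute, Second} tuple as a zero-padded "HH:MM:SS" string.
format_time({H, M, S}) ->
    lists:flatten(io_lib:format("~2..0w:~2..0w:~2..0w", [H, M, S])).
```

Calling tick_test:start() covers the spawned case, and calling
tick_test:loop(shell) at the Emulator prompt covers the other one (it blocks
the shell). If either ticker stops printing, the gap in the timestamps shows
roughly when the stall began.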

Has anybody faced similar issues? What could be the cause? If there are any
other debug dumps I should take, let me know. I have tried everything I can
think of and spent about a week on this without any luck. I urgently need to
find and fix the issue.

Any advice is valuable.

Thanks,
- Eranga