[erlang-questions] Erl getting stuck in SMP with 8 cores

Eranga Udesh eranga.erl@REDACTED
Mon Sep 1 04:41:20 CEST 2008


Edwin,

Yes, it's built from source. I tried doing "make clean" as well. I enabled
smp, threads, etc. The compiler is gcc. The node is started with smp
enabled, 196 OS threads and hipe. I tried with kernel poll both enabled and
disabled, but the issue occurs either way. The platform is 32-bit.
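
A minimal sketch for reporting the emulator settings mentioned above (smp, scheduler count, thread pool size, kernel poll, word size), so that they can be compared across machines. The module name `nodeinfo` and the function `report/0` are hypothetical; only `erlang:system_info/1` with its documented keys is used.

```erlang
-module(nodeinfo).
-export([report/0]).

%% Return a list of {Key, Value} pairs describing how the emulator
%% was built and started. `wordsize` is in bytes (4 on a 32-bit build).
report() ->
    Keys = [otp_release, smp_support, schedulers,
            thread_pool_size, kernel_poll, wordsize],
    [{Key, erlang:system_info(Key)} || Key <- Keys].
```

Calling `nodeinfo:report().` at the Erl prompt on each node gives a list that can be pasted into a reply.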

The latest firmware for the machine and the latest OS patches are installed.
There are 3 identically configured machines and every one of them shows the
same Erlang issue. Like I said, not all the Erlang nodes get the issue at the
same time. I wonder if it's an Erl scheduler blockage or something. Also,
there's no real-time process running that I can think of that could cause
this. The CPUs are almost 98% free.

Like I said earlier, the Erl node is not totally stuck. Single RPC commands
work, but recursive functions get stuck. Running processes halt and their
"Reductions" don't increase. The Erl prompt doesn't respond when I attach
with "to_erl".

If you or anybody else has run or tested Erlang on 8 or more Intel cores,
please let me know. I first want to narrow down the problem and find out
whether it's a problem only in my installation or an Erl smp problem. If it's
the former, I'll be much relieved.
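
A minimal sketch of the kind of recursive test described in this thread: a spawned process that prints the time every second, plus a helper to read its reduction count. The module name `tick` and its functions are hypothetical and are not the code that was actually run.

```erlang
-module(tick).
-export([start/0, loop/1, reductions/1]).

%% Spawn the ticker in its own process and return its pid.
start() ->
    spawn(?MODULE, loop, [1000]).

%% Print the local wall-clock time, sleep, then recurse (tail call).
loop(IntervalMs) ->
    {H, M, S} = erlang:time(),
    io:format("~2..0w:~2..0w:~2..0w~n", [H, M, S]),
    timer:sleep(IntervalMs),
    loop(IntervalMs).

%% Reduction count of a local process, or `undefined` if it is dead.
reductions(Pid) ->
    case erlang:process_info(Pid, reductions) of
        {reductions, N} -> N;
        undefined -> undefined
    end.
```

After `Pid = tick:start().`, calling `tick:reductions(Pid).` a few seconds apart should return increasing numbers on a healthy node. If the printed time stops and the count stays flat while the process is still alive, that matches the symptom described here.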

Thanks,
- Eranga



On Mon, Sep 1, 2008 at 1:38 AM, Edwin Fine
<erlang-questions_efine@REDACTED> wrote:

> Eranga,
>
> You didn't say whether or not you had built the Erlang releases from
> source. If not, it might be a good idea to try that on the target system
> (./configure; make clean; make). If so, which compiler are you using?
>
> I won't ask all the obvious non-Erlang questions like, do you have the
> latest firmware for the machine, latest patches for RH 5.1, have you run a
> full hardware/memory test suite (maybe the machine is flaky), have you
> stopped all other non-critical processes and applications (maybe there's a
> rogue real-time process?) etc.
>
> You also didn't say if you were running 32-bit or 64-bit Linux.
>
> 2008/8/31 Eranga Udesh <eranga.erl@REDACTED>
>
>>   Hi,
>>
>> I recently installed a new Erlang release, 12B-3, on an 8-core (2 x 4-core
>> Intel Xeon) HP DL 580 G5 machine running RH Linux 5.1. However, this gives
>> mysterious errors in process handling. The observations are as below.
>>
>> 1. The Erlang node occasionally gets killed by heart. The reason for
>> termination is "heart-beat time-out". Sometimes this happens after a couple
>> of hours, but sometimes within a few minutes. Please note this is without
>> any workload on this Erl node. All the server cores are 98% free.
>>
>> 2. I started without heart and ran 2 recursive functions, 1 spawned and 1
>> at the emulator prompt. Each function outputs the time every 1 second. When
>> this issue occurs, the output stops, which means the recursive functions
>> have stopped working. Since heart is not started, the Erl node does not get
>> restarted.
>>
>> 2.1. If I leave the Erl node alone for some time, things sometimes come back
>> to normal (observed after 15-20 mins; other times it never recovered). The
>> output then starts again.
>>
>> 2.2. I can do net:ping/1 from another Erl node. If I do an RPC call from
>> another Erl node, it works. However, if I run a recursive function, it runs
>> once and gets stuck. The RPC call waits until it times out.
>>
>> 2.3. When this occurs, connections from other Erl nodes get a connection
>> timeout and are removed. However, as said above, I can still do net:ping/1
>> and connect. If there is no activity, the connection times out again a
>> little later.
>>
>> 2.4. I used etop to check what's going on. I have an SCTP-based application
>> running, and if an SCTP message comes in, it gets handled without any
>> problem. I can see the recursive functions running and waiting at the
>> timer:sleep(1000) clause, but there's no increment in the "Reductions"
>> counter and no time output.
>>
>> 3. I have multiple Erlang releases running on this server. All of them
>> face this issue, but not all at once. The issue hits them at different times
>> and I could not observe any pattern. Even pure Mnesia DB Erl nodes face the
>> same issue.
>>
>> 4. I tried with SMP enabled/disabled, different +S <values>, but nothing
>> works.
>>
>> 5. I tried with RH Linux 5.0/5.1, Erlang 12B-2/12B-3 but still the same.
>>
>> 6. The same Erl release on a different HP 4-core machine with the same RH
>> Linux works fine.
>>
>> Is there anybody who has faced similar issues? What could be the cause
>> of this? If there are any other debug dumps I should take, let me know. I
>> have tried everything and spent about a week on it, without any luck. I
>> urgently need to find and fix the issue.
>>
>> Any advice is valuable.
>>
>> Thanks,
>> - Eranga
>>
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://www.erlang.org/mailman/listinfo/erlang-questions
>>
>
>
>
> --
> For every expert there is an equal and opposite expert - Arthur C. Clarke
>