[erlang-questions] Erl getting stuck in SMP with 8 cores

Edwin Fine erlang-questions_efine@REDACTED
Mon Sep 1 05:58:59 CEST 2008


Sorry, I only have a 4-core Q6600.

Some friendly suggestions.

When looking for help, you should post as much *relevant* information as you
can, such as the exact versions of your gcc compiler, glibc, Linux kernel
(cat /proc/sys/kernel/osrelease), and if possible, *some simple Erlang code
to reproduce the problem*. Show the command line you use to start Erlang and
the first line of the shell response containing the Erlang version info.
Maybe even attach your Linux system configuration to the post if it's not
too big (sysctl -a > somefile.txt). Then the Erlang folks or someone else
might be able to help you effectively.
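
The details listed above can be gathered into a single file before posting. The following is only a sketch: the function name `collect_info` and the section headings are made up for illustration, and it assumes a Linux box with the usual tools on the PATH (missing tools are skipped rather than treated as errors).

```shell
#!/bin/sh
# Sketch: collect the system details suggested above into one file.
# collect_info is a hypothetical helper name, not a standard tool.
collect_info() {
    out="$1"
    {
        echo "== kernel release =="
        cat /proc/sys/kernel/osrelease 2>/dev/null || uname -r

        echo "== gcc version =="
        gcc --version 2>/dev/null | head -n 1

        echo "== glibc version =="
        getconf GNU_LIBC_VERSION 2>/dev/null || echo "unknown"

        echo "== Erlang version =="
        if command -v erl >/dev/null 2>&1; then
            # Prints the same banner line the Erlang shell shows at startup.
            erl -noshell -eval \
              'io:format("~s", [erlang:system_info(system_version)]), halt().'
        else
            echo "erl not found on PATH"
        fi

        echo "== sysctl -a =="
        sysctl -a 2>/dev/null || echo "sysctl not available"
    } > "$out"
}
```

Running `collect_info sysinfo.txt` on the affected machine would then produce a file that can be attached to the post, along with the exact `erl` command line used to start the node.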

These are some general tips that may or may not help.

   - Have you compared the system configuration on the 4-core box
     (sysctl -a) with the one on the 8-core box? Maybe there is some silly
     limit that's too low.
   - Are you really running a 32-bit OS on a dual 4-core Xeon? How much
     memory is installed? A 32-bit kernel without PAE cannot use more than
     about 3 - 4 GB, and even with PAE a single process (such as one Erlang
     node) cannot address more than that.
   - Is the OS running under Xen or some other hypervisor? If so, could
     there be some interaction there? I have run Erlang R12B-2 under Xen
     3.1.0 on top of openSUSE (kernel 2.6.22, x86_64) with no problems, so
     it's unlikely, but if Xen is in the picture you should try booting on
     the raw OS anyway.
   - Are there any strange entries in the system log that might provide a
     clue?
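
The first tip, comparing the two machines' configurations, might be done roughly as follows. This is a sketch under assumptions: it expects `sysctl -a` to have been saved to a file on each box beforehand, and the helper name `diff_sysctl` and any file names are hypothetical.

```shell
#!/bin/sh
# Sketch: compare two saved "sysctl -a" dumps, e.g. one from the 4-core box
# and one from the 8-core box. diff_sysctl is a hypothetical helper name.
diff_sysctl() {
    a_sorted=$(mktemp) || return 2
    b_sorted=$(mktemp) || return 2

    # Sort first so that differences in output order are not reported.
    sort "$1" > "$a_sorted"
    sort "$2" > "$b_sorted"

    # diff exits with 1 when the files differ; that is expected, not an error.
    diff "$a_sorted" "$b_sorted"
    status=$?

    rm -f "$a_sorted" "$b_sorted"
    return "$status"
}
```

For example, after running `sysctl -a > four-core.txt` on one machine and `sysctl -a > eight-core.txt` on the other, `diff_sysctl four-core.txt eight-core.txt` lists only the settings that differ. Some differences (hostname, counters, per-CPU entries) are expected and harmless; limits on threads, file descriptors or memory are the ones worth a closer look.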


I regret that I have no further ideas on what the issue could be. The fact
that things run fine on the 4-core HP box with the same version of Linux is
troubling, but there are so many variables involved I can't draw any
conclusions from this.

One last thing: I did notice in the R12B-3 release notes that there is a
known bug in gcc 4.3.0, but your build would have failed if you had that
compiler version. Unless you fiddled with the configure script... ;-)

    OTP-7397  The configure script now tests for an serious optimization
              bug in gcc-4.3.0. If the bug is present, the configure script
              will abort (if this happens, the only way to build Erlang/OTP
              is to change to another version of gcc). (Thanks to Mikael
              Pettersson.)
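
A quick way to rule this out is to look at the compiler version directly. The sketch below only compares the version string reported by `gcc -dumpversion`; it does not test for the optimization bug itself the way the configure script does, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: check whether the installed gcc reports version 4.3.0.
# This is a plain string comparison, not the actual test that the
# Erlang/OTP configure script performs.

# Returns 0 if the given version string is exactly 4.3.0.
is_affected_gcc() {
    [ "$1" = "4.3.0" ]
}

check_gcc() {
    ver=$(gcc -dumpversion 2>/dev/null) || {
        echo "gcc not found"
        return 2
    }
    if is_affected_gcc "$ver"; then
        echo "gcc $ver: version named in OTP-7397, try another gcc"
        return 1
    fi
    echo "gcc $ver: not the version named in OTP-7397"
}
```

Calling `check_gcc` on the build machine prints one line saying which case applies.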




On Sun, Aug 31, 2008 at 10:41 PM, Eranga Udesh <eranga.erl@REDACTED> wrote:

> Edwin,
>
> Yes, it's built from source. I tried doing "make clean" as well. I enabled
> SMP, threads, etc. The compiler is gcc. The node is started with SMP
> enabled, 196 OS threads and HiPE. I tried with kernel poll both enabled and
> disabled, but got the same issue both times. The platform is 32-bit.
>
> The latest firmware for the machine and the latest OS patches are
> installed. There are 3 machines with the same configuration, and every
> machine shows the same Erlang issue. Like I said, not all the Erlang nodes
> hit the issue at the same time. I wonder if it's an Erlang scheduler
> blockage or something. Also, there isn't any real-time process running
> that I can think of that could cause this. The CPUs are almost 98% free.
>
> Like I said earlier, the Erlang node is not totally stuck. Single RPC
> commands work, but recursive functions get stuck. Running processes halt
> and their "Reductions" don't increase. The Erlang prompt does not respond
> when attaching with "to_erl".
>
> If you or anybody else has run or tested Erlang with 8 or more Intel
> cores, please let me know. I first want to narrow down the problem to
> identify whether it's a problem only in my installation or an Erlang SMP
> problem. If it's the former, I'd be much relieved.
>
> Thanks,
> - Eranga
>
>
>
> On Mon, Sep 1, 2008 at 1:38 AM, Edwin Fine <erlang-questions_efine@REDACTED> wrote:
>
>> Eranga,
>>
>> You didn't say whether or not you had built the Erlang releases from
>> source. If not, it might be a good idea to try that on the target system
>> (./configure; make clean; make). If so, which compiler are you using?
>>
>> I won't ask all the obvious non-Erlang questions like, do you have the
>> latest firmware for the machine, latest patches for RH 5.1, have you run a
>> full hardware/memory test suite (maybe the machine is flaky), have you
>> stopped all other non-critical processes and applications (maybe there's a
>> rogue real-time process?) etc.
>>
>> You also didn't say if you were running 32-bit or 64-bit Linux.
>>
>> 2008/8/31 Eranga Udesh <eranga.erl@REDACTED>
>>
>>>   Hi,
>>>
>>> I recently installed a new Erlang release, R12B-3, on an 8-core (2 x
>>> 4-core Intel Xeon) HP DL 580 G5 machine running RH Linux 5.1. However,
>>> this gives mysterious errors in process handling. The observations are
>>> as below.
>>>
>>> 1. The Erlang node occasionally gets killed by heart. The reason for
>>> termination is "heart-beat time-out". Sometimes this happens after a
>>> couple of hours, but sometimes within a few minutes. Please note this is
>>> without any workload on this Erlang node. All the server cores are 98%
>>> free.
>>>
>>> 2. I started without heart and ran 2 recursive functions, 1 spawned and 1
>>> at the emulator prompt. Each function outputs the time every 1 second.
>>> When this issue occurs, the output stops, which means the recursive
>>> functions stop working. Since heart is not started, the Erlang node does
>>> not get restarted.
>>>
>>> 2.1. If I leave the Erlang node for some time, things sometimes come
>>> back to normal and the output starts again (seen after 15-20 minutes; at
>>> other times it never recovered).
>>>
>>> 2.2. I can do net:ping/1 from another Erlang node. If I do an RPC call
>>> from another Erlang node, it works. However, if I run a recursive
>>> function, it runs once and gets stuck. The RPC call waits until it times
>>> out.
>>>
>>> 2.3. When this occurs, connections from other Erlang nodes time out and
>>> are removed. However, as said above, I can still do net:ping/1 and
>>> connect. If there is no activity, the connection times out again a
>>> little later.
>>>
>>> 2.4. I used etop to check what's going on. I have an SCTP-based
>>> application running, and if an SCTP message comes in, it gets handled
>>> without any problem. I can see the recursive functions running and
>>> waiting in the timer:sleep(1000) clause, but there's no increase in the
>>> "Reductions" counter and no time output.
>>>
>>> 3. I have multiple Erlang releases running on this server. All of them
>>> face this issue, but not all at once. The issue hits them at different
>>> times and I could not observe any pattern. Even pure Mnesia DB Erlang
>>> nodes face the same issue.
>>>
>>> 4. I tried with SMP enabled/disabled and different +S <values>, but
>>> nothing works.
>>>
>>> 5. I tried with RH Linux 5.0/5.1 and Erlang R12B-2/R12B-3, but it's still
>>> the same.
>>>
>>> 6. The same Erlang release on a different HP 4-core machine with the same
>>> RH Linux works fine.
>>>
>>> Is there anybody who has faced similar issues? What could be the cause
>>> of this? If there are any other debug dumps I should take, let me know. I
>>> have tried everything and spent about a week on it, without any luck. I
>>> urgently need to find and fix the issue.
>>>
>>> Any advice is valuable.
>>>
>>> Thanks,
>>> - Eranga
>>>
>>
>>
>>
>> --
>> For every expert there is an equal and opposite expert - Arthur C. Clarke
>>
>
>


-- 
For every expert there is an equal and opposite expert - Arthur C. Clarke