[erlang-questions] Sudden death of Erlang Node

Eranga Udesh casper2000a@REDACTED
Sat Jan 20 03:32:35 CET 2007


Thanks for the info.

I also experienced the file descriptor issue a long time ago. I am
convinced it was caused by CPU exhaustion at the time. I ran my system on
a dual-CPU machine with Erlang R10B, so the node ran in a single OS process
and therefore occupied only a single CPU. I was using RH 3.0, and because
of the IOWait issue native to that OS, IOWait took almost 100% of the CPU
and connectivity from external nodes or the pipe suddenly stopped. However,
all local node processing, as well as connectivity with already connected
nodes, continued to work without any issues.

* * *

I use RH 3.0 release 3, and its top utility shows the utilization of each
CPU individually. So when I said 25%, I meant the utilization of the
busiest CPU; all the others were lower than that.

* * *

I don't think I said a single OS thread; I said a single OS process. I use
the +A option and the Erlang node is started with +A192, so the whole node
runs with 192 OS threads. However, since the -smp flag is not used, it
still runs as a single OS process, with those 192 OS threads inside it.

Thanks,
- Eranga


-----Original Message-----
From: Valentin Micic [mailto:valentin@REDACTED] 
Sent: Friday, January 19, 2007 8:22 PM
To: Serge Aleynikov; Eranga Udesh
Cc: erlang-questions@REDACTED
Subject: Re: [erlang-questions] Sudden death of Erlang Node

Serge Aleynikov wrote:

>
> What caused the emulator to close that file descriptor (aside from
> memory exhaustion) is something that has kept bothering me for a while,
>
Quite some time ago I asked a similar question in a slightly different
context: in my particular case, an Erlang node running R9 would close a
listening socket (file descriptor) that was advertised via epmd, with the
consequence that nobody from outside could connect to the node. The node
itself would happily crunch its numbers away. Interestingly, this was
happening at the same time every day, always on the same node -- enough for
us to conclude that it had to be network related, plus a particular OS
patch level, helped along by lunar phases... Out of desperation, we
compiled the run-time for this particular OS patch level using a newer
version of the compiler and, to my surprise, the problem hasn't occurred
since. Out of curiosity, does your run-time report something like "driver
went away without deselecting..." or some similar phrase to stdout?

* * *

On the other hand, Frederik noticed something very valid: 25% on a quad-CPU
machine is 100% of a single CPU. Depending on the particular OS version,
the kernel may always schedule beam on a single CPU, and when this happens,
the heart process may not receive its heartbeat on time...
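
To make that arithmetic concrete, here is a small Python sketch (the
function name and sample figures are only illustrative, not taken from any
tool) that converts a machine-wide utilization figure into the load on one
CPU, assuming all of the work lands on a single CPU:

```python
def single_cpu_share(aggregate_percent, cpu_count):
    """Utilization of one CPU, assuming a single-threaded process
    accounts for the whole machine-wide figure (an assumption)."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    # A machine-wide figure is averaged over all CPUs, so the load on the
    # one busy CPU is the aggregate times the CPU count, capped at 100.
    return min(100.0, aggregate_percent * cpu_count)


print(single_cpu_share(25.0, 4))  # quad-CPU machine reporting 25% overall
print(single_cpu_share(25.0, 2))  # dual-CPU machine reporting 25% overall
```

This only applies when the tool reports an average across all CPUs; a tool
that lists each CPU separately already shows the per-CPU figure.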

* * *

What's your disk I/O like? I've noticed very strange behaviour on beams
started with a single thread (i.e. without the +A n option) and running
dets-intensive applications. Under heavy traffic, beam spends too much time
waiting for I/O, thus delaying process scheduling and message processing.
We had such a situation (a huge mnesia database spread over multiple dets
files with relatively high I/O), and we solved it by starting additional
threads. On pre-SMP Erlang, the thread pool was used to support port
drivers (including disk I/O), thus enabling the "main" thread to keep
scheduling even when the disk is busy. However, if you are running 32-bit
Erlang, do not get carried away with the number of threads, because you
could easily run out of memory.
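
As a rough illustration of that warning, the Python sketch below estimates
how much address space the thread stacks alone could reserve, which is one
plausible way a 32-bit process runs short. The per-thread stack size
(8 MiB) and the usable address space (3072 MiB) are assumptions picked for
illustration; the real values depend on the OS, the stack limit settings
and the emulator build:

```python
def stack_reservation_mib(threads, stack_mib_per_thread):
    """Address space reserved for thread stacks, in MiB."""
    return threads * stack_mib_per_thread


def fraction_of_address_space(threads, stack_mib_per_thread, usable_mib):
    """Share of the usable address space taken by thread stacks."""
    return stack_reservation_mib(threads, stack_mib_per_thread) / usable_mib


# Assumed figures, not measurements: 8 MiB per stack, 3072 MiB usable.
ASSUMED_STACK_MIB = 8
ASSUMED_USABLE_MIB = 3072

for threads in (16, 64, 192):
    reserved = stack_reservation_mib(threads, ASSUMED_STACK_MIB)
    share = fraction_of_address_space(
        threads, ASSUMED_STACK_MIB, ASSUMED_USABLE_MIB)
    print(f"{threads:4d} threads: {reserved:5d} MiB reserved ({share:.0%})")
```

Under these assumed figures, a large pool leaves noticeably less room for
the heap, ets tables and binaries, so it is worth sizing the pool to the
actual I/O load rather than picking a large number up front.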

V.




