[erlang-questions] Sudden death of Erlang Node
Eranga Udesh
casper2000a@REDACTED
Sat Jan 20 03:32:35 CET 2007
Thanks for the info.
I also experienced the file descriptor issue a long time ago. I am
convinced it was caused by CPU exhaustion at the time. I ran my system on
a dual-CPU machine on Erlang 10B, so it actually ran in only a single OS
process and therefore occupied a single CPU. I was using RH 3.0, and
because of the IOWait issue native to that OS, IOWait took almost 100% of
the CPU and connectivity from external nodes or pipes suddenly stopped.
However, all local node processing, as well as connectivity with already
connected nodes, worked without any issues.
* * *
I use RH 3.0 release 3, and its top utility shows the utilization of each
CPU individually. So when I said 25%, that was the utilization of the
busiest CPU; utilization on all the others was lower than that.
* * *
I don't think I said a single OS thread; I said a single OS process. I use
the +A option and the Erl node is started with +A192, so the whole node runs
in 192 OS threads. However, since the -smp flag is not used, it runs in only
a single OS process with 192 OS threads.
Thanks,
- Eranga
-----Original Message-----
From: Valentin Micic [mailto:valentin@REDACTED]
Sent: Friday, January 19, 2007 8:22 PM
To: Serge Aleynikov; Eranga Udesh
Cc: erlang-questions@REDACTED
Subject: Re: [erlang-questions] Sudden death of Erlang Node
Serge Aleynikov wrote:
>
> What caused the emulator to close that file descriptor (aside from
> memory exhaustion) is something that has kept bothering me for a while,
>
Quite some time ago I asked a similar question in a slightly different
context: in my particular case, an Erlang node running R9 would close a
listening socket (file descriptor) that was advertised via epmd, with the
consequence that nobody from outside could connect to the node. The node
itself would happily crunch its numbers away. Interestingly, this happened
at the same time every day, always on the same node -- enough for us to
conclude that it had to be network related, with the particular OS patch
level and lunar phases helping... Out of desperation, we compiled the
run-time for this particular OS patch level using a newer version of the
compiler and, to my surprise, the problem hasn't occurred since. Out of
curiosity, does your run-time report something to stdout like "driver went
away without deselecting..." or some similar phrase?
* * *
On the other hand, Frederik noticed something very valid: 25% on a quad-CPU
machine is 100% of a single CPU. Depending on the particular OS version, the
kernel may always schedule beam on a single CPU, and when this happens, the
heart process may not receive its heartbeat on time...
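
The arithmetic behind that observation, as a minimal sketch. It assumes the
monitoring tool reports utilization averaged over all CPUs and that one
single-threaded process accounts for the whole figure; the function name is
only illustrative.

```python
def single_cpu_share(aggregate_percent, cpu_count):
    """Load on one CPU if a single-threaded process is responsible for
    the whole machine-wide (averaged) utilization figure."""
    return aggregate_percent * cpu_count


# 25% averaged over a quad-CPU machine means one CPU is fully busy.
print(single_cpu_share(25, 4))  # 100
```

If the tool reports each CPU separately instead, the figure is already
per-CPU and no scaling applies.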
* * *
What's your disk I/O like? I've noticed very strange behaviour in beams
started with a single thread (i.e. without the +A n option) and running
dets-intensive applications. Under heavy traffic, beam spends too much time
waiting for I/O, thus delaying process scheduling and message processing. We
had such a situation (a huge mnesia database spread over multiple dets files
with relatively high I/O), and we solved it by starting additional threads.
On pre-SMP Erlang, the thread pool was used to support port drivers
(including disk I/O), thus enabling the "main" thread to run scheduling even
when the disk is busy. However, if you are running 32-bit Erlang, do not get
carried away with the number of threads, because you could easily run out of
memory.
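
A rough sketch of why the thread count matters on a 32-bit system: it
estimates the address space reserved for thread stacks alone. The 8 MB stack
per thread and the 3 GB user address space are assumptions (typical Linux
defaults, not figures from this thread), and the function names are
hypothetical; check the real values on your own system.

```python
def stack_reservation_mb(threads, stack_mb=8):
    """Address space reserved for thread stacks, in MB.
    stack_mb is an assumed per-thread stack size, not a measured one."""
    return threads * stack_mb


def fraction_of_address_space(threads, stack_mb=8, user_space_mb=3072):
    """Share of an assumed 3 GB 32-bit user address space taken by stacks."""
    return stack_reservation_mb(threads, stack_mb) / user_space_mb


print(stack_reservation_mb(192))       # 1536
print(fraction_of_address_space(192))  # 0.5
```

Under these assumptions, a node started with +A192 would reserve about half
of the available address space for stacks before any heap is allocated. This
is reserved address space rather than memory actually in use, but on 32-bit
it is the address space that runs out.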
V.