[erlang-questions] R13B02 on 8/16 core box: all TCP communication hangs/frozen

Scott Lystig Fritchie fritchie@REDACTED
Thu Nov 19 03:56:21 CET 2009


Luke Gorrie <luke@REDACTED> wrote:

lg> I suppose you knew the risk of idle speculation from the peanut
lg> gallery when you posted this to erlang-questions..

Yes, yes, though perhaps posting only an abridged message here on
erlang-questions was a mistake?  {shrug}

lg> Can you clarify the timeline? First event (time T0) is foo@REDACTED being
lg> detected as "not responding" by its peers (i.e. doesn't answer TCP
lg> heartbeat) and second event is seeing that the emulator is totally
lg> wedged (time T1). Is it two days between T0 and T1?

More like 2 seconds, maybe less.

lg> lsof shows not many open files / nothing scary.

... except that it should have had many more.  The weird thing is that
"lsof" didn't know how to identify the last dozen or so fds.  There
isn't an "lsof" package vs. kernel mismatch, so I wonder if the app made
a legitimate system call (but not close(2)) that managed to put those
fds in a state where "lsof" can't decipher the kernel's data structures.
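For what it's worth, when the shell is still responsive (which it was
not here), the VM's own view of its ports can be dumped and set against
lsof's output using nothing more than erlang:ports/0 and
erlang:port_info/2.  A rough sketch:

    %% Count the VM's open ports and list them by driver name, to compare
    %% against what lsof reports for the same OS process.  Only usable
    %% while the emulator still answers its shell.
    Ports = erlang:ports(),
    io:format("VM port count: ~p~n", [length(Ports)]),
    [{P, erlang:port_info(P, name)} || P <- Ports].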

lg> gdb shows the first thread in select() and the rest blocking on a
lg> pthreads event. Consistent with (a) normal operation (b) deadlock
lg> inside the emulator (c) screwed up file descriptor set or timeout in
lg> select(). I suppose (c) at least could be ruled out with gdb'ery.

Many thanks to Rickard Green for the wondrously quick patch.  We'll
start using it today, though it has taken a week or more to see a
single occurrence of this bug (or, rather, of what we now suspect are
all manifestations of the same thing).

lg> But it seems like the most interesting tidbit is:

lg> Are you saying that it sometimes takes several seconds to establish
lg> a loopback socket connection from telnet or netcat? That sounds
lg> extremely fishy! [...]

Yes, I agree with pointing the blame at the kernel, but ... those
sockets have a listen backlog of 4096.  Strace shows no VM activity in
any thread when a new TCP connection is actually established.  My guess
is that when the app never calls accept(2) and the listen backlog queue
is full, the kernel waits until one of the existing 4096 TCP connections
leaves the CLOSE_WAIT state, which then frees an open slot in the
listen queue?
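In case it helps to picture the setup, a bare-bones listener along the
lines below (module name and port number made up, not our actual code)
shows where the 4096 figure enters, and why nothing moves once the
kernel's queue is full and nothing ever reaches accept:

    -module(backlog_sketch).
    -export([start/1, acceptor/1]).

    %% Open a listen socket with a deep kernel backlog, roughly like the
    %% sockets described above (the port number is illustrative).
    start(Port) ->
        {ok, LSock} = gen_tcp:listen(Port, [binary,
                                            {backlog, 4096},  % kernel listen-queue depth
                                            {active, false},
                                            {reuseaddr, true}]),
        spawn_link(?MODULE, acceptor, [LSock]),
        {ok, LSock}.

    %% An acceptor loop like this normally drains the kernel's queue.  If
    %% the VM is wedged and control never reaches gen_tcp:accept/1, the
    %% kernel keeps completing handshakes until the backlog fills; after
    %% that, new connection attempts (telnet/netcat to loopback included)
    %% can hang with no activity visible in any emulator thread.
    acceptor(LSock) ->
        {ok, Sock} = gen_tcp:accept(LSock),
        gen_tcp:close(Sock),  % a real server hands the socket to a worker
        acceptor(LSock).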

lg> Can the kernel possibly be very busy? (I watch 'vmstat 1' to check.)

Somewhat busy, but not with any activity for the VM in question.  No
shortage of OS file descriptors, OS process slots, etc.

For a little while, we suspected that the problem might have been
within the VM: hitting the Erlang process limit.  We use the default of
32K procs.  But when that's happened in the past, we've still been able
to scribble a log message via our custom gen_event handler for the SASL
event logger.  That handler doesn't keep a file handle open all the
time; instead it opens, writes, and closes the file for each event log
entry.  And since we open the file handle in raw mode, unless we managed
to hit the max # of ports limit, we oughtta be OK.  (And again, past
experience suggests that we can still write the error message to the
file reliably, or at least with a high probability of success.)
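A minimal sketch of that kind of handler, for anyone curious (the module
name is made up, but the open-write-close-per-event shape and the raw
mode are what the text above describes):

    -module(raw_file_logger).
    -behaviour(gen_event).
    -export([init/1, handle_event/2, handle_call/2, handle_info/2,
             terminate/2, code_change/3]).

    %% Keep only the file name in the handler state.  Every event opens the
    %% file in raw mode, appends one term, and closes it again, so no file
    %% handle is held open between events.
    init(FileName) ->
        {ok, FileName}.

    handle_event(Event, FileName) ->
        case file:open(FileName, [raw, append]) of
            {ok, Fd} ->
                file:write(Fd, io_lib:format("~p.~n", [Event])),
                file:close(Fd);
            {error, _Reason} ->
                ok   % logging must never take down the event manager
        end,
        {ok, FileName}.

    handle_call(_Request, FileName)        -> {ok, ok, FileName}.
    handle_info(_Info, FileName)           -> {ok, FileName}.
    terminate(_Arg, _FileName)             -> ok.
    code_change(_OldVsn, FileName, _Extra) -> {ok, FileName}.

Such a handler gets attached to the error_logger/SASL event manager with
error_logger:add_report_handler(raw_file_logger, LogPath) or the
equivalent gen_event:add_handler/3 call (the path being whatever the
application uses).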

A peek with GDB at the # of Erlang processes showed "142", so ... the
too-many-procs theory was flawed.
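When the shell is still alive, those numbers are a call away; the GDB
route was only needed because it wasn't:

    erlang:system_info(process_count).   % live Erlang processes (the "142" above)
    erlang:system_info(process_limit).   % configured maximum, 32768 by default
    length(erlang:ports()).              % open ports, for the max-ports theory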

We also experimented with IP chains, to see whether preventing an
external TCP load balancer from tickling certain TCP ports every few
seconds would make any difference.  But after waiting about 10 minutes
with no change in VM behavior, we gave up.

-Scott

