[erlang-questions] R13B02 on 8/16 core box: all TCP communication hangs/frozen

Luke Gorrie luke@REDACTED
Wed Nov 18 15:28:18 CET 2009


Hi Scott!

I suppose you knew the risk of idle speculation from the peanut
gallery when you posted this to erlang-questions.

Can you clarify the timeline? First event (time T0) is foo@REDACTED being
detected as "not responding" by its peers (i.e. doesn't answer TCP
heartbeat) and second event is seeing that the emulator is totally
wedged (time T1). Is it two days between T0 and T1? Can you
characterise the operational state of the machine in between?

lsof shows not many open files / nothing scary.

gdb shows the first thread in select() and the rest blocking on a
pthreads event. That is consistent with (a) normal operation, (b) a
deadlock inside the emulator, or (c) a screwed-up file descriptor set
or timeout in select(). I suppose (c) at least could be ruled out with
gdb'ery.

But it seems like the most interesting tidbit is:

> connections via Telnet or "nc" would open in 0-10 seconds, usually,
> because most app listener sockets use a backlog size of 4096, but no
> sign of system call activity by the VM

Are you saying that it sometimes takes several seconds to establish a
loopback socket connection from telnet or netcat? That sounds
extremely fishy! The kernel (not BEAM) is the one responsible for
getting the socket to ESTABLISHED state and if that doesn't happen
within a few milliseconds then it sounds like your kernel is
performing very badly.
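
One rough way to put a number on this is to time connect() directly
instead of eyeballing telnet. The Python sketch below is only an
illustration: the helper names (time_connect,
demo_listener_without_accept) are made up for this example. The demo
opens a loopback listener that never calls accept(), to show that the
kernel finishes the handshake on its own while there is room in the
backlog.

```python
import socket
import time


def time_connect(host: str, port: int, timeout: float = 15.0) -> float:
    """Return the seconds it takes for a TCP connect() to complete."""
    start = time.monotonic()
    # create_connection() returns once the socket is connected; the
    # listening application does not need to have called accept() yet.
    with socket.create_connection((host, port), timeout=timeout):
        return time.monotonic() - start


def demo_listener_without_accept() -> float:
    """Time a connect to a loopback listener that never accepts."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        srv.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
        srv.listen(16)              # small backlog, never drained here
        port = srv.getsockname()[1]
        return time_connect("127.0.0.1", port)
    finally:
        srv.close()


if __name__ == "__main__":
    elapsed = demo_listener_without_accept()
    print(f"loopback connect took {elapsed * 1000:.3f} ms")
```

Pointing time_connect() at one of the wedged node's listener ports,
from the same host, would show whether the multi-second delay is in
connection setup itself or somewhere later.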

Can the kernel possibly be very busy? (I watch 'vmstat 1' to check.)
If it is busy you could run e.g. oprofile to find out where. One
common cause is "too many open <something>" hitting a bad performance
case in a kernel data structure. Can be files, sockets, routing
entries, iptables conntrack entries, etc (oprofile should make it easy
to see which).
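
For the socket case, a quick census by TCP state can show whether
something is piling up. The sketch below assumes a Linux box and the
usual /proc/net/tcp layout (state as a hex code in the fourth column).
The function name count_tcp_states is made up for this example. It
takes the file contents as a string so it can be tried on saved
output as well.

```python
from collections import Counter

# Hex state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text: str) -> Counter:
    """Count sockets per TCP state from /proc/net/tcp-style text."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        code = fields[3].upper()
        counts[TCP_STATES.get(code, code)] += 1
    return counts


if __name__ == "__main__":
    try:
        with open("/proc/net/tcp") as f:
            for state, n in count_tcp_states(f.read()).most_common():
                print(f"{state:12s} {n}")
    except OSError as exc:
        print(f"could not read /proc/net/tcp: {exc}")
```

A very large count in any one state (CLOSE_WAIT or TIME_WAIT, say)
would be a hint about which kernel table to look at first. This only
covers IPv4 TCP; /proc/net/tcp6 has the same column layout.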

If you still have the machine wedged I'd be curious to see a page of
'vmstat 1' output and also the full set of open sockets ('netstat
-tlnp') and anything else that might be overloaded (are you using e.g.
advanced routing or firewalling features?).

If you want a sounding board let me know and I'll tell you my mobile number.

Cheers,
-Luke (damn I miss these problems!)

P.S. If the list does NOT get two copies of this mail then it's
annoyingly hard to post via Gmane.

