R13B02 on 8/16 core box: all TCP communication hangs/frozen

Wed Nov 18 13:24:11 CET 2009

Good morning, all.  I posted a more detailed message over on the
erlang-bugs mailing list but would like to reach a wider audience.  I've
encountered a rare, intermittent bug that has been driving me nuts to
fix and is about to cause an important customer to be, er, extremely
irritated.  If you're not interested in bug reports, sorry in advance
about the spam.

I forgot to mention in the erlang-bugs posting that this appears to have
hit us in a small lab (~8 machines) perhaps a handful of times in the
last several weeks.  Only today have we got some really good diagnostic
info, alas.  It appears repeatable but requires a ton of patience and
time.

-Scott

[...]

While running a distributed Erlang app on a box that probably wasn't
very busy at the time, all of its peer nodes decided that the hung box
had timed out (some whitespace and newlines flattened):

    ** Node foo@REDACTED not responding **** Removing (timedout) connection **

After a running time of about 2 days, the VM appeared to lock up
completely.  It was not possible to connect a remote shell.  "strace -f -p"
showed no output when connecting to the Erlang distribution TCP port
reported by "epmd -names", nor any output when connecting to any of the
TCP listener sockets owned by our Erlang applications inside the VM.
(Connections via Telnet or "nc" would usually open in 0-10 seconds,
because most app listener sockets use a backlog size of 4096, but there
was no sign of system call activity by the VM.)

The box in question is a few-months-old Dell (I can look up the model
number if it's helpful), dual CPU, quad core + hyperthreading (or
whatever Intel calls it today) for 16 virtual cores.  If I start an
Erlang shell without args, the banner is:

    Erlang R13B02 (erts-5.7.3) [source] [64-bit] [smp:16:16] [rq:16] [async-threads:0] [hipe] [kernel-poll:false]

    Eshell V5.7.3  (abort with ^G)
    (slf21203@REDACTED)1> 

The app is run via:

    env ERL_MAX_ETS_TABLES=10007 \
            erl \
            +A 64 +K true \
            -noinput -noshell \
            -kernel dist_auto_connect once \
            -kernel net_ticktime 20 \
            ...
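
A side note on the "-kernel net_ticktime 20" setting above.  Per the
kernel application documentation, each connected node is ticked every
TickTime/4 seconds, and an unresponsive node is considered down after
somewhere between TickTime - TickTime/4 and TickTime + TickTime/4
seconds.  A minimal sketch of that arithmetic follows; the
ticktime_window helper is hypothetical and not part of OTP.

```shell
#!/bin/sh
# Hypothetical helper (not part of OTP): given a net_ticktime in seconds,
# print the tick interval and the window in which a silent peer is
# declared down, using the TickTime/4 rule from the kernel app docs.
ticktime_window() {
    awk -v t="$1" 'BEGIN {
        printf "tick_every=%gs min_detect=%gs max_detect=%gs\n",
               t / 4, t - t / 4, t + t / 4
    }'
}

# The value used in the command line above.
ticktime_window 20
# prints: tick_every=5s min_detect=15s max_detect=25s
```

With net_ticktime 20, peers would drop a silent node after roughly 15
to 25 seconds, which is consistent with the "not responding" message
quoted earlier.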

I've got output from "strace" and "lsof", plus a few GDB stack
backtraces from each Pthread.  I've also got a core file of the process
via "gcore", so I can do other GDB stuff if it'd be helpful.
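
For anyone trying to capture the same kind of evidence from a hung VM,
the tools named above can be combined roughly as follows.  This is only
a sketch: the diag_commands function, the 30-second strace window, and
the output file names are assumptions for illustration, not what was
actually run here.  It prints the commands rather than executing them,
so they can be reviewed before touching a production process.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) one diagnostic pass over a hung
# process.  Arguments: PID of the VM, optional output directory.
diag_commands() {
    pid="$1"
    out="${2:-.}"
    # System calls across all threads for 30 seconds; the post reports
    # no output at all from the hung VM.
    echo "timeout 30 strace -f -p $pid -o $out/strace.$pid.txt"
    # Open files and sockets held by the process.
    echo "lsof -p $pid > $out/lsof.$pid.txt"
    # Stack backtrace of every thread.
    echo "gdb -p $pid -batch -ex 'thread apply all bt' > $out/gdb-bt.$pid.txt"
    # Core image for later inspection; gcore appends the PID to the prefix.
    echo "gcore -o $out/core $pid"
}

# Example with a made-up PID and directory.
diag_commands 12345 /tmp/diag
```

The printed lines can be pasted into a root shell one at a time, or
piped to "sh" once they look right for the target box.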

I will almost certainly have at least 8 metric tons of extremely unhappy
customer asking lots of anxious questions within 12-24 hours.  Unhappy
me.  If there's other info that I can provide to the OTP team or other
debugging parties, please let me know ASAP.  And many thanks in advance!

-Scott

[... about 1200 lines of text deleted ...]