net_kernel hang, perhaps blocked by busy_dist_port race?

Sat May 15 23:08:51 CEST 2010

Hi, all.  We've been bitten by a rather mysterious bug that has
disrupted Erlang message passing on roughly 10% of the nodes in a 100+
node cluster: 10 nodes were hit within a 2-3 second window.  No
further communication with the affected nodes via Erlang message
passing is possible.

For details, see the post with the same subject on the erlang-bugs
list.  We're running R13B04 on x86-64 Linux boxes.  Thanks to some
chatty app logging, there's little doubt that steps #1-7 below did
happen in that order.

-Scott

P.S. For those of you still interested, here's the intro to the
erlang-bugs posting.

I'm wondering if there's a possible race condition when two nodes A
and Z are communicating with each other, like this:

   1. Z makes a bunch of RPCs to A.
   2. A starts sending RPC replies to Z.
   3. Z decides to behave erratically, cause unknown.
   4. A's TCP connection to Z becomes "busy", probably because Z
      cannot or will not read data on the A <-> Z TCP connection.
   5. All processes on A that are trying to reply to Z are blocked and
      unscheduled; 'busy_dist_port' messages are generated for all of
      them. 
   6. The 'net_kernel' process on A is one of the procs blocked by the
      'busy_dist_port' events.
   7. A's connection to Z is broken.  The system message reported is:
      [{nodedown_reason,connection_closed},{node_type,visible}]

... and then does A's 'net_kernel' process remain blocked forever?  Or
is it alive but not working correctly?  For example,
"erl -sname tmp$$ -remsh app@REDACTED" fails.
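
One way to gather more evidence on a node like A, sketched under some
assumptions: a small module that registers a system monitor for
'busy_dist_port' events and reports the state of 'net_kernel'.  The
module name dist_watch and its function names are made up for
illustration; erlang:system_monitor/2 and erlang:process_info/2 are
the standard BIFs.  Note that a node has only one system monitor
process, so this replaces any monitor already installed.

```erlang
-module(dist_watch).
-export([start/0, net_kernel_info/0]).

%% Hypothetical helper module, not part of OTP.

%% Spawn a logger process and register it as the node's system
%% monitor for busy_dist_port events.  Returns the logger's pid.
start() ->
    Pid = spawn(fun loop/0),
    %% Replaces any previously installed system monitor.
    erlang:system_monitor(Pid, [busy_dist_port]),
    Pid.

%% Log every local process that gets suspended on a busy
%% distribution port, along with its registered name (if any).
loop() ->
    receive
        {monitor, SusPid, busy_dist_port, Port} ->
            Name = erlang:process_info(SusPid, registered_name),
            error_logger:info_msg(
              "busy_dist_port: ~p (~p) suspended on ~p~n",
              [SusPid, Name, Port]),
            loop();
        _Other ->
            loop()
    end.

%% Report a few facts about net_kernel: its scheduling status, what
%% it is currently executing, and how long its mailbox is.
%% Returns 'undefined' if net_kernel is not registered.
net_kernel_info() ->
    case whereis(net_kernel) of
        undefined ->
            undefined;
        Pid ->
            erlang:process_info(
              Pid, [status, current_function, message_queue_len])
    end.
```

From a shell on A (assuming you can still get one, e.g. a console
attached locally), dist_watch:start() begins logging and
dist_watch:net_kernel_info() shows what 'net_kernel' is doing.  A
status of 'suspended' there would be consistent with step 6, though it
wouldn't by itself prove the race.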