net_kernel hang, perhaps blocked by busy_dist_port race?
Scott Lystig Fritchie
fritchie@REDACTED
Sat May 15 23:08:51 CEST 2010
Hi, all. We've been bitten by a rather mysterious bug that has
disrupted Erlang message passing on roughly 10% of all nodes in a 100+
node cluster. The same thing happened on 10 nodes within a 2-3 second
time window. No further communication with the affected nodes via
Erlang message passing is possible.
For details, see the post with the same subject on the erlang-bugs list.
We're running R13B04 on x86-64 Linux boxes. Steps #1-7 below did indeed
happen in that order; there's little doubt, thanks to some chatty app logging.
-Scott
P.S. For those of you still interested, here's the intro to the
erlang-bugs posting.
I'm wondering if there's a possible race condition when two nodes A
and Z are communicating with each other, like this:
  1. Z makes a bunch of RPCs to A.
  2. A starts sending RPC replies to Z.
  3. Z decides to behave erratically, cause unknown.
  4. A's TCP connection to Z becomes "busy", probably because Z
     cannot or will not read data on the A <-> Z TCP connection.
  5. All processes on A that are trying to reply to Z are blocked and
     unscheduled; 'busy_dist_port' messages are generated for all of
     them.
  6. The 'net_kernel' process on A is one of the procs blocked by the
     'busy_dist_port' events.
  7. A's connection to Z is broken. The system message reported is:
     [{nodedown_reason,connection_closed},{node_type,visible}]
... and then A's 'net_kernel' process remains blocked forever? Or is it
alive but not working correctly? For example,
"erl -sname tmp$$ -remsh app@REDACTED" will fail.
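
One way to watch for steps 5 and 6 on a node is erlang:system_monitor/2
with the busy_dist_port option, which delivers a
{monitor, SusPid, busy_dist_port, Port} message for each process that gets
suspended on a busy distribution port. The following is only a minimal
sketch of that idea, not the logging used in the report above: the module
name busy_dist_watch and the log wording are made up for illustration.

```erlang
-module(busy_dist_watch).
-export([start/0, loop/0]).

%% Spawn a watcher and install it as this node's system monitor.
%% Note: a node has a single system monitor, so this call replaces
%% any monitor settings that were installed earlier.
start() ->
    Pid = spawn(?MODULE, loop, []),
    erlang:system_monitor(Pid, [busy_dist_port]),
    Pid.

%% Log every process that is suspended on a busy distribution port.
%% If the registered name logged is net_kernel, step 6 has occurred.
loop() ->
    receive
        {monitor, SusPid, busy_dist_port, Port} ->
            Name = case erlang:process_info(SusPid, registered_name) of
                       {registered_name, N} -> N;
                       _ -> undefined   % unregistered or already dead
                   end,
            error_logger:warning_msg(
              "busy_dist_port: ~p (~p) suspended on ~p~n",
              [SusPid, Name, Port]),
            loop();
        _Other ->
            %% Ignore anything that is not a busy_dist_port event.
            loop()
    end.
```

Because a remote shell cannot reach a node once it is in the bad state, a
watcher like this would presumably have to be started ahead of time, for
example with busy_dist_watch:start() during application startup.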