busy_dist_port: Who's seeing it? (Heh, who's bothered to look?)

Mon Feb 22 22:29:38 CET 2010

In my searches of the erlang-questions postings that I've archived since
2003, there are zero messages that mention the 'busy_dist_port'
phenomenon.

See docs for erlang:system_monitor() for how to subscribe to
'busy_dist_port' and other VM events, such as long garbage collection
pauses.

I'll be sending a patch to erlang-patches shortly that would change the
ERTS_DE_BUSY_LIMIT constant to something configurable.  I'm very curious
if anyone has other experience with 'busy_dist_port' events that play
havoc with latencies in an inter-node-communication-heavy environment.

Background:

Say that I've got two Erlang nodes, running reasonably-modern multi-core
x86_64 bit CPUs.  There are two Erlang processes on each node,
communicating with each other.

        Node foo@REDACTED      "the net"        Node foo@REDACTED
        ----------                       ----------
                <- one TCP connection ->
        Proc A                           Proc X
        Proc B                           Proc Y
	Proc C                           Proc Z
        ... many more                    ... many more

Buried within the VM is a hardcoded constant called ERTS_DE_BUSY_LIMIT
that limits the number of bytes that may be queued for the Erlang
communication TCP socket between nodes.  (More specifically, the Erlang
port responsible for all communication between foo@REDACTED and foo@REDACTED).  It's
hardcoded at 128KB.

I've got an app where something like this happens on Proc A:
    1. receive hunks of data over time, buffering it in its private state
    2. wait for some event (e.g. file:sync(), some event from the 'net)
    3. send all the data over to Proc X via gen_server:cast(), one hunk
       per cast call.

My problem is that the problem is that step #3 may have several
megabytes of data to send to Proc X.  Each time the 128KB limit is hit,
Proc A is de-scheduled via by a 'busy_dist_port' event.  It isn't
runnable again until the distribution port is no longer busy.

This causes really weird latency problems when the outside world,
e.g. proc P on node bar@REDACTED, tries a synchronous gen_server:call() to Proc
A.  Huge variations of latency (jitter), sometimes many seconds worth
(and *very* unpredictable).  

Bogus timeouts make Proc P very unhappy.  Proc P is a "policeman", and
if the policeman notices bad behavior, then it Calls Upon The Law and
then Smites The Rogue Proc A For Unseemly (In)Activity That Looks Quite
Like A Crash(*).

    But what if Proc A sent all the hunks to Proc B using only one
    gen_server:cast()?  That doesn't solve much: if Proc A and Proc B
    are doing the same thing, then a single > 128KB cast by Proc A will
    still block Proc B's cast.  Things only get likely with more procs
    sending to the same node.

-Scott

(*) For verily, the scriptures tell us, in an asynchronous network, you
cannot tell the difference between a crashed peer and a merely very slow
peer.