busy_dist_port: Who's seeing it? (Heh, who's bothered to look?)
Scott Lystig Fritchie
fritchie@REDACTED
Mon Feb 22 22:29:38 CET 2010
In my searches of the erlang-questions postings that I've archived since
2003, there are zero messages that mention the 'busy_dist_port'
phenomenon.
See docs for erlang:system_monitor() for how to subscribe to
'busy_dist_port' and other VM events, such as long garbage collection
pauses.
I'll be sending a patch to erlang-patches shortly that would change the
ERTS_DE_BUSY_LIMIT constant to something configurable. I'm very curious
if anyone has other experience with 'busy_dist_port' events that play
havoc with latencies in an inter-node-communication-heavy environment.
Background:
Say that I've got two Erlang nodes, running reasonably-modern multi-core
x86_64 bit CPUs. There are two Erlang processes on each node,
communicating with each other.
Node foo@REDACTED "the net" Node foo@REDACTED
---------- ----------
<- one TCP connection ->
Proc A Proc X
Proc B Proc Y
Proc C Proc Z
... many more ... many more
Buried within the VM is a hardcoded constant called ERTS_DE_BUSY_LIMIT
that limits the number of bytes that may be queued for the Erlang
communication TCP socket between nodes. (More specifically, the Erlang
port responsible for all communication between foo@REDACTED and foo@REDACTED). It's
hardcoded at 128KB.
I've got an app where something like this happens on Proc A:
1. receive hunks of data over time, buffering it in its private state
2. wait for some event (e.g. file:sync(), some event from the 'net)
3. send all the data over to Proc X via gen_server:cast(), one hunk
per cast call.
My problem is that the problem is that step #3 may have several
megabytes of data to send to Proc X. Each time the 128KB limit is hit,
Proc A is de-scheduled via by a 'busy_dist_port' event. It isn't
runnable again until the distribution port is no longer busy.
This causes really weird latency problems when the outside world,
e.g. proc P on node bar@REDACTED, tries a synchronous gen_server:call() to Proc
A. Huge variations of latency (jitter), sometimes many seconds worth
(and *very* unpredictable).
Bogus timeouts make Proc P very unhappy. Proc P is a "policeman", and
if the policeman notices bad behavior, then it Calls Upon The Law and
then Smites The Rogue Proc A For Unseemly (In)Activity That Looks Quite
Like A Crash(*).
But what if Proc A sent all the hunks to Proc B using only one
gen_server:cast()? That doesn't solve much: if Proc A and Proc B
are doing the same thing, then a single > 128KB cast by Proc A will
still block Proc B's cast. Things only get likely with more procs
sending to the same node.
-Scott
(*) For verily, the scriptures tell us, in an asynchronous network, you
cannot tell the difference between a crashed peer and a merely very slow
peer.
More information about the erlang-questions
mailing list