[erlang-questions] bif_return_trap

Tue Dec 18 07:02:49 CET 2012

Paul Davis <paul.joseph.davis@REDACTED> wrote:

pd> For background, I'm running R14B01 on three nodes with one of the
pd> three nodes in a remote data center that's about 40ms away from the
pd> other two which are <1ms apart.

Hrm, well, if the link truly is 1 Gbit (with ~8% fudge for TCP/IP
overhead), 2x the bandwidth-delay product is about 79 Mbits or 9647
Kbytes.  IIRC, the kernel TCP settings need to allow sliding windows at
least that big in order for a single TCP connection to utilize all that
bandwidth.
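
For anyone who wants to check that arithmetic, here is a rough sketch.
The assumptions are mine: "1 Gbit" is taken as 2^30 bits/sec, the 8%
overhead comes straight off the top, and the delay is 40 ms.  Those
inputs are what reproduce the 79 Mbit / 9647 Kbyte figures.  The module
and function names are made up for illustration.

```erlang
-module(bdp).
-export([window_bytes/3, example/0]).

%% Twice the bandwidth-delay product, in bytes.
%%   LinkBitsPerSec: raw link speed in bits/sec
%%   Overhead:       fraction lost to TCP/IP overhead (0.08 = 8%)
%%   DelaySec:       link delay in seconds
window_bytes(LinkBitsPerSec, Overhead, DelaySec) ->
    UsableBitsPerSec = LinkBitsPerSec * (1.0 - Overhead),
    Bits = 2 * UsableBitsPerSec * DelaySec,
    Bits / 8.

%% Assumed inputs: 2^30 bits/sec link, 8% overhead, 40 ms delay.
%% Returns {Megabits, Kbytes} with Mbit = 10^6 bits, Kbyte = 1024 bytes.
example() ->
    Bytes = window_bytes(1 bsl 30, 0.08, 0.040),
    {Bytes * 8 / 1.0e6, Bytes / 1024}.
```

bdp:example() should come out at roughly 79 Mbits and 9647 Kbytes,
which is the window size the kernel (and the VM's distribution buffer)
would need to accommodate.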

Ditto for the buffering inside the VM ... except that the +zdbbl flag to
"erl" wasn't added until well after R14B01's release.  I don't have the
R14B01 release date handy, but R14B02 may have been released near March
2011?  My patch for +zdbbl wasn't done for several more months:

    commit 8faf1746ece60fc5fa634e5fd16e98df1ef7f3ba
    Author: Scott Lystig Fritchie <slfritchie@REDACTED>
    Date:   Fri Oct 22 15:25:10 2010 -0500

        Add flag-based setting for the distribution buffer busy limit

pd> http://erlang.org/pipermail/erlang-bugs/2010-May/001806.html

IIRC, if you hit that one (which was also fixed after R14B01?), all
distributed Erlang communication freezes.  Attempting a new connection
via "erl [-name foo@REDACTED | -sname foo] -remsh frozen@REDACTED" won't work.

pd> What I'm observing is that the remote node ends up accumulating
pd> processes stuck in erlang:bif_return_trap/1 which eventually
pd> accumulate to the point where the node exhausts RAM and the node
pd> reboots (if I let it go that long). Each process stuck in
pd> bif_return_trap is related to distributed message passing.

Are you seeing busy_dist_port messages sent to the system monitor
process defined by erlang:system_monitor/2?
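
If nothing is registered as the system monitor yet, something along
these lines would let you check.  It's only a minimal sketch: the
module name and the logging are placeholders of my own, and note that
erlang:system_monitor/2 replaces whatever system monitor settings the
node already has.

```erlang
-module(dist_watch).
-export([start/0]).

%% Spawn a process and register it as the node's system monitor,
%% subscribing only to busy_dist_port events.
%% NOTE: this overwrites any existing system_monitor settings.
start() ->
    Pid = spawn(fun loop/0),
    erlang:system_monitor(Pid, [busy_dist_port]),
    Pid.

%% Log each busy_dist_port event: SusPid is the process that got
%% suspended while sending, Port is the busy distribution port.
loop() ->
    receive
        {monitor, SusPid, busy_dist_port, Port} ->
            error_logger:warning_msg(
              "busy_dist_port: ~p suspended on ~p~n", [SusPid, Port]),
            loop();
        _Other ->
            loop()
    end.
```

Run dist_watch:start() on the remote node and watch the log.  A steady
stream of those warnings would suggest senders are being suspended on
a full distribution buffer.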

-Scott