[erlang-questions] Strange BEAM slowdown

Mon Feb 22 18:14:11 CET 2016

Hello,

We have an application where we read a huge volume of small messages
from ZMQ sockets and distribute them to Erlang processes.  We are
seeing strange behavior where, after a short while, beam.smp's load
drops sharply and the incoming data begins to queue, eating memory
until we either stop the program or the Linux OOM killer does it for
us.

DETAILS
-------
CentOS release 6.6 (Final)
Erlang/OTP 17 [erts-6.4] [source-2e19e2f] [64-bit] [smp:56:56] [async-threads:20] [hipe] [kernel-poll:true]

beam.smp is started with the flags: +sbt db +sub true

We have 60+ data sources (TCP/ZMQ sockets), each of which feeds an
independent set of processes; there is no interaction between the
processes handling the data from one socket and the processes handling
data from other sockets.

Our first implementation used the erlzmq2 library to read the socket.
We then parsed the messages in Erlang and sent Erlang terms to the
data handling processes.

After seeing the problem behavior, we suspected that the repeated calls
to erlzmq:recv() and the parsing in Erlang might be the cause of the
backlog, so we rewrote that code as a NIF (a background thread plus
several API calls).  Our NIF implementation reads the ZMQ socket, parses
the data and then sends it to the data handling processes.  We
(obviously, I suppose) create one of these background threads for each
of the 60+ data source sockets.
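
To make the thread-per-source layout concrete, here is a minimal,
self-contained sketch in plain C with pthreads.  It is an illustration
only, not the actual NIF: the socket read, the message format ("id:seq")
and the dispatch step are hypothetical stand-ins (fake_recv, parse_seq,
fake_dispatch).  A real NIF thread would read from libzmq and would
typically hand the parsed term to an Erlang process with something like
enif_send().

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-source state: one of these per data source socket. */
typedef struct {
    int  source_id;   /* which source this thread owns */
    int  n_messages;  /* how many fake messages to "receive" */
    long delivered;   /* messages handed to the handler so far */
} source_t;

/* Stand-in for a blocking socket read: writes a fake "id:seq" message. */
static int fake_recv(const source_t *src, int seq, char *buf, size_t len)
{
    return snprintf(buf, len, "%d:%d", src->source_id, seq);
}

/* Stand-in for parsing: returns the sequence number after the colon. */
int parse_seq(const char *msg)
{
    int id, seq;
    int matched = sscanf(msg, "%d:%d", &id, &seq);
    assert(matched == 2);
    (void)matched;
    return seq;
}

/* Stand-in for dispatch to the data handling process. */
static void fake_dispatch(source_t *src, int seq)
{
    (void)seq;
    src->delivered++;  /* each thread touches only its own source_t */
}

/* Body of one background thread: read, parse, dispatch, repeat. */
static void *source_loop(void *arg)
{
    source_t *src = arg;
    char buf[64];
    for (int seq = 0; seq < src->n_messages; seq++) {
        fake_recv(src, seq, buf, sizeof buf);
        fake_dispatch(src, parse_seq(buf));
    }
    return NULL;
}

/* Start one thread per source, wait for all of them, and return the
 * total number of dispatched messages, or -1 on setup failure. */
long run_sources(int n_sources, int n_messages)
{
    pthread_t *tids = calloc((size_t)n_sources, sizeof *tids);
    source_t  *srcs = calloc((size_t)n_sources, sizeof *srcs);
    long total = -1;
    int started = 0;

    if (tids != NULL && srcs != NULL) {
        for (; started < n_sources; started++) {
            srcs[started].source_id  = started;
            srcs[started].n_messages = n_messages;
            if (pthread_create(&tids[started], NULL, source_loop,
                               &srcs[started]) != 0)
                break;
        }
        if (started == n_sources)
            total = 0;
        for (int i = 0; i < started; i++) {
            pthread_join(tids[i], NULL);
            if (total >= 0)
                total += srcs[i].delivered;
        }
    }
    free(tids);
    free(srcs);
    return total;
}
```

Calling run_sources(60, 1000) models 60 independent sources with 1000
messages each.  The threads share no state, which mirrors the lack of
interaction between the per-socket process sets described above.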

Despite the entirely different implementation of ZMQ handling, parsing
and dispatch of the data, we are seeing the same issue: first the load
drops off precipitously, and then the data starts queuing in the ZMQ
socket buffers, leaving the program unusable.

We are curious whether anyone has seen this sort of behavior with BEAM
or has suggestions on where to look for the issue.

Thanks,

Tim