Erlang Latency Guide

Sun Nov 15 21:40:01 CET 2009

Hi!

I enjoyed a chat with Patrik Nyblom during the EUC. We talked a bit about
the scheduler and reducing latency in Erlang systems. I've written a fairly
complex system in Erlang that requires low latencies. During development I
learned a few tricks for achieving low latencies in Erlang.

This weekend I decided to write down my experiences and this resulted in the
Erlang Latency Guide, http://rigtorp.se/latency.html.

Here is a plain text version:

Erlang Latency Guide

Introduction

Latency is a tricky subject; sometimes it's not even clear what to measure or
how to measure it. I've had the experience of writing a fairly complex system
requiring low latencies in Erlang. Fortunately Erlang provides really good
baseline performance. Most of the time you simply write your program and it
will perform well. There are however a few tricks that can be used to lower the
latencies of a specific path in the system. This document describes a few of
these tricks.

Yield

Erlang allows you to design efficient concurrent systems without caring how
processes are scheduled or how many cores the system is running on. When
running Erlang with multiple schedulers (generally one per CPU-core) the
runtime will balance the load between the schedulers by migrating processes to
starved schedulers. There is no way to bind processes to schedulers or control
how processes are migrated between schedulers. This introduces a
non-deterministic behavior in the system and makes it hard to control latency.

A common pattern is to have a demultiplexer that receives a message, sends it
to some other process/processes and then performs some additional processing on
the message:

loop(State) ->
    receive
        Msg ->
            Pid = lookup_pid(Msg, State),
            Pid ! Msg,
            State2 = update_state(Msg, State),
            loop(State2)
    end.

After the message has been sent the receiving process will be ready to execute,
but unless the receiving process is on a different scheduler the demultiplexer
will first finish executing. Ideally we would bind the demultiplexer to one
scheduler and bind the receiving processes to the other schedulers, but that's
not allowed in Erlang.

Erlang provides only one simple but powerful way to control scheduling: the
built-in function (BIF) erlang:yield/0 lets a process voluntarily give up
execution so that other processes get a chance to run.

The demultiplexer pattern can be modified by adding erlang:yield() after
sending the message:

loop(State) ->
    receive
        Msg ->
            Pid = lookup_pid(Msg, State),
            Pid ! Msg,
            erlang:yield(),
            State2 = update_state(Msg, State),
            loop(State2)
    end.

After the message has been sent the demultiplexer will give up execution. If
the demultiplexer and the receiver are on the same scheduler, the receiver will
execute before the demultiplexer finishes executing; if they are on different
schedulers, they will execute in parallel.

Using the erlang:yield/0 BIF it's possible to control the scheduling of Erlang
processes. If used correctly this can reduce the latency in a system.
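
Whether the yield helps in a given system is best checked by measuring. The
following is a minimal sketch, not part of the original system: the module
name yield_sketch, the function measure/1 and the CPU-bound stand-in for
update_state/2 are all made up for illustration. It reports how long it takes
from the send until the receiver actually runs, with and without the yield.

```erlang
-module(yield_sketch).
-export([measure/1]).

%% measure(true | false) -> microseconds between the send in the
%% demultiplexer and the moment the receiver starts executing.
%% All names here are illustrative assumptions.
measure(Yield) ->
    Parent = self(),
    Receiver = spawn(fun() ->
        receive
            {msg, T0} ->
                Parent ! {latency, erlang:monotonic_time(microsecond) - T0}
        end
    end),
    Demux = spawn(fun() ->
        receive go -> ok end,
        Receiver ! {msg, erlang:monotonic_time(microsecond)},
        case Yield of
            true  -> erlang:yield();   % let the receiver run first
            false -> ok
        end,
        %% Stand-in for update_state/2: some CPU-bound work.
        _ = lists:sum(lists:seq(1, 1000000)),
        ok
    end),
    Demux ! go,
    receive
        {latency, Us} -> Us
    after 5000 ->
        timeout
    end.
```

Calling yield_sketch:measure(false) and yield_sketch:measure(true) a number of
times gives a rough comparison. The numbers depend on the number of
schedulers, the load on the node and where the processes happen to be placed,
so they should be treated as indicative only.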

Network

All network I/O in Erlang is implemented as an Erlang driver. The driver is
interfaced by the module prim_inet which in turn is interfaced by the network
related modules in the kernel application.

There is a performance issue with the prim_inet:send/2 and prim_inet:recv/2
functions affecting all the network related modules. When calling
prim_inet:send/2 or prim_inet:recv/2 the process will do a selective receive.
If the process's message queue is long there will be a performance penalty from
doing this selective receive.
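
The cost of a selective receive on a long message queue can be observed in
isolation. The sketch below is only an illustration: the module name
selrecv_sketch and the message shapes {noise, I} and {wanted, Payload} are
assumptions and have nothing to do with the messages prim_inet really uses.
It fills a mailbox with N unrelated messages and times how long it takes to
pick out the one message that matches.

```erlang
-module(selrecv_sketch).
-export([cost/1]).

%% cost(N) -> microseconds needed to pick one tagged message out of a
%% mailbox that already holds N unrelated messages.
cost(N) ->
    Parent = self(),
    Pid = spawn(fun() ->
        %% Wait until the whole mailbox has been filled.
        receive start -> ok end,
        T0 = erlang:monotonic_time(microsecond),
        %% Selective receive: every queued {noise, _} message is
        %% inspected and skipped before {wanted, _} matches.
        receive {wanted, _} -> ok end,
        Parent ! {elapsed, erlang:monotonic_time(microsecond) - T0}
    end),
    [Pid ! {noise, I} || I <- lists:seq(1, N)],
    Pid ! {wanted, payload},
    Pid ! start,
    receive
        {elapsed, Us} -> Us
    after 10000 ->
        timeout
    end.
```

Comparing for example selrecv_sketch:cost(10) with
selrecv_sketch:cost(1000000) should show the time growing with the length of
the queue.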

For receiving there is a simple solution to this problem: use the {active,
once} socket option.

A simple selective receive-free TCP receiver:

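loop(Sock) ->
    %% Ask the socket to deliver exactly one more packet as a message.
    inet:setopts(Sock, [{active, once}]),
    receive
        {tcp, Sock, _Data} ->
            %% Process _Data here, then re-arm the socket.
            loop(Sock);
        {tcp_error, Sock, Reason} ->
            exit(Reason);
        {tcp_closed, Sock} ->
            exit(normal)
    end.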

To implement sending without doing a selective receive it is necessary to use
the low-level port interface function erlang:port_command/2. Calling
erlang:port_command(Sock, Data) on a TCP socket sends Data on the socket and
returns true. The socket driver then replies by sending
{inet_reply, Sock, Status} to the process that called erlang:port_command/2.

A simple selective receive-free TCP writer:

loop(Sock) ->
    receive
        {inet_reply, _, ok} ->
            loop(Sock);
        {inet_reply, _, Status} ->
            exit(Status);
        Msg ->
            try erlang:port_command(Sock, Msg)
            catch error:Error -> exit(Error)
            end,
            loop(Sock)
    end.

Though not Erlang specific, it is important to remember to tune the send and
receive buffer sizes. If the TCP receive window is full, data may be delayed by
up to one network round trip. For UDP, packets will be dropped.
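
As a minimal sketch of setting the buffer sizes from Erlang, the example below
uses the recbuf and sndbuf socket options. The module name bufsize_sketch and
the function listen/1 are made up for illustration, and suitable sizes depend
on the bandwidth and round trip time of the link. The sizes are set on the
listening socket, on the assumption that accepted sockets inherit them, and
then read back because the operating system is free to adjust the request.

```erlang
-module(bufsize_sketch).
-export([listen/1]).

%% listen(Size) -> the buffer sizes the OS actually applied, as a list
%% of {recbuf, Bytes} and {sndbuf, Bytes} tuples.
listen(Size) ->
    %% Port 0 lets the OS pick a free port; this is only a demonstration.
    {ok, LSock} = gen_tcp:listen(0, [binary, {active, false},
                                     {recbuf, Size}, {sndbuf, Size}]),
    %% Read the values back, the OS may round or cap them.
    {ok, Actual} = inet:getopts(LSock, [recbuf, sndbuf]),
    ok = gen_tcp:close(LSock),
    Actual.
```

For example bufsize_sketch:listen(262144) returns the sizes that were applied,
which may differ from the requested 262144 bytes.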

Distribution

Erlang allows you to send messages between processes at different nodes on the
same or different computers. It is also possible to interact with C-nodes
(Erlang nodes implemented in C). The communication is done over TCP/IP and
obviously this introduces latencies, especially when communicating between
nodes on a network.

Even when the nodes are running on the same computer they communicate using
TCP/IP over the loopback interface. Different operating systems have widely
different loopback performance (Solaris has lower latency than Linux). If your
system uses the loopback interface it's a good idea to consider this.
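
One way to consider it is to measure the message round trip time. The sketch
below is an illustration under stated assumptions: the module name rtt_sketch
and its functions are made up, and the target node must already be connected
and have this module loaded. Running it against node() gives a baseline
without any TCP/IP involved; running it against a second node on the same
computer includes the loopback interface.

```erlang
-module(rtt_sketch).
-export([echo/0, rtt/2]).

%% Echo process: answers every ping until told to stop.
echo() ->
    receive
        {ping, From, Ref} ->
            From ! {pong, Ref},
            echo();
        stop ->
            ok
    end.

%% rtt(Node, N) -> average round trip time in microseconds over N pings.
%% Node may be node() or a connected node that has this module loaded.
rtt(Node, N) when N > 0 ->
    Pid = spawn(Node, ?MODULE, echo, []),
    T0 = erlang:monotonic_time(microsecond),
    ok = ping(Pid, N),
    T1 = erlang:monotonic_time(microsecond),
    Pid ! stop,
    (T1 - T0) / N.

%% Send one ping at a time and wait for the matching pong.
ping(_Pid, 0) ->
    ok;
ping(Pid, N) ->
    Ref = make_ref(),
    Pid ! {ping, self(), Ref},
    receive
        {pong, Ref} -> ping(Pid, N - 1)
    end.
```

For example rtt_sketch:rtt(node(), 10000) measures the local baseline. The
first pings to a remote node also include process start-up, so a larger N
gives a steadier average.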

Further Reading

  • erts/preloaded/src/prim_inet.erl from the Erlang release
  • erts/emulator/drivers/common/inet_drv.c from the Erlang release

CC • written by Erik Rigtorp