[erlang-questions] strings vs binaries

Wed Aug 19 11:15:54 CEST 2015

On Tue, Aug 18, 2015 at 10:16 PM, Ben Hsu <benhsu@REDACTED> wrote:

> My question is when you will use each one. Are binary strings used for
> sending data over the wire, and normal strings used internally? what are
> the tradeoffs?

Somewhat absent from the discussion so far is the third string-like type
in Erlang: the atom().

* A string() is a linked list of Unicode code points on the process heap.
* A binary() is a contiguous buffer of bytes, which can hold a UTF-8 encoded
string. Small binaries (up to 64 bytes) are allocated on the process heap;
larger ones live in the binary() arena of the memory allocator. This arena
is shared among all processes.
* An atom() is a name which is hashed to an integer internally and then the
integer is used in place of the name.
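
To make the three types concrete, here is a small sketch that holds the
same text in each representation and converts between them. The module
and function names are made up for illustration; the BIFs and the
`unicode` functions are standard OTP.

```erlang
-module(string_types).
-export([demo/0]).

%% Hold the text "fnord" in each of the three representations.
demo() ->
    S = "fnord",            % string(): a list of code points
    B = <<"fnord">>,        % binary(): a contiguous byte buffer
    A = fnord,              % atom(): a symbolic name
    true = is_list(S),
    true = is_binary(B),
    true = is_atom(A),
    %% A string literal is only sugar for a list of integers.
    S = [$f, $n, $o, $r, $d],
    %% Conversions between the representations.
    B = unicode:characters_to_binary(S),
    S = unicode:characters_to_list(B),
    %% Succeeds because the atom fnord already exists in this module.
    A = list_to_existing_atom(S),
    {length(S), byte_size(B)}.
```

For ASCII text the list length and the byte size agree; for other
UTF-8 text the binary is usually longer in bytes than the list is in
code points.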

Caveats:

* string()s are easy to manipulate, with a lot of library functions, but
their memory footprint is large. This makes them unsuitable for keeping
large amounts of data in memory. They are usually fine for smaller things
and for data processed in chunks, however.
* binary() data has a problem if you hold on to sub-binaries: each one keeps
the whole underlying binary alive. This is where `binary:copy/1` comes into
play. The general binary handling documentation in the performance guide
has the details of binaries, sub-binaries, heap binaries and match contexts.
* atom()s have very fast equality checks. But the atom table is limited in
size and atoms are never garbage collected, so you can't have a program
which keeps creating new atoms dynamically. In many cases, however, creating
a mapping from external data to internal atoms yields very efficient programs.
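
Two of these caveats can be sketched in code. This is only an
illustration: the module name, the function names and the
comma-separated input are assumptions of mine, while `binary:split/2`,
`binary:copy/1` and `binary_to_existing_atom/2` are standard.

```erlang
-module(caveats).
-export([first_field/1, to_known_atom/1]).

%% Return the first comma-separated field of Line as a fresh binary.
%% Without binary:copy/1 the result would be a sub-binary that keeps
%% the whole of Line alive for as long as the field is referenced.
first_field(Line) when is_binary(Line) ->
    [Field | _] = binary:split(Line, <<",">>),
    binary:copy(Field).

%% Convert external input to an atom only if that atom already exists,
%% so untrusted input can never grow the atom table.
to_known_atom(Bin) when is_binary(Bin) ->
    try
        {ok, binary_to_existing_atom(Bin, utf8)}
    catch
        error:badarg -> {error, {unknown_atom, Bin}}
    end.
```

`binary:referenced_byte_size/1` is a handy way to check whether a small
binary is still pinning a large one.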

The general rule from my blog post still holds true to this day: it is
often better to internalize external data into a symbolic format. A
language such as Erlang is efficient when it operates on symbolic data:

intern(<<"fnord">>) -> {ok, fnord};
intern(<<"frobnificatization">>) -> {ok, frobnificatization};
intern(Unknown) -> {error, {cannot_intern, Unknown}}.

i(X) ->
    {ok, A} = intern(X),
    A.

and so on. Of course, if the string is truly a random string of information
for humans to read, you can't parse it, but have to lug it around as
satellite data in your application. And then the size caveat of string()s
has to be taken into account for programs which keep a large set of
data in memory.
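
As a rough sketch of how the two ideas combine, suppose a hypothetical
wire format of a command word, a space, and free-form text. The command
is interned to an atom, and the text is carried along as a compact
binary rather than a string(). The format, the module name and
`parse/1` are invented for the example; `intern/1` is the function from
above.

```erlang
-module(wire_msg).
-export([parse/1]).

%% Map known external names to atoms, as in the mail above.
intern(<<"fnord">>) -> {ok, fnord};
intern(<<"frobnificatization">>) -> {ok, frobnificatization};
intern(Unknown) -> {error, {cannot_intern, Unknown}}.

%% Split "Command Text" at the first space. The command becomes an
%% atom; the text stays a binary, copied so that it does not keep the
%% whole input line alive.
parse(Line) when is_binary(Line) ->
    case binary:split(Line, <<" ">>) of
        [Cmd, Text] ->
            case intern(Cmd) of
                {ok, A} -> {ok, {A, binary:copy(Text)}};
                {error, _} = Err -> Err
            end;
        _ ->
            {error, {malformed, Line}}
    end.
```

The rest of the program can then pattern match on the atom and never
has to look inside the text.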

(Note: This is without discussing the problem of handling Unicode data,
which other people have covered well already.)

-- 
J.