[erlang-questions] strings vs binaries

Wed Aug 19 11:44:20 CEST 2015

Someone did some benchmarking on this a few years ago to compare the speed
of using binary utf-8 encoded strings vs list strings. They arrived at the
(for them) surprising result that using list strings was actually faster. It
is actually not surprising if you think about it. So what Richard says is
definitely very relevant:

Always, you need to ask "What am I going to do with this?"

Robert

On 19 August 2015 at 10:15, Jesper Louis Andersen <
jesper.louis.andersen@REDACTED> wrote:

>
> On Tue, Aug 18, 2015 at 10:16 PM, Ben Hsu <benhsu@REDACTED> wrote:
>
>> My question is when you will use each one. Are binary strings used for
>> sending data over the wire, and normal strings used internally? what are
>> the tradeoffs?
>
>
> Somewhat absent from the discussion until now is the third string-like type
> in Erlang: the atom().
>
> * A string() is a linked list of unicode code points on the process heap.
> * A binary() is a buffer of binary data which can contain utf-8 encoded
> strings. It is either allocated on the process heap or, for large
> binaries, in the binary() arena in the memory allocator. This arena is
> shared among all processes.
> * An atom() is a name which is hashed to an integer internally and then
> the integer is used in place of the name.
>
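> As a rough sketch of the three representations side by side (the module
> and function names are made up for illustration, and
> erts_debug:flat_size/1 is an undocumented debugging helper, so treat the
> numbers as indicative only):
>
> ```erlang
> -module(repr_sketch).
> -export([sizes/0]).
>
> %% Hypothetical helper: holds the same text as a list string, a binary
> %% and an atom, and reports a rough size for each.
> sizes() ->
>     Str  = "fnord",            % linked list of code points
>     Bin  = <<"fnord"/utf8>>,   % buffer of UTF-8 encoded bytes
>     Atom = fnord,              % name interned in the atom table
>     [{string_words, erts_debug:flat_size(Str)},   % heap words for the list
>      {binary_bytes, byte_size(Bin)},              % payload bytes
>      {atom_words,   erts_debug:flat_size(Atom)}]. % heap words for the atom
> ```
>
> A list cell takes two words, so on a 64-bit system the list form costs
> 16 bytes per character, against one byte per ASCII character in the
> binary payload.
>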
> Caveats:
>
> * string()'s are easy to manipulate with a lot of functions, but their
> memory footprint is large. This makes them unsuitable for storing large
> amounts of data as strings, if that data is in-memory. They are usually
> fine for smaller things and chunked flows, however.
> * binary() data has problems if you reference subbinaries, because even a
> small subbinary keeps the whole underlying binary around. This is where
> `binary:copy/1` comes into play. The general binary handling
> documentation in the performance guide linked here has the details of
> binaries, subbinaries, heap binaries and match contexts.
> * atom()'s have very fast equality checks. But the atom table is limited
> in size, so you can't have a program which dynamically creates new atoms.
> In many cases however, creating a mapping from external data to internal
> atoms yields very efficient programs.
>
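> A minimal sketch of how the binary() and atom() caveats might be handled
> in practice. The module and both function names are hypothetical; the
> calls used are binary:part/3, binary:copy/1 and
> binary_to_existing_atom/2:
>
> ```erlang
> -module(caveat_sketch).
> -export([first_bytes/2, to_known_atom/1]).
>
> %% Hypothetical helper: keep only the first N bytes of a large binary.
> %% binary:part/3 returns a subbinary that still references Big;
> %% binary:copy/1 makes a fresh copy so Big itself can be collected.
> first_bytes(Big, N)
>   when is_binary(Big), is_integer(N), N >= 0, N =< byte_size(Big) ->
>     binary:copy(binary:part(Big, 0, N)).
>
> %% Hypothetical helper: turn external text into an atom without growing
> %% the atom table. binary_to_existing_atom/2 raises badarg when no such
> %% atom exists yet, which is caught and reported here.
> to_known_atom(Bin) when is_binary(Bin) ->
>     try binary_to_existing_atom(Bin, utf8) of
>         Atom -> {ok, Atom}
>     catch
>         error:badarg -> {error, unknown_atom}
>     end.
> ```
>
> Whether the copy is worth it depends on how long the small piece lives
> compared to the large binary it was cut from.
>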
> The general rule from my blog post still holds true to this day: it is
> often better to internalize external data into a symbolic format. A
> language such as Erlang is efficient when it operates on symbolic data:
>
> intern(<<"fnord">>) -> {ok, fnord};
> intern(<<"frobnificatization">>) -> {ok, frobnificatization};
> intern(Unknown) -> {error, {cannot_intern, Unknown}}.
>
> i(X) ->
>     {ok, A} = intern(X),
>     A.
>
> and so on. Of course, if the string is truly a random string of
> information for humans to read, you can't parse it, but have to lug it
> around as satellite data in your application. And then the size caveat of
> string()'s has to be taken into account for some programs which keep a
> large set of data in-memory.
>
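> To illustrate where intern/1 might sit in a program, here is a sketch of
> interning at the boundary and matching on atoms afterwards. The handle/1
> function and its return values are invented for the example:
>
> ```erlang
> -module(intern_sketch).
> -export([handle/1]).
>
> %% Same mapping as above: external binaries to internal atoms.
> intern(<<"fnord">>) -> {ok, fnord};
> intern(<<"frobnificatization">>) -> {ok, frobnificatization};
> intern(Unknown) -> {error, {cannot_intern, Unknown}}.
>
> %% Hypothetical dispatcher: convert once at the edge of the system,
> %% then work with symbolic data only.
> handle(Raw) when is_binary(Raw) ->
>     case intern(Raw) of
>         {ok, fnord} -> handled_fnord;
>         {ok, frobnificatization} -> handled_frobnificatization;
>         {error, _} = Err -> Err
>     end.
> ```
>
> Since the set of atoms is fixed in the source, untrusted input cannot
> grow the atom table through this path.
>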
> (Note: This is without discussing the problem of handling unicode data,
> which other people covered well already)
>
> --
> J.