[erlang-questions] Strings usage caveats

Mon Mar 5 22:11:44 CET 2012

On 6/03/2012, at 2:49 AM, Aleksandr Vinokurov wrote:

> I have a very poor experience in programming in Erlang/OTP so that
> sentence was rather abstract for me. I suppose that the root of the
> problems with strings is in variables immutability and thus a copying
> of the whole source string in case of its modification. But it seems
> to me that it's not that all.

No, do not suppose that.  After all, strings are immutable in Java as
well, and nobody talks about _them_ being inefficient.  (Except me,
but that's another story, and it's not related to mutability.)

In a Unicode world, there are very very very few cases where it makes
sense to change part of a string other than by
  taking a substring
  concatenating with other strings
  applying a regular expression rewrite (like the 's/old/new/' command
  in Vi(m)).

There are three representations of strings commonly used in Erlang,
and they have different tradeoffs.

(1) Linked lists of character codes.
    "abc" is the same thing as [97,98,99].

    These are simple to process from left to right.
    Dropping the first D characters takes O(D) time (Java: O(1)).
    Taking the first T characters takes O(T) time (Java: O(1)).
    Concatenating length M with length N takes O(M) time (Java: O(M+N)).
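
    As a rough sketch of those three operations (the module and function
    names here are mine, purely for illustration), the standard lists
    module already does all of it:

```erlang
-module(list_string_demo).
-export([drop/2, take/2, concat/2]).

%% "abc" and [97,98,99] are the same term, so plain list functions apply.

%% Drop the first D characters: walks D cons cells, copies nothing.
drop(D, Str) -> lists:nthtail(D, Str).

%% Take the first T characters: builds a new list of T cons cells.
take(T, Str) -> lists:sublist(Str, T).

%% Concatenate: ++ copies the left operand and shares the right one,
%% which is where the O(M) cost comes from.
concat(A, B) -> A ++ B.
```

    For example, list_string_demo:drop(3, "abcdef") gives "def" and
    list_string_demo:take(3, "abcdef") gives "abc".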

    The problem is SPACE.  Each character requires one machine word for
    the character code and another to point to the rest of the list.
    So a string of N characters requires 8N bytes in a 32-bit
    environment, or 16N bytes in a 64-bit environment.  C using UTF-8
    would probably take about 1.1N bytes for the kind of stuff I see
    these days; Java would take 2N.  Of course, to handle full Unicode
    without the kind of contortions Java requires, you'd need 4N bytes
    anyway, so 8N doesn't look as bad as it did when everyone either
    rejoiced in ASCII (or EBCDIC) or cursed its foreign limitations.
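
    If you want to check the two-words-per-character figure yourself,
    something like the following should do.  The module and function names
    are made up for this note, and erts_debug:flat_size/1 is a debugging
    aid rather than a supported API, so treat it as a sketch only:

```erlang
-module(list_space_demo).
-export([estimated_bytes/1, measured_words/1]).

%% Two machine words per character: one for the character code and one
%% for the pointer to the rest of the list.
estimated_bytes(Str) ->
    2 * length(Str) * erlang:system_info(wordsize).

%% erts_debug:flat_size/1 reports the heap size of a term in words.
%% It is a debugging function and may change between releases.
measured_words(Str) ->
    erts_debug:flat_size(Str).
```

    On a 64-bit system estimated_bytes("abc") comes to 48, i.e. 16N.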

(2) Historically, people used "iolists" as strings.
    It has now been clarified that "iolists" are sequences of BYTES,
    not sequences of CHARACTERS.

    iolist --> [] | [byte | iolist] | [binary | iolist] | [iolist | iolist].

    Nothing stops us using essentially the same data structure for
    characters; we just have to realise that what we get is no longer
    an iolist, and built-ins that expect an iolist won't be happy
    with it.  That need not be a problem.

    fclist --> [] | [codepoint | fclist] | [fclist | fclist]

    These are slightly trickier to process from left to right, but
    only *slightly*.
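
    To show how slight: here is one way a left-to-right fold over the
    fclist grammar above might look.  "fclist_demo", "foldl" and "len"
    are names I have invented for the example; nothing like this ships
    with OTP as far as this sketch is concerned.

```erlang
-module(fclist_demo).
-export([foldl/3, len/1]).

%% foldl(F, Acc, FcList) calls F(CodePoint, Acc) for every code point,
%% left to right, without flattening the structure first.
foldl(_F, Acc, []) ->
    Acc;
foldl(F, Acc, [H | T]) when is_integer(H) ->
    %% Head is a code point: handle it, then continue with the tail.
    foldl(F, F(H, Acc), T);
foldl(F, Acc, [H | T]) when is_list(H) ->
    %% Head is itself an fclist: fold through it before the tail.
    foldl(F, foldl(F, Acc, H), T).

%% Number of code points in an fclist.
len(FcList) ->
    foldl(fun(_C, N) -> N + 1 end, 0, FcList).
```

    The only extra work compared with a flat list is the third clause.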

    Selecting substrings is not particularly pleasant, and I cannot
    make it happen without turning over some garbage.

    Where these things shine is concatenation, which is O(1).
    The way to use them is to build up a complex string by concatenating
    all the pieces, then flattening it at the end.  This is, if you will,
    the Erlang analogue of Java's StringBuilder.  (Analogue, NOT homologue!)
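
    A minimal sketch of that build-then-flatten pattern, with a made-up
    module and function name:

```erlang
-module(fclist_build_demo).
-export([greeting/1]).

%% Each step wraps the existing pieces in a new two-element list, so the
%% cost of a "concatenation" does not depend on how long the pieces are.
greeting(Name) ->
    P0 = "Hello, ",
    P1 = [P0, Name],
    P2 = [P1, "!"],
    %% Pay the linear cost once, at the very end.
    lists:flatten(P2).
```

    fclist_build_demo:greeting("world") returns "Hello, world!".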
    Dropping the first D characters and then taking the next T
    costs 

(3) Binaries.
    <<"abc">> is the same as <<97,98,99>>.  A binary is a packed immutable
    array of bytes.  (Well, it can be thought of as a bit string these days,
    but "bytes" are all we need for strings.)

    Binaries are very close to Java strings, except for usually taking less
    space.  They can be sliced in constant time.  They don't support fast
    concatenation, but then you can build up a list of binaries and
    concatenate them at the end.
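
    Roughly, and again with illustrative names of my own choosing,
    slicing and the collect-then-concatenate approach look like this:

```erlang
-module(binary_string_demo).
-export([slice/3, join/1]).

%% binary:part/3 takes a zero-based start position and a length, and
%% returns the corresponding slice of the binary.
slice(Bin, Start, Len) ->
    binary:part(Bin, Start, Len).

%% Collect the pieces in a list and build the final binary in one go.
join(Pieces) ->
    iolist_to_binary(Pieces).
```

    So binary_string_demo:slice(<<"abcdef">>, 2, 3) is <<"cde">>, and
    binary_string_demo:join([<<"ab">>, <<"cd">>, <<"ef">>]) is <<"abcdef">>.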

In all three cases, the data structure itself does not keep track of what the
encoding is.  (And of course, neither do Java or C#.)
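
To make that concrete, here is a small sketch (the module and function
names are hypothetical) in which the same two bytes give different
characters depending on which encoding the reader assumes:

```erlang
-module(encoding_demo).
-export([same_bytes_two_readings/0]).

same_bytes_two_readings() ->
    %% U+00E9 (e with acute accent) encoded as UTF-8 is the two bytes 195,169.
    Bytes = unicode:characters_to_binary([233]),
    %% Read back as UTF-8: one character.
    AsUtf8 = unicode:characters_to_list(Bytes, utf8),
    %% Read back as Latin-1: two characters, one per byte.
    AsLatin1 = unicode:characters_to_list(Bytes, latin1),
    {Bytes, AsUtf8, AsLatin1}.
```

Nothing in the binary says which reading is the right one; the program
has to carry that knowledge separately.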

> 
> Can you please supply me with the sources to read or examples and
> hints about strings performance in Erlang.

It is all completely obvious once you know what the data structure is.
Simple one-way linked lists of character codes,
or (references to slices of) packed arrays of bytes.