[erlang-questions] Strings usage caveats
Richard O'Keefe
ok@REDACTED
Mon Mar 5 22:11:44 CET 2012
On 6/03/2012, at 2:49 AM, Aleksandr Vinokurov wrote:
> I have a very poor experience in programming in Erlang/OTP so that
> sentence was rather abstract for me. I suppose that the root of the
> problems with strings is in variables immutability and thus a copying
> of the whole source string in case of its modification. But it seems
> to me that it's not that all.
No, do not suppose that. After all, strings are immutable in Java as
well, and nobody talks about _them_ being inefficient. (Except me,
but that's another story, and it's not related to mutability.)
In a Unicode world, there are very very very few cases where it makes
sense to change part of a string other than by
taking a substring
concatenating with other strings
applying a regular expression rewrite (like the 's/old/new/' command
in Vi(m)).
There are three representations of strings commonly used in Erlang,
and they have different tradeoffs.
(1) Linked lists of character codes.
"abc" is the same thing as [97,98,99].
These are simple to process from left to right.
Dropping the first D characters takes O(D) time (Java: O(1)).
Taking the first T characters takes O(T) time (Java: O(1)).
Concatenating length M with length N takes O(M) time (Java: O(M+N)).
The problem is SPACE. Each character requires one machine word for
the character code and another to point to the rest of the list.
So a string of N characters requires 8N bytes in a 32-bit
environment, or 16N bytes in a 64-bit environment. C using UTF-8
would probably take about 1.1N bytes for the kind of stuff I see
these days; Java would take 2N. Of course, to handle full Unicode
without the kind of contortions Java requires, you'd need 4N bytes
anyway, so 8N doesn't look as bad as it did when everyone either
rejoiced in ASCII (or EBCDIC) or cursed its foreign limitations.
(2) Historically, people used "iolists" as strings.
It has now been clarified that "iolists" are sequences of BYTES,
not sequences of CHARACTERS.
iolist --> [] | [byte | iolist] | [iolist | iolist].
Nothing stops us using essentially the same data structure for
characters, we just have to realise that what we get is not an
iolist any more and built-ins that expect an iolist won't be
happy with it. That need not be a problem.
fclist --> [] | [codepoint | fclist] | [fclist | fclist]
These are slightly trickier to process from left to right, but
only *slightly*.
Selecting substrings is not particularly pleasant, and I cannot
make it happen without turning over some garbage.
Where these things shine is concatenation, which is O(1).
The way to use them is to build up a complex string by concatenating
all the pieces, then flattening it at the end. This is, if you will,
the Erlang analogue of Java's StringBuilder. (Analogue, NOT homologue!)
Dropping the first D characters and then taking the next T
costs
(3) Binaries.
<<"abc">> is the same as <<97,98,99>>. A binary is a packed immutable
array of bytes. (Well, it can be thought of as a bit string these days,
but "bytes" are all we need for strings.)
Binaries are very close to Java strings, except for usually taking less
space. They can be sliced in constant time. They don't support fast
concatenation, but then you can build up a list of binaries and
concatenate them at the end.
In all three cases, the data structure itself does not keep track of what the
encoding is. (And of course, neither do Java or C#.)
>
> Can you please supply me with the sources to read or examples and
> hints about strings performance in Erlang.
It is all completely obvious once you know what the data structure is.
Simple one-way linked lists of character codes,
or (references to slices of) packed arrays of bytes.
More information about the erlang-questions
mailing list