[erlang-questions] Strings as Lists
Richard A. O'Keefe
Thu Feb 14 04:49:44 CET 2008
On 13 Feb 2008, at 12:41 pm, Kevin Scaldeferri wrote:
> Hold on... lists aren't really a particular convenient or efficient
> data structure for working with strings.
For some people they are.
> First off, I append to
> strings a lot more than I prepend to them.
I take this to mean that you do stuff like
S0 = ~0~,
S1 = S0 ++ ~1~,
Sn = Sn_1 ++ ~n~
perhaps in a loop. Right, this is not efficient. But it is
inefficient in programming languages with more conventional
It is O(n**2). For example,
x = ""
for (i = 1; i <= 100000; i++) x = x "a"
just took 30.5 seconds in awk on my machine, 62.2 seconds in Perl,
and a massive
631 seconds in Java. That was using gcj; I lost patience with the
Sun SDK and
killed it. (AWK faster than Java? Yes, it often is.) Building the
in Erlang using
loop(0, S) -> lists:reverse(S);
loop(N, S) -> loop(N-1, "a"++S).
takes 0.15 second on the same machine.
> Yeah, I could work with
> reversed strings, but that's a hack to deal with using the wrong data
No, it has to do with the fact that appending on the right is O(n)
whether one uses lists or arrays. Arguably, you should be using the
right data type THE RIGHT WAY. Can you provide an example of your code?
Alternatively, maybe iolists would suit you better:
loop(0, S) -> lists:flatten(S);
loop(N, S) -> loop(N-1, [S|"a"]).
also takes 0.15 second, and is appending on the right. It's quite a
principle that the data structure or traversal that is convenient for
something up isn't necessarily the same as the data structure or
is convenient for processing it afterwards;
> Plus, I probably prefix match more often than suffix matching
> (although this is less lopsided than append vs. prepend). Of course,
> I also like to do substring matching and regular expressions quite a
> bit, and Boyer-Moore is definitely more efficient with arrays than
The Boyer-Moore algorithm requires space proportional to the alphabet
The Unicode alphabet size is enormous (last time I looked there were
100 000 characters defined). Unicode substring matching is
given that, for example, you would like (e,floating acute) to match
the Boyer-Moore algorithm assumes a unique encoding of any given string,
which Unicode does not even begin to have. Then there are fun things
U+0028 is the code for the "(" character, but if I see a "(" and go
for it, I might have to look for a U+0029 instead, and there really
seem to be any way of dealing with that without processing a bunch of
control characters. (I know I am talking about appearance here, not
but sometimes one wants to search based on appearance.)
> It's probably worth noting that none of the languages which are
> considered _good_ for working with strings (AFAIK) use a list
Not really. Perl is normally "considered _good_ for working with
but it is pretty much hopeless at building strings by concatenation.
(Yes, I know about StringBuffer and StringBuilder, but they are
from String. The only language I know that's really good at it is
Smalltalk, where you can do
result := String writeStream: [:s |
"any code you want, writing to the character stream s"].
and the internal representation of the intermediate result is none of
More information about the erlang-questions