[erlang-questions] Strings as Lists
Richard A. O'Keefe
ok@REDACTED
Thu Feb 14 04:49:44 CET 2008
On 13 Feb 2008, at 12:41 pm, Kevin Scaldeferri wrote:
> Hold on... lists aren't really a particular convenient or efficient
> data structure for working with strings.
For some people they are.
> First off, I append to
> strings a lot more than I prepend to them.
I take this to mean that you do stuff like
S0 = ~0~,
S1 = S0 ++ ~1~,
...
Sn = Sn_1 ++ ~n~
perhaps in a loop. Right, this is not efficient. But it is
spectacularly
inefficient in programming languages with more conventional
representations.
It is O(n**2). For example,
x = ""
for (i = 1; i <= 100000; i++) x = x "a"
just took 30.5 seconds in awk on my machine, 62.2 seconds in Perl,
and a massive
631 seconds in Java. That was using gcj; I lost patience with the
Sun SDK and
killed it. (AWK faster than Java? Yes, it often is.) Building the
same string
in Erlang using
loop(100000, "")
where
loop(0, S) -> lists:reverse(S);
loop(N, S) -> loop(N-1, "a"++S).
takes 0.15 second on the same machine.
> Yeah, I could work with
> reversed strings, but that's a hack to deal with using the wrong data
> type.
No, it has to do with the fact that appending on the right is O(n)
whether one uses lists or arrays. Arguably, you should be using the
right data type THE RIGHT WAY. Can you provide an example of your code?
Alternatively, maybe iolists would suit you better:
loop(0, S) -> lists:flatten(S);
loop(N, S) -> loop(N-1, [S|"a"]).
also takes 0.15 second, and is appending on the right. It's quite a
general
principle that the data structure or traversal that is convenient for
building
something up isn't necessarily the same as the data structure or
traversal that
is convenient for processing it afterwards;
> Plus, I probably prefix match more often than suffix matching
> (although this is less lopsided than append vs. prepend). Of course,
> I also like to do substring matching and regular expressions quite a
> bit, and Boyer-Moore is definitely more efficient with arrays than
> lists.
The Boyer-Moore algorithm requires space proportional to the alphabet
size.
The Unicode alphabet size is enormous (last time I looked there were
nearly
100 000 characters defined). Unicode substring matching is
definitely nasty,
given that, for example, you would like (e,floating acute) to match
é, and
the Boyer-Moore algorithm assumes a unique encoding of any given string,
which Unicode does not even begin to have. Then there are fun things
like
U+0028 is the code for the "(" character, but if I see a "(" and go
looking
for it, I might have to look for a U+0029 instead, and there really
doesn't
seem to be any way of dealing with that without processing a bunch of
control characters. (I know I am talking about appearance here, not
semantics,
but sometimes one wants to search based on appearance.)
>
> It's probably worth noting that none of the languages which are
> considered _good_ for working with strings (AFAIK) use a list
> representation.
Not really. Perl is normally "considered _good_ for working with
strings",
but it is pretty much hopeless at building strings by concatenation.
(Yes, I know about StringBuffer and StringBuilder, but they are
DIFFERENT
from String. The only language I know that's really good at it is
Smalltalk, where you can do
result := String writeStream: [:s |
"any code you want, writing to the character stream s"].
and the internal representation of the intermediate result is none of
your
business.
More information about the erlang-questions
mailing list