[erlang-questions] Strings as Lists

Thu Feb 14 20:25:21 CET 2008

On Feb 13, 2008, at 8:02 PM, Richard A.O'Keefe wrote:

> On 13 Feb 2008, at 12:41 pm, Kevin Scaldeferri wrote:
>> Hold on... lists aren't really a particular convenient or efficient
>> data structure for working with strings.
>
> For some people they are.
>
>> First off, I append to
>> strings a lot more than I prepend to them.
>
> I take this to mean that you do stuff like
>
> 	S0 = ~0~,
> 	S1 = S0 ++ ~1~,
> 	...
> 	Sn = Sn_1 ++ ~n~

No, certainly not one character at a time.  But I frequently build up  
an HTML document or some report piece by piece.

>
>
> perhaps in a loop.  Right, this is not efficient.  But it is
> spectacularly
> inefficient in programming languages with more conventional
> representations.
> It is O(n**2).

No, in a decent string implementation it is O(n).

> For example,
> 	x = ""
> 	for (i = 1; i <= 100000; i++) x = x "a"
> just took 30.5 seconds in awk on my machine, 62.2 seconds in Perl,
> and a massive
> 631 seconds in Java.

You either have a very old perl, a very slow machine, or have  
implemented it quite badly:

% time perl -le 'my $x = ""; $x .= "a" for 1..100_000; print length $x'
100000
perl -le '...'  0.02s user 0.00s system 89% cpu 0.024 total

even faster (but even less realistic, and not really the same  
benchmark):

% time perl -le 'my $x = "a" x 100_000; print length $x'
100000
perl -le '...'  0.00s user 0.00s system 85% cpu 0.006 total

>
>>  Yeah, I could work with
>> reversed strings, but that's a hack to deal with using the wrong data
>> type.
>
> No, it has to do with the fact that appending on the right is O(n)
> whether one uses lists or arrays.  Arguably, you should be using the
> right data type THE RIGHT WAY.  Can you provide an example of your  
> code?

Someone else explained how C++ strings/vectors handle this.  Perl uses  
essentially the same strategy.

>
>
>> Plus, I probably prefix match more often than suffix matching
>> (although this is less lopsided than append vs. prepend).  Of course,
>> I also like to do substring matching and regular expressions quite a
>> bit, and Boyer-Moore is definitely more efficient with arrays than
>> lists.
>
> The Boyer-Moore algorithm requires space proportional to the alphabet
> size.
> The Unicode alphabet size is enormous (last time I looked there were
> nearly
> 100 000 characters defined).

If you use UTF-8 encoded strings, you can use Boyer-Moore unmodified  
(matching at the level of bytes, rather than characters).  For a fixed  
width encoding, you can also work with bytes, but you'll have to take  
the character width into account when considering how to slide.

>  Unicode substring matching is
> definitely nasty,
> given that, for example, you would like (e,floating acute) to match
> é, and
> the Boyer-Moore algorithm assumes a unique encoding of any given  
> string,
> which Unicode does not even begin to have.  Then there are fun things
> like
> U+0028 is the code for the "(" character, but if I see a "(" and go
> looking
> for it, I might have to look for a U+0029 instead,

Indeed, Unicode is really a PITA like this, but I don't think a list  
representation makes this magically go away, either.  In fact, in your  
example with combining characters, you need to be able to backtrack,  
which is more awkward and less performant on lists than arrays.  But,  
if you disagree, I think this would be a very interesting sort of  
benchmark to do.  Currently, Erlang appears to perform quite awfully  
on anything to do with strings:

http://shootout.alioth.debian.org/gp4/benchmark.php?test=all&lang=hipe&lang2=perl

Now, most of these conflate string handling and file I/O, so it could  
be that file I/O is more of the problem.  Or maybe they are badly  
implemented.  Since you (and many on this list) are concerned highly  
with Unicode string handling, I think it would be quite interested to  
design a benchmark of that.   Perhaps something that did substring (or  
regexp) matching on a large number of candidate strings.  Perhaps  
Erlang would do better.  Certainly I'd expect it to be much less code  
than C/C++ (although that's practically a given).  I'd be very  
interested to compare to Perl's Unicode strings, or Java.

>>
>> It's probably worth noting that none of the languages which are
>> considered _good_ for working with strings (AFAIK) use a list
>> representation.
>
> Not really.  Perl is normally "considered _good_ for working with
> strings",
> but it is pretty much hopeless at building strings by concatenation.

I hope I've demonstrated above that this claim is false.

-kevin