[erlang-questions] Strings as Lists

Thu Feb 14 05:02:30 CET 2008

On 13 Feb 2008, at 12:41 pm, Kevin Scaldeferri wrote:
> Hold on... lists aren't really a particular convenient or efficient
> data structure for working with strings.

For some people they are.

 > First off, I append to
> strings a lot more than I prepend to them.

I take this to mean that you do stuff like

	S0 = ~0~,
	S1 = S0 ++ ~1~,
	...
	Sn = Sn_1 ++ ~n~

perhaps in a loop.  Right, this is not efficient.  But it is  
spectacularly
inefficient in programming languages with more conventional  
representations.
It is O(n**2).  For example,
	x = ""
	for (i = 1; i <= 100000; i++) x = x "a"
just took 30.5 seconds in awk on my machine, 62.2 seconds in Perl,  
and a massive
631 seconds in Java.  That was using gcj; I lost patience with the  
Sun SDK and
killed it.  (AWK faster than Java?  Yes, it often is.) Building the  
same string
in Erlang using
	loop(100000, "")
where
	loop(0, S) -> lists:reverse(S);
	loop(N, S) -> loop(N-1, "a"++S).
takes 0.15 second on the same machine.

>   Yeah, I could work with
> reversed strings, but that's a hack to deal with using the wrong data
> type.

No, it has to do with the fact that appending on the right is O(n)
whether one uses lists or arrays.  Arguably, you should be using the
right data type THE RIGHT WAY.  Can you provide an example of your code?

Alternatively, maybe iolists would suit you better:

	loop(0, S) -> lists:flatten(S);
	loop(N, S) -> loop(N-1, [S|"a"]).

also takes 0.15 second, and is appending on the right.  It's quite a  
general
principle that the data structure or traversal that is convenient for  
building
something up isn't necessarily the same as the data structure or  
traversal that
is convenient for processing it afterwards;

> Plus, I probably prefix match more often than suffix matching
> (although this is less lopsided than append vs. prepend).  Of course,
> I also like to do substring matching and regular expressions quite a
> bit, and Boyer-Moore is definitely more efficient with arrays than
> lists.

The Boyer-Moore algorithm requires space proportional to the alphabet  
size.
The Unicode alphabet size is enormous (last time I looked there were  
nearly
100 000 characters defined).  Unicode substring matching is  
definitely nasty,
given that, for example, you would like (e,floating acute) to match  
é, and
the Boyer-Moore algorithm assumes a unique encoding of any given string,
which Unicode does not even begin to have.  Then there are fun things  
like
U+0028 is the code for the "(" character, but if I see a "(" and go  
looking
for it, I might have to look for a U+0029 instead, and there really  
doesn't
seem to be any way of dealing with that without processing a bunch of
control characters.  (I know I am talking about appearance here, not  
semantics,
but sometimes one wants to search based on appearance.)
>
> It's probably worth noting that none of the languages which are
> considered _good_ for working with strings (AFAIK) use a list
> representation.

Not really.  Perl is normally "considered _good_ for working with  
strings",
but it is pretty much hopeless at building strings by concatenation.
The only language I know that's really good at it is Smalltalk, where  
you can do

	result := String writeStream: [:s |
	   "any code you want, writing to the character stream s"].

and the internal representation of the intermediate result is none of  
your
business.  There are also libraries from the functional programming  
world
that give you efficient incremental update of large strings, but they  
are
fairly heavyweight.