[erlang-questions] Strings as Lists

Fri Feb 15 03:33:01 CET 2008

On 14 Feb 2008, at 7:07 pm, Zvi wrote:
> Richard,
>
> I think you confusing datatype with it's implementation/ 
> representation.

On the contrary.  Remember the topic of this thread?  Someone complained
that Erlang uses lists for strings (as its default implementation).   
I have
explained why this is a perfectly reasonable, indeed seriously good,  
way to
go for *some* uses of strings, but I have *also* explicitly made the  
point
that There Is More Than One Way To Do It, and that using the "default"
representation in ANY programming language is not always a good idea.

> My biggest problem with Erlang standard string representation is  
> that in
> 64bit mode, each character taking 16 bytes.

Now *you* are focussing on a particular implementation.
Erlang isn't standing behind your chair pointing a gun at your head and
saying "you MUST use lists for text or I will kill you"!

Given that Erlang processes in a single program may easily be scattered
across any number of machines, it seems to me that if you were seriously
worried about storage, you would be worried about ALL storage, not just
strings, and would be running many 32-bit OS-level processes on a  
machine
in order to halve your storage requirements.  As long as processes that
communicated a lot tended to be in the same process, this should work.
On today's multicore machines, this is (or should be) an attractive  
approach.

Use a tuple, and you will still be taking 8 bytes per Unicode character
when 3 will suffice.  By packing 3 unicode characters to an integer,
and using unrolled strict lists instead of plain lists (I have a first
draft of an ursl module with a subset of the lists: operations, if  
anyone
wants it) you can get 3*4 characters in 6*8 bytes, or 4 bytes each.
That's a factor of four.  And it's easy enough to write a preprocessor
to convert string literals (possibly flagged in some way) to that form.

More importantly, holding quantities of text in memory that are big  
enough
for the space required to be a serious problem is Not The Erlang  
Way.  The
idea is to stream *chunks* of text through the system.

> So typical message of 10KB after converted to XML become 20KB

Why are you converting to XML?
I know XML is bulky, but a doubling in size like that suggests a poorly
chosen XML schema.  Why are you not converting to *compressed* XML?

> and after parsed by 64bit Erlang VM become 320KB.

But why do that?  Lists of characters are, and are intended to be,  
useful
for *processing* (chunks of) text in Erlang.  If someone is handing you
globs of stuff that are you simply going to squirt down a wire, then a
binary is exactly what you want.

> Also, you forget that what was good for LISP Machines is not good for
> current machine architectures.

Please, do not make a habit of assuming your correspondent is a moron.
You appear to have missed the point, which is that what people THOUGHT
was good for Lisp machines WASN'T good for Lisp machines.  The only real
way to KNOW what is good or not is to implement and MEASURE.  You had
better believe that on a machine with a 16-bit bus, where we had a total
of 4MB address space available for Prolog's use, people complained  
bitterly
about the *obvious* inefficiency of using 8 bytes per character  
instead of
1.  With respect to time, they were simply wrong.  With respect to  
space,
the answer was "program so it doesn't matter".  We didn't go around  
converting
bitmaps to lists!

> Binary strings can be handled very efficiently using vectorized  
> SIMD code
> and they use modern caches much more effectively, than lists.

Um.  There are so many presuppositions in there that it's really not  
practical
to address them all in a single message.  First, the number of things  
that can
be done efficiently to *Unicode* using Vis/MMX/AltiVec instructions  
is rather
limited.  Second, what it's mostly limited to is things you should  
not be
doing.

But the big thing is that for many text processing tasks *both* lists  
*and*
binaries are spectacularly inefficient because of the limits they put on
sharing.  For example, my XML processing kit in C cannot and will not  
*change*
a document representation, so for most of the transformations I do, the
transformed XML data structure shares a large proportion of its  
memory with
the input.  You can't do that with a byte-vector string, and SIMD  
instructions
don't help with it.  DAGs are wonderful!

>
> Deep lists and io-lists are essentially analog of Java's  
> StringBuffer or C++
> STL's ostringstream.

Not "essentially".  They are *essentially* trees that offer O(1)  
concatenation
and shared substructure.  StringBuilder (I do hope your Java code  
doesn't use
StringBuffer any more) doesn't offer either of those.

> So in general I want immutable String ADT. The ADT implementation  
> should be
> smart enough, to switch to the best representation, according to  
> usage.

Who was the famous computer scientist who said
	"If anyone says to you, 'I want a programming language in which
	 I just say what I want', give him a lollipop."

You want a data structure which can't change (it's immutable) but does
change (it switches to another representation)?

We've been there.  Anyone else in this mailing list remember SETL?
Anyone remember the hopes for automatic selection of representation?
As I recall it, the optimiser got so big that they were never able to
run it over the compiler.  (And if anyone, DOES remember SETL and knows
how I can get in touch with David Bacon, I'd be grateful.  And if you  
have
a copy of On Programming that you don't want, I could give it a  
loving home.)
>

> BTW: most sophisticated string implementation I know about was in  
> SNOBOL. I
> think it was list of unique substrings.

The Bell Labs SNOBOL4 system and the SPITBOL system used different  
string
representations.  The main problem for SNOBOL was trying to support  
computer
hardware that didn't have byte addressing.