[erlang-questions] Strings as Lists
Richard A. O'Keefe
Fri Feb 15 03:33:01 CET 2008
On 14 Feb 2008, at 7:07 pm, Zvi wrote:
> I think you confusing datatype with it's implementation/
On the contrary. Remember the topic of this thread? Someone complained
that Erlang uses lists for strings (as its default implementation).
explained why this is a perfectly reasonable, indeed seriously good,
go for *some* uses of strings, but I have *also* explicitly made the
that There Is More Than One Way To Do It, and that using the "default"
representation in ANY programming language is not always a good idea.
> My biggest problem with Erlang standard string representation is
> that in
> 64bit mode, each character taking 16 bytes.
Now *you* are focussing on a particular implementation.
Erlang isn't standing behind your chair pointing a gun at your head and
saying "you MUST use lists for text or I will kill you"!
Given that Erlang processes in a single program may easily be scattered
across any number of machines, it seems to me that if you were seriously
worried about storage, you would be worried about ALL storage, not just
strings, and would be running many 32-bit OS-level processes on a
in order to halve your storage requirements. As long as processes that
communicated a lot tended to be in the same process, this should work.
On today's multicore machines, this is (or should be) an attractive
Use a tuple, and you will still be taking 8 bytes per Unicode character
when 3 will suffice. By packing 3 unicode characters to an integer,
and using unrolled strict lists instead of plain lists (I have a first
draft of an ursl module with a subset of the lists: operations, if
wants it) you can get 3*4 characters in 6*8 bytes, or 4 bytes each.
That's a factor of four. And it's easy enough to write a preprocessor
to convert string literals (possibly flagged in some way) to that form.
More importantly, holding quantities of text in memory that are big
for the space required to be a serious problem is Not The Erlang
idea is to stream *chunks* of text through the system.
> So typical message of 10KB after converted to XML become 20KB
Why are you converting to XML?
I know XML is bulky, but a doubling in size like that suggests a poorly
chosen XML schema. Why are you not converting to *compressed* XML?
> and after parsed by 64bit Erlang VM become 320KB.
But why do that? Lists of characters are, and are intended to be,
for *processing* (chunks of) text in Erlang. If someone is handing you
globs of stuff that are you simply going to squirt down a wire, then a
binary is exactly what you want.
> Also, you forget that what was good for LISP Machines is not good for
> current machine architectures.
Please, do not make a habit of assuming your correspondent is a moron.
You appear to have missed the point, which is that what people THOUGHT
was good for Lisp machines WASN'T good for Lisp machines. The only real
way to KNOW what is good or not is to implement and MEASURE. You had
better believe that on a machine with a 16-bit bus, where we had a total
of 4MB address space available for Prolog's use, people complained
about the *obvious* inefficiency of using 8 bytes per character
1. With respect to time, they were simply wrong. With respect to
the answer was "program so it doesn't matter". We didn't go around
bitmaps to lists!
> Binary strings can be handled very efficiently using vectorized
> SIMD code
> and they use modern caches much more effectively, than lists.
Um. There are so many presuppositions in there that it's really not
to address them all in a single message. First, the number of things
be done efficiently to *Unicode* using Vis/MMX/AltiVec instructions
limited. Second, what it's mostly limited to is things you should
But the big thing is that for many text processing tasks *both* lists
binaries are spectacularly inefficient because of the limits they put on
sharing. For example, my XML processing kit in C cannot and will not
a document representation, so for most of the transformations I do, the
transformed XML data structure shares a large proportion of its
the input. You can't do that with a byte-vector string, and SIMD
don't help with it. DAGs are wonderful!
> Deep lists and io-lists are essentially analog of Java's
> StringBuffer or C++
> STL's ostringstream.
Not "essentially". They are *essentially* trees that offer O(1)
and shared substructure. StringBuilder (I do hope your Java code
StringBuffer any more) doesn't offer either of those.
> So in general I want immutable String ADT. The ADT implementation
> should be
> smart enough, to switch to the best representation, according to
Who was the famous computer scientist who said
"If anyone says to you, 'I want a programming language in which
I just say what I want', give him a lollipop."
You want a data structure which can't change (it's immutable) but does
change (it switches to another representation)?
We've been there. Anyone else in this mailing list remember SETL?
Anyone remember the hopes for automatic selection of representation?
As I recall it, the optimiser got so big that they were never able to
run it over the compiler. (And if anyone, DOES remember SETL and knows
how I can get in touch with David Bacon, I'd be grateful. And if you
a copy of On Programming that you don't want, I could give it a
> BTW: most sophisticated string implementation I know about was in
> SNOBOL. I
> think it was list of unique substrings.
The Bell Labs SNOBOL4 system and the SPITBOL system used different
representations. The main problem for SNOBOL was trying to support
hardware that didn't have byte addressing.
More information about the erlang-questions