[erlang-questions] Strings as Lists

Thu Feb 14 02:14:56 CET 2008

On 13 Feb 2008, at 5:19 am, tsuraan wrote:

> Why does erlang internally represent strings as lists?  In every  
> language I've used other than Java, a string is a sequence of  
> octets, just like Erlang's binary type.

Recall the programming proverb:
	it is better to have one data type with 100 functions
	than 10 data types with 10 functions each.

It is actually quite commonplace in declarative languages (Prolog,  
Mercury,
Haskell, some others) to implement strings as lists of characters  
because
that way you get a vast number of functions you can usefully apply to  
strings.
Not only that, you can process strings *incrementally* if they are  
lists.

I never tire of telling this story:
	I was on the team that ported Quintus Prolog from the "UNIX" world
	to the Xerox Lisp machines (the 1108, 1109, 1185, and 1186, aka
	Dandelion, Dandetiger, Daybreak, and something else I forget).
	These machines compiled Interlisp to bytecode which was then
	executed by microcode, and gave very respectable performance for
	Lisp, in their day.  The existing microcode supported Interlisp
	strings, which were byte vectors as you describe.

	There wasn't much microcode space left to support Quintus Prolog,
	so we had a microcoded WAM with plain old 2 word list cells and
	strings represented by 1 list cell per character.

	A series of benchmarks I did showed string processing going
	FIVE TIMES FASTER in Prolog using lists than in Lisp using byte
	vectors, largely because we could represent "the rest of a string"
	using NO new allocations whatever.

The Java string representation is (slice of (array of uint16_t))  
which means
that "the rest of a string" costs only O(1) time and space, but O(1)  
extra
space isn't O(0) extra space.  I would expect most Javascript  
implementations
to use something similar to this.

The guiding rule is
  - if you just want to hold onto a string for a while, use a binary
  - if you want to build or process a string, use a list (possibly in
    Erlang a deep list).
  - if you want to represent something that has structure, and you want
    your program to be aware of that structure, turn it into a  
structured
    data value and work with it in that form.

Basically, most languages get strings embarassingly wrong.  Another  
story
I like to tell is how I was able to do in 10 pages of C what a friend of
mine needed 150 pages of PL/I to accomplish, largely because PL/I *does*
have a "real" string data type and C doesn't, so I could accomplish what
I needed very simply, while he had to fight the language every step  
of the
way.