[erlang-questions] Strings as Lists
Richard A. O'Keefe
ok@REDACTED
Thu Feb 14 02:14:56 CET 2008
On 13 Feb 2008, at 5:19 am, tsuraan wrote:
> Why does erlang internally represent strings as lists? In every
> language I've used other than Java, a string is a sequence of
> octets, just like Erlang's binary type.
Recall the programming proverb:
it is better to have one data type with 100 functions
than 10 data types with 10 functions each.
It is actually quite commonplace in declarative languages (Prolog,
Mercury,
Haskell, some others) to implement strings as lists of characters
because
that way you get a vast number of functions you can usefully apply to
strings.
Not only that, you can process strings *incrementally* if they are
lists.
I never tire of telling this story:
I was on the team that ported Quintus Prolog from the "UNIX" world
to the Xerox Lisp machines (the 1108, 1109, 1185, and 1186, aka
Dandelion, Dandetiger, Daybreak, and something else I forget).
These machines compiled Interlisp to bytecode which was then
executed by microcode, and gave very respectable performance for
Lisp, in their day. The existing microcode supported Interlisp
strings, which were byte vectors as you describe.
There wasn't much microcode space left to support Quintus Prolog,
so we had a microcoded WAM with plain old 2 word list cells and
strings represented by 1 list cell per character.
A series of benchmarks I did showed string processing going
FIVE TIMES FASTER in Prolog using lists than in Lisp using byte
vectors, largely because we could represent "the rest of a string"
using NO new allocations whatever.
The Java string representation is (slice of (array of uint16_t))
which means
that "the rest of a string" costs only O(1) time and space, but O(1)
extra
space isn't O(0) extra space. I would expect most Javascript
implementations
to use something similar to this.
The guiding rule is
- if you just want to hold onto a string for a while, use a binary
- if you want to build or process a string, use a list (possibly in
Erlang a deep list).
- if you want to represent something that has structure, and you want
your program to be aware of that structure, turn it into a
structured
data value and work with it in that form.
Basically, most languages get strings embarassingly wrong. Another
story
I like to tell is how I was able to do in 10 pages of C what a friend of
mine needed 150 pages of PL/I to accomplish, largely because PL/I *does*
have a "real" string data type and C doesn't, so I could accomplish what
I needed very simply, while he had to fight the language every step
of the
way.
More information about the erlang-questions
mailing list