Strings (was: Re: are Mnesia tables immutable?)

Richard A. O'Keefe ok@REDACTED
Wed Jun 28 07:40:58 CEST 2006


Romain Lenglet <rlenglet@REDACTED> raises the
issue of the external format.

Suppose a string were a list of Unicode code points expressed as
integers in the range 0..16r10FFFF.  (Remember, Unicode simply cannot
express any code outside that range.)  How inefficient would that be?

Romain Lenglet explored the question "how inefficient would that be
WITH THE PRESENT EXTERNAL REPRESENTATION."

Suppose we had a different external representation, say
    <tuple N>  e1' ... eN'		=> {e1, ..., eN}
    <list* N>  e1' ... eN' eT'		=> [e1, ..., eN | eT]
    <list N>   e1' ... eN'		=> [e1, ..., eN]
    <nats N>   e1" ... eN"		=> [e1, ..., eN] where all ei naturals
    <atom N>   c1" ... cN"		=> 'c1...cN'
    other stuff
using variable byte encoding for things known to be non-negative integers
(the e" and c" items above), and a modified variable byte encoding for
3-bit-tag+length values.  A string of N characters would then take

    Length		ASCII		..U+3FFF	ALL Unicode
    0..15		N+1 bytes	2N+1 bytes	3N+1 bytes
    16..2063		N+2 bytes	2N+2 bytes	3N+2 bytes
    2064..264207	N+3 bytes	2N+3 bytes	3N+3 bytes
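
To make the arithmetic behind that table concrete, here is a rough sketch
in Python (not Erlang, and not a proposal for the actual wire format).
The names vb_encode, header_size and nats_size are made up for
illustration.  Two things are assumptions rather than anything stated
above: the low-order-group-first byte order, and the header layout
(first byte holds the 3-bit tag, a continuation bit and 4 length bits,
each further byte adds 7 bits, and each header size starts counting where
the previous one stopped).  That layout is merely one that reproduces the
ranges 0..15, 16..2063 and 2064..264207.

```python
# Sketch only: byte order and header layout are assumptions inferred
# from the size table, not a specification.

def vb_encode(n):
    """Variable byte encoding of a natural number: 7 data bits per byte,
    low-order group first, high bit set when more bytes follow."""
    if n < 0:
        raise ValueError("naturals only")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)   # more bytes follow
        else:
            out.append(low7)          # last byte
            return bytes(out)

def header_size(length):
    """Bytes needed for the 3-bit tag plus the length (assumed layout)."""
    size, bits = 1, 4                 # first byte carries 4 length bits
    capacity = 1 << bits
    while length >= capacity:
        length -= capacity            # longer headers start where shorter ones stop
        bits += 7                     # each extra byte carries 7 more bits
        capacity = 1 << bits
        size += 1
    return size

def nats_size(code_points):
    """Total bytes for a <nats N> header followed by the N encoded values."""
    return header_size(len(code_points)) + sum(len(vb_encode(c)) for c in code_points)
```

With these definitions nats_size([ord(c) for c in "hello"]) is 6, which
is the N+1 entry in the first row of the table.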

The simplest change to the Erlang external representation would be to
make the STRING representation apply to any list of integers 0..2097151
and to make it use variable byte encoding for the values, which is never
worse than UTF-8 and very often better.
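
The "never worse than UTF-8" claim is easy to check by brute force.  The
following Python sketch (vb_size and utf8_size are illustrative names, and
the 7-bits-per-byte scheme is the usual variable byte encoding, assumed
here) compares the two byte counts for a single code point.

```python
def vb_size(cp):
    """Bytes used by a 7-bits-per-byte variable byte encoding of cp."""
    size = 1
    while cp >= 0x80:
        cp >>= 7
        size += 1
    return size

def utf8_size(cp):
    """Bytes used by UTF-8 for code point cp (0..0x10FFFF)."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4

def never_worse():
    """True if variable byte encoding is at most as long as UTF-8
    for every code point in 0..0x10FFFF."""
    return all(vb_size(cp) <= utf8_size(cp) for cp in range(0x110000))
```

The two agree up to U+07FF and from U+4000 to U+FFFF.  Variable byte
encoding saves a byte per character for U+0800..U+3FFF (for example
U+3042, Hiragana A, takes 2 bytes instead of 3) and for everything above
U+FFFF (3 bytes instead of 4).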

	IMHO the best solution is to have something like Java: represent 
	internally every character as one term (solution (1)), but we 
	should have a way to "tag" a list to specify that it is a 
	string, and therefore should be encoded appropriately.

You are forgetting that Java is a statically typed language and Erlang
is not.  Adding more types to something like Erlang makes *everything*
slower.

I note that NU Prolog had two internal representations for lists.
Literal strings were stored as packed arrays of bytes, while other lists
were stored as linked nets of pairs, and the NU Prolog emulator automagically
converted strings to pairs as and when needed.

The programmer who implemented that stuff (Jeff Schultz) once told me
that he regretted it; that it had never really paid off.



