Strings (was: Re: are Mnesia tables immutable?)

Wed Jun 28 08:47:57 CEST 2006

Richard A. O'Keefe wrote:
> Romain Lenglet <rlenglet@REDACTED> raises the
> issue of the external format.
>
> Suppose a string were a list of Unicode code points expressed
> as integers in the range 0..16r10FFFF.  (Remember, Unicode
> simply cannot express any code outside that range.)  How
> inefficient would that be?
>
> Romain Lenglet explored the question "how inefficient would
> that be WITH THE PRESENT EXTERNAL REPRESENTATION."
>
> Suppose we had a different external representation.
> Suppose we had
>     <tuple N>  e1' ... eN'		=> {e1, ..., eN}
>     <list* N>  e1' ... eN' eT'		=> [e1, ..., eN | eT]
>     <list N>   e1' ... eN'		=> [e1, ..., eN]
>     <nats N>   e1" ... eN"		=> [e1, ..., eN] where all ei naturals
>     <atom N>   c1" ... cN"		=> 'c1...cN'
>     other stuff
> using variable byte encoding for things known to be
> non-negative integers, and a modified variable byte encoding
> for 3-bit-tag+length values.
>
>     Length		ASCII		..U+3FFF	ALL Unicode
>     0..15		N+1 bytes	2N+1 bytes	3N+1 bytes
>     16..2063		N+2 bytes	2N+2 bytes	3N+2 bytes
>     2064..264207	N+3 bytes	2N+3 bytes	3N+3 bytes
>
> The simplest change to the Erlang external representation
> would be to make the STRING representation apply to any list
> of integers 0..2097151 and to make it use variable byte
> encoding for the values, which is never worse than UTF-8 and
> very often better.
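
To make the size figures in the table above concrete, here is a minimal
sketch of one possible variable byte encoding for code points. The quoted
text does not fix the details, so the module name vbyte, the 7-bits-per-byte
layout, the low-group-first byte order and the use of the high bit as a
"more bytes follow" flag are all assumptions of this sketch, not the
proposed format itself.

```erlang
-module(vbyte).
-export([encode/1, decode/1, encode_list/1]).

%% Encode one non-negative integer, 7 bits per byte, least significant
%% group first. The high bit is set on every byte except the last one.
%% (Assumed layout; the original proposal does not specify one.)
encode(N) when is_integer(N), N >= 0, N < 128 ->
    <<N>>;
encode(N) when is_integer(N), N >= 128 ->
    Rest = encode(N bsr 7),
    <<1:1, (N band 127):7, Rest/binary>>.

%% Decode one integer from the front of a binary.
%% Returns {Integer, RemainingBinary}.
decode(Bin) when is_binary(Bin) ->
    decode(Bin, 0, 0).

decode(<<0:1, Low:7, Rest/binary>>, Shift, Acc) ->
    {Acc bor (Low bsl Shift), Rest};
decode(<<1:1, Low:7, Rest/binary>>, Shift, Acc) ->
    decode(Rest, Shift + 7, Acc bor (Low bsl Shift)).

%% Encode a flat list of code points, without any tag or length prefix.
encode_list(CodePoints) when is_list(CodePoints) ->
    << <<(encode(C))/binary>> || C <- CodePoints >>.
```

With this layout a code point below 128 takes one byte, one up to 16r3FFF
takes two, and anything up to 16r1FFFFF takes three, which corresponds to
the N, 2N and 3N terms in the table (the tag+length prefix is left out).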

The most efficient choice is still, in most cases, an official 8-bit 
encoding for strings. E.g. for Thai, TIS-620 is the most 
efficient; for Japanese, ISO-2022 (or others) is the most 
efficient; etc.

But how do we know which 8-bit encoding to use? We can't 
determine it automatically, since (I guess) this would be too 
costly. But the application usually knows, or *should* know, 
which encoding to use (e.g. from the NLS configuration).

I think that we should find a way to identify lists of integers 
as strings, i.e. as lists of Unicode code points, and 
to "attach" a "preferred encoding" to such lists.

Of course, the problem is that to be usable and efficient, a lot 
of 8-bit encodings have to be known and implemented in the 
emulator (and erl_interface), which may increase the code size 
significantly. But those encoding/decoding primitives could also 
be made usable directly by programs, which would be very useful 
in general.
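
As an illustration of what such a primitive could look like if it were
callable from programs, here is a minimal sketch for a single encoding,
TIS-620. The module name enc_sketch, the function encode/2 and the
encoding atom 'tis-620' are hypothetical names chosen for this example;
nothing like this exists in the emulator or in erl_interface today.

```erlang
-module(enc_sketch).
-export([encode/2]).

%% Hypothetical primitive: encode a flat list of Unicode code points
%% into a binary, using the named 8-bit encoding.
encode('tis-620', CodePoints) when is_list(CodePoints) ->
    list_to_binary([tis620_byte(C) || C <- CodePoints]);
encode(Encoding, _CodePoints) ->
    erlang:error({unsupported_encoding, Encoding}).

%% TIS-620 keeps ASCII as is and places the Thai block
%% (U+0E01..U+0E3A and U+0E3F..U+0E5B) at 16#A1..16#DA and 16#DF..16#FB.
tis620_byte(C) when C >= 0, C < 128 ->
    C;
tis620_byte(C) when C >= 16#0E01, C =< 16#0E3A ->
    C - 16#0E00 + 16#A0;
tis620_byte(C) when C >= 16#0E3F, C =< 16#0E5B ->
    C - 16#0E00 + 16#A0;
tis620_byte(C) ->
    erlang:error({unencodable, C}).
```

Each Thai character costs one byte here, e.g.
enc_sketch:encode('tis-620', [16#0E01]) gives <<16#A1>>. A real
implementation would need one such table per supported encoding, which is
where the code size concern comes from.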

> 	IMHO the best solution is to have something like Java:
> represent internally every character as one term (solution
> (1)), but we should have a way to "tag" a list to specify that
> it is a string, and therefore should be encoded appropriately.
>
> You are forgetting that Java is a statically typed language
> and Erlang is not.  Adding more types to something like Erlang
> makes *everything* slower.

I was not thinking about adding a new type, but rather new 
conventions. After all, the concept of "string" is only a matter 
of convention in Erlang!
For instance, I propose to represent a string as a 'string' 
record:

{string, 'utf-8', [$a, $b]}

The second element would be the "preferred encoding" of the 
string, and the third element the flat list of Unicode code 
points.

Two forms would then be accepted for strings:
(1) as a flat list of Unicode code points (this is the current 
form, kept for backward compatibility);
(2) as the tuple described just above.
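
A minimal sketch of how code could accept both forms follows. The module
name tagged_string and its function names are hypothetical, and returning
undefined as the preferred encoding of a plain list is an assumption of
this sketch; the proposal does not say what an untagged list should
default to.

```erlang
-module(tagged_string).
-export([new/2, code_points/1, preferred_encoding/1]).

%% Build the tagged form: {string, PreferredEncoding, CodePoints}.
new(Encoding, CodePoints) when is_atom(Encoding), is_list(CodePoints) ->
    {string, Encoding, CodePoints}.

%% Return the flat list of code points, whichever form was given.
code_points({string, _Encoding, CodePoints}) when is_list(CodePoints) ->
    CodePoints;
code_points(CodePoints) when is_list(CodePoints) ->
    CodePoints.

%% Return the preferred encoding, or undefined for a plain list
%% (assumed default; not specified in the proposal).
preferred_encoding({string, Encoding, _CodePoints}) ->
    Encoding;
preferred_encoding(CodePoints) when is_list(CodePoints) ->
    undefined.
```

Existing code that only handles flat lists keeps working on form (1), and
code that cares about the external encoding can pattern-match on the
tuple.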

> I note that NU Prolog had two internal representations for
> lists. Literal strings were stored as packed arrays of bytes,
> while other lists were stored as linked nets of pairs, and the
> NU Prolog emulator automagically converted strings to pairs as
> and when needed.
>
> The programmer who implemented that stuff (Jeff Schultz) once
> told me that he regretted it; that it had never really paid
> off.

Personally, I am voting for (1) representing strings as lists of 
Unicode code points, but (2) providing a better (more flexible, 
more efficient) external representation, and most importantly 
(3) providing a more flexible interface to the external 
encoding/decoding primitives, such as supporting strings as 
tuples as above.

-- 
Romain LENGLET