Strings (was: Re: are Mnesia tables immutable?)
Romain Lenglet
rlenglet@REDACTED
Wed Jun 28 08:47:57 CEST 2006
Richard A. O'Keefe wrote:
> Romain Lenglet <rlenglet@REDACTED> raises the
> issue of the external format.
>
> Suppose a string were a list of Unicode code points expressed
> as integers in the range 0..16r10FFFF. (Remember, Unicode
> simply cannot express any code outside that range.) How
> inefficient would that be?
>
> Romain Lenglet explored the question "how inefficient would
> that be WITH THE PRESENT EXTERNAL REPRESENTATION."
>
> Suppose we had a different external representation.
> Suppose we had
>     <tuple N> e1' ... eN'     => {e1, ..., eN}
>     <list* N> e1' ... eN' eT' => [e1, ..., eN | eT]
>     <list N>  e1' ... eN'     => [e1, ..., eN]
>     <nats N>  e1" ... eN"     => [e1, ..., eN] where all ei naturals
>     <atom N>  c1" ... cN"     => 'c1...cN'
>     other stuff
> using variable byte encoding for things known to be
> non-negative integers, and a modified variable byte encoding
> for 3-bit-tag+length values.
>
>     Length          ASCII        ..U+3FFF     ALL Unicode
>     0..15           N+1 bytes    2N+1 bytes   3N+1 bytes
>     16..2063        N+2 bytes    2N+2 bytes   3N+2 bytes
>     2064..264207    N+3 bytes    2N+3 bytes   3N+3 bytes
>
> The simplest change to the Erlang external representation
> would be to make the STRING representation apply to any list
> of integers 0..2097151 and to make it use variable byte
> encoding for the values, which is never worse than UTF-8 and
> very often better.
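To make the size arithmetic quoted above concrete, here is a minimal sketch. It assumes plain variable byte encoding (7 payload bits per byte, low-order group first) for the code points, and it simply reads the header sizes off the table, since the exact layout of the 3-bit-tag+length header is not spelled out. The module and function names are made up for illustration.

```erlang
-module(vbyte_sketch).
-export([encode_value/1, header_size/1, string_size/1]).

%% Plain variable byte encoding of a non-negative integer:
%% 7 payload bits per byte, least significant group first,
%% high bit set on every byte except the last.
encode_value(N) when is_integer(N), N >= 0, N < 128 ->
    <<N>>;
encode_value(N) when is_integer(N), N >= 128 ->
    Rest = encode_value(N bsr 7),
    <<(128 bor (N band 127)), Rest/binary>>.

%% Size in bytes of the tag+length header. The ranges are copied
%% from the quoted table (assumption: nothing is defined beyond it,
%% so longer lists are deliberately not handled here).
header_size(Len) when is_integer(Len), Len >= 0, Len =< 15 -> 1;
header_size(Len) when is_integer(Len), Len >= 16, Len =< 2063 -> 2;
header_size(Len) when is_integer(Len), Len >= 2064, Len =< 264207 -> 3.

%% Total encoded size of a flat list of code points.
string_size(Chars) when is_list(Chars) ->
    header_size(length(Chars)) +
        lists:sum([byte_size(encode_value(C)) || C <- Chars]).
```

With this sketch, vbyte_sketch:string_size("abc") is 4 (N+1 for ASCII), and a list of 20 Thai code points such as U+0E01 comes to 42 bytes (2N+2).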
The most efficient choice is still usually a standard,
language-specific 8-bit encoding. For Thai, for example, TIS-620
is the most compact; for Japanese, ISO-2022-JP or one of the other
JIS-based encodings (strictly speaking multi-byte rather than
8-bit) is.
But how do we know which 8-bit encoding to use? We cannot
determine it automatically, since (I guess) that would be too
costly. The application, however, usually knows, or *should*
know, which encoding to use (e.g. from the NLS configuration).
I think that we should find a way to identify lists of integers
as strings, i.e. as lists of Unicode code points, and
to "attach" a "preferred encoding" to such lists.
Of course, the problem is that to be usable and efficient, a lot
of 8-bit encodings have to be known and implemented in the
emulator (and erl_interface), which may increase the code size
significantly. But those encoding/decoding primitives could also
be made usable directly by programs, which would be very useful
in general.
> > IMHO the best solution is to have something like Java:
> > represent internally every character as one term (solution
> > (1)), but we should have a way to "tag" a list to specify that
> > it is a string, and therefore should be encoded appropriately.
>
> You are forgetting that Java is a statically typed language
> and Erlang is not. Adding more types to something like Erlang
> makes *everything* slower.
I was not thinking about adding a new type, but rather new
conventions. After all, the concept of "string" is only a matter
of conventions in Erlang!
For instance, I propose representing a string as a 'string'
record:
    {string, 'utf-8', [$a, $b]}
The second element would be the "preferred encoding" of the
string, and the third element the flat list of Unicode code
points.
Two forms would then be accepted for strings:
(1) as a flat list of Unicode code points (this is the current
form, kept for backward compatibility);
(2) as the tuple described just above.
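As a rough sketch of how an encoding primitive could accept both forms, the following assumes the unicode module of current OTP releases and handles only two encoding names. The module name, the to_binary/1 function, the 'iso-8859-1' atom and the UTF-8 fallback for bare lists are all hypothetical choices made for illustration, not part of the proposal itself.

```erlang
-module(tagged_string_sketch).
-export([to_binary/1]).

%% Form (2): {string, PreferredEncoding, CodePoints}.
to_binary({string, Encoding, Chars}) when is_list(Chars) ->
    encode(Chars, Encoding);
%% Form (1): a bare flat list of code points. No preferred encoding
%% is attached, so this sketch falls back to UTF-8 (an assumption).
to_binary(Chars) when is_list(Chars) ->
    encode(Chars, 'utf-8').

%% Map the hypothetical encoding atoms onto the names understood by
%% unicode:characters_to_binary/3. Any other encoding (TIS-620, the
%% JIS family, ...) would need its own conversion tables.
encode(Chars, 'utf-8') ->
    unicode:characters_to_binary(Chars, unicode, utf8);
encode(Chars, 'iso-8859-1') ->
    unicode:characters_to_binary(Chars, unicode, latin1).
```

Here to_binary({string, 'utf-8', [$a, $b]}) yields <<"ab">>, while the code point 233 becomes the single byte <<233>> when tagged 'iso-8859-1' and the two bytes <<195,169>> when passed as a bare list.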
> I note that NU Prolog had two internal representations for
> lists. Literal strings were stored as packed arrays of bytes,
> while other lists were stored as linked nets of pairs, and the
> NU Prolog emulator automagically converted strings to pairs as
> and when needed.
>
> The programmer who implemented that stuff (Jeff Schultz) once
> told me that he regretted it; that it had never really paid
> off.
Personally, I am voting for (1) representing strings as lists of
Unicode code points, but (2) providing a better (more flexible,
more efficient) external representation, and most importantly
(3) providing a more flexible interface to the external
encoding/decoding primitives, such as supporting strings as
tuples as above.
--
Romain LENGLET