Strings (was: Re: are Mnesia tables immutable?)

Wed Jun 28 11:21:33 CEST 2006

Romain Lenglet wrote:
> Richard A. O'Keefe wrote:
[deleted the description of your proposed encoding]
> > The simplest change to the Erlang external representation
> > would be to make the STRING representation apply to any list
> > of integers 0..2097151 and to make it use variable byte
> > encoding for the values, which is never worse than UTF-8 and
> > very often better.
>
> The most efficient is still most often to use an official
> 8-bit encoding for strings. E.g. for Thai, TIS-620 is the most
> efficient, for Japanese, ISO-2022 (or others) is the most
> efficient, etc.
>
> But how to know which 8-bit encoding to use? We can't
> automatically determine the 8-bit encoding to use, since (I
> guess) this would be too costly. But the application usually
> knows, or *should* know which encoding to use (e.g. using the
> NLS configuration, etc.).
[...]
> I was not thinking about adding a new type, but rather new
> conventions. After all, the concept of "string" is only a
> matter of conventions in Erlang!
> For instance, I propose to represent a string as a 'string'
> record:
>
> {string, 'utf-8', [$a, $b]}
>
> The second element would be the "preferred encoding" of the
> string, and the third element the flat list of Unicode code
> points.

I have been thinking about another solution, completely 
compatible with existing code:

Let's define a few new BIFs: 
set_default_string_external_encoding/1, 
get_default_string_external_encoding/0 and 
get_available_string_external_encodings/0.
These would set and get the default string encoding for the 
whole Erlang node. The default must be chosen from the list 
returned by get_available_string_external_encodings/0, e.g. 
['utf-8', 'iso-8859-1', 'tis-620', ...].

That default encoding could also be specified on erl's command 
line, or determined from the environment (e.g. environment 
variable LC_CTYPE on Unixes).
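
To make the environment-based variant concrete, here is a rough
sketch in Python (not Erlang, just to model the lookup). The list
AVAILABLE, the function name and the 'iso-8859-1' fallback are all
assumptions of mine for illustration; only the usual POSIX
precedence LC_ALL > LC_CTYPE > LANG is standard.

```python
import os

# Hypothetical stand-in for get_available_string_external_encodings/0.
AVAILABLE = ["utf-8", "iso-8859-1", "tis-620"]


def default_encoding_from_env(env=None, fallback="iso-8859-1"):
    """Pick a node-wide default encoding from the locale variables.

    The fallback value is an assumption, not part of the proposal.
    """
    env = os.environ if env is None else env
    # POSIX precedence: the first non-empty variable decides.
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var, "")
        if not value:
            continue
        # Locale names look like "th_TH.TIS-620" or "de_DE.UTF-8@euro".
        if "." in value:
            codeset = value.split(".", 1)[1].split("@", 1)[0].lower()
            if codeset in AVAILABLE:
                return codeset
        # No codeset, or one we do not support: use the fallback.
        return fallback
    return fallback
```

With LC_CTYPE=th_TH.TIS-620 this would select 'tis-620'; a bare
"C" locale ends up on the fallback.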

As a variant, we could have a per-process default string 
encoding. (???)

When encoding a list, e.g. when calling term_to_binary/1, the 
emulator would visit the elements, as it does now, to test 
whether the list is flat, whether all elements are integers >= 0, 
and whether they are all <= 255.
In addition, it would check whether the integers can be encoded 
using that default encoding (with 'utf-8' this is always 
possible, but with an 8-bit encoding such as 'tis-620' it is not).
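
The extra check could look roughly like this, modelled in Python
with its built-in codecs standing in for whatever tables the
emulator would use. The function name is hypothetical. Note that
Python's 'utf-8' codec rejects lone surrogates and values above
16#10FFFF, so "always" holds for valid code points only.

```python
def encodable(code_points, encoding):
    """Return True if every integer can be written in `encoding`."""
    try:
        text = "".join(chr(cp) for cp in code_points)
        text.encode(encoding)
        return True
    except (ValueError, OverflowError):
        # chr() rejects integers outside 0..0x10FFFF; encode() raises
        # UnicodeEncodeError (a ValueError) for unmappable characters.
        return False
```

For example, the Thai code points 16#0E01 and 16#0E02 pass the
check for 'tis-620', while the hiragana 16#3042 does not.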

We could then distinguish 4 cases:

(1) all elements are integers >= 0 and <= 255:
we encode using the currently available STRING_EXT format.

else
(2) all elements are integers >= 0 and that can be encoded using 
the default encoding:
we encode the string in a new STRING_ENC_EXT format:
byte 0: STRING_ENC_EXT tag
byte 1: identifier for the encoding
bytes 2-3: number of bytes (16 bits)
bytes 4-: string encoded using the default encoding

else
(3) all elements are integers:
the list is encoded using your variable-length encoding for every 
integer (or why not ASN.1's BER or PER encoding?), in a new 
INTEGER_LIST_EXT format:
byte 0: INTEGER_LIST_EXT tag
bytes 1-2: number of bytes (16 bits)
bytes 3-: encoded integers

else (the list is not flat or contains non-integers)
(4) the list is encoded using the currently available LIST_EXT 
format.
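
The four cases above can be sketched as follows, again in Python
as a model of the byte layouts. Only the STRING_EXT tag (107) is
the real one; the two new tag values and the encoding identifiers
are made up for the example. The variable byte scheme is a common
7-bits-per-byte one and may differ from the one proposed earlier
in this thread. I also assume here that negative integers, the
empty list and bodies longer than 65535 bytes stay with the
existing encoder, shown as case 4 with no bytes.

```python
import struct

STRING_EXT = 107         # existing tag
STRING_ENC_EXT = 200     # hypothetical tag value
INTEGER_LIST_EXT = 201   # hypothetical tag value
ENCODING_IDS = {"utf-8": 1, "iso-8859-1": 2, "tis-620": 3}  # hypothetical


def vbyte(n):
    """7 data bits per byte; high bit set on every byte but the last."""
    out = [n & 0x7F]
    n >>= 7
    while n:
        out.append(0x80 | (n & 0x7F))
        n >>= 7
    return bytes(reversed(out))


def encode_list(items, default_encoding="utf-8"):
    """Return (case, bytes); bytes is None when case 4 applies."""
    if not items or not all(type(x) is int and x >= 0 for x in items):
        return 4, None  # left to the existing encoder
    candidates = []
    if all(x <= 255 for x in items):
        # (1) unchanged STRING_EXT
        candidates.append((1, bytes([STRING_EXT]), bytes(items)))
    else:
        try:
            # (2) default encoding, if every element is representable
            body = "".join(map(chr, items)).encode(default_encoding)
            prefix = bytes([STRING_ENC_EXT, ENCODING_IDS[default_encoding]])
            candidates.append((2, prefix, body))
        except (ValueError, OverflowError):
            pass
        # (3) variable byte encoding of each integer
        body = b"".join(map(vbyte, items))
        candidates.append((3, bytes([INTEGER_LIST_EXT]), body))
    for case, prefix, body in candidates:
        if len(body) <= 0xFFFF:  # length field is 16 bits
            return case, prefix + struct.pack(">H", len(body)) + body
    return 4, None
```

So "ab" still comes out as the familiar STRING_EXT bytes, a Thai
string goes to (2) when the default is 'tis-620', and a Japanese
string under that same default falls through to (3).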

For existing code that uses only ASCII strings, the external 
encoding of strings is unchanged: (1) is always used.

For programs that combine strings that contain ASCII English 
text and strings in *one* other language, strings are encoded 
using either (1) or (2), which are optimal external encodings 
(if the default encoding is chosen carefully).

For other cases, i.e. programs that manipulate strings in several 
non-English languages, or strings that mix several languages, 
strings can be encoded using (1), (2) or (3), which is a 
"not so bad" compromise.
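
To put rough numbers on "optimal" and "not so bad", the following
Python sketch counts payload bytes (headers excluded) for a few
strings. The helper names are mine, and the variable byte size
assumes 7 data bits per byte.

```python
def vbyte_len(n):
    """Bytes needed for n at 7 data bits per byte (at least one)."""
    return max(1, (n.bit_length() + 6) // 7)


def payload_sizes(text):
    """Payload size per encoding; None where the text is not encodable."""
    sizes = {
        "vbyte": sum(vbyte_len(ord(c)) for c in text),
        "utf-8": len(text.encode("utf-8")),
    }
    for enc in ("tis-620", "iso-8859-1"):
        try:
            sizes[enc] = len(text.encode(enc))
        except UnicodeEncodeError:
            sizes[enc] = None
    return sizes


thai = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35"   # six Thai code points
mixed = "\u0e2a\u3042"                          # Thai + Japanese
print(payload_sizes(thai))
print(payload_sizes(mixed))
```

For the six-character Thai string this gives 6 bytes in TIS-620,
12 with variable byte encoding and 18 in UTF-8; the mixed string
cannot be written in TIS-620 at all and takes 4 bytes with
variable byte encoding against 6 in UTF-8.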

-- 
Romain LENGLET