Strings (was: Re: are Mnesia tables immutable?)
Romain Lenglet
rlenglet@REDACTED
Wed Jun 28 11:21:33 CEST 2006
Romain Lenglet wrote:
> Richard A. O'Keefe wrote:
[deleted the description of your proposed encoding]
> > The simplest change to the Erlang external representation
> > would be to make the STRING representation apply to any list
> > of integers 0..2097151 and to make it use variable byte
> > encoding for the values, which is never worse than UTF-8 and
> > very often better.
>
> The most efficient is still most often to use an official
> 8-bit encoding for strings. E.g. for Thai, TIS-620 is the most
> efficient, for Japanese, ISO-2022 (or others) is the most
> efficient, etc.
>
> But how to know which 8-bit encoding to use? We can't
> automatically determine the 8-bit encoding to use, since (I
> guess) this would be too costly. But the application usually
> knows, or *should* know which encoding to use (e.g. using the
> NLS configuration, etc.).
[...]
> I was not thinking about adding a new type, but rather new
> conventions. After all, the concept of "string" is only a
> matter of conventions in Erlang!
> For instance, I propose to represent a string as a 'string'
> record:
>
> {string, 'utf-8', [$a, $b]}
>
> The second element would be the "preferred encoding" of the
> string, and the third element the flat list of Unicode code
> points.
I have been thinking about another solution, completely
compatible with existing code:
Let's define a few new BIFs:
set_default_string_external_encoding/1,
get_default_string_external_encoding/0 and
get_available_string_external_encodings/0.
These would set and query the default string encoding for the
whole Erlang node. The default must be chosen from the list
returned by get_available_string_external_encodings/0, e.g.
['utf-8', 'iso-8859-1', 'tis-620', ...].
That default encoding could also be specified on erl's command
line, or determined from the environment (e.g. environment
variable LC_CTYPE on Unixes).
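
As a rough illustration of deriving the default from the environment, here is a small sketch in Python (chosen only because it is easy to run). The function name, the LC_ALL / LC_CTYPE / LANG precedence and the 'iso-8859-1' fallback are assumptions made for the example, not part of the proposal.

```python
import codecs


def default_encoding_from_env(env, fallback="iso-8859-1"):
    """Guess a node-wide default string encoding from locale variables.

    `env` is a mapping such as os.environ. The fallback value is an
    assumption made for this sketch.
    """
    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            break
    else:
        return fallback
    # Locale names look like language[_territory][.codeset][@modifier].
    if "." not in value:
        return fallback
    codeset = value.split(".", 1)[1].split("@", 1)[0].lower()
    try:
        codecs.lookup(codeset)  # only accept encodings this runtime knows
    except LookupError:
        return fallback
    return codeset
```

For example, calling it with {"LC_CTYPE": "th_TH.TIS-620"} would select 'tis-620', while a locale without a codeset part (such as "C") falls back to the assumed default.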
As a variant, we could have a per-process default string
encoding. (???)
When encoding a list, e.g. when calling term_to_binary/1, the
emulator would visit the elements, as it does now, to test
whether the list is flat and whether all elements are integers
between 0 and 255.
In addition, it would check whether the integers can be encoded
using the default encoding (with 'utf-8' this is always
possible, but with the 8-bit encodings it is not).
We could then distinguish 4 cases:
(1) all elements are integers >= 0 and <= 255:
    we encode using the currently available STRING_EXT format.
else
(2) all elements are integers >= 0 that can be encoded using
    the default encoding:
    we encode the string in a new STRING_ENC_EXT format:
      byte 0: STRING_ENC_EXT tag
      byte 1: identifier for the encoding
      bytes 2-3: number of bytes (16 bits)
      bytes 4-: string encoded using the default encoding
else
(3) all elements are integers:
    the list is encoded using your variable-length encoding for
    every integer (or why not ASN.1's BER or PER encoding?), in a
    new INTEGER_LIST_EXT format:
      byte 0: INTEGER_LIST_EXT tag
      bytes 1-2: number of bytes (16 bits)
      bytes 3-: encoded integers
else
(4) the list is not flat or contains non-integers:
    the list is encoded using the currently available LIST_EXT
    format.
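
The case analysis above can be sketched as follows, in Python for ease of running. This is only an illustration under stated assumptions: 107 and 108 are the existing STRING_EXT and LIST_EXT tags, but the tag values for STRING_ENC_EXT and INTEGER_LIST_EXT, the encoding identifiers and the big-endian length field are invented for the example. The empty list (NIL_EXT in the real format) is not treated specially.

```python
import struct

# 107 and 108 are the real STRING_EXT and LIST_EXT tags;
# the two other tag values are invented for this example.
STRING_EXT = 107
LIST_EXT = 108
STRING_ENC_EXT = 200     # hypothetical tag
INTEGER_LIST_EXT = 201   # hypothetical tag

# Hypothetical one-byte identifiers for the available encodings.
ENCODING_IDS = {"utf-8": 1, "iso-8859-1": 2, "tis-620": 3}


def _is_int(x):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(x, int) and not isinstance(x, bool)


def encodable(codepoints, encoding):
    """True if every code point can be written in `encoding`."""
    if not all(0 <= c <= 0x10FFFF for c in codepoints):
        return False
    try:
        "".join(map(chr, codepoints)).encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def classify(term, default_encoding):
    """Return the tag of the format that would be used for `term`."""
    if not all(_is_int(x) for x in term):
        return LIST_EXT            # case (4): nested or non-integer
    if all(0 <= x <= 255 for x in term):
        return STRING_EXT          # case (1): unchanged behaviour
    if encodable(term, default_encoding):
        return STRING_ENC_EXT      # case (2): default encoding fits
    return INTEGER_LIST_EXT        # case (3): any other integer list


def encode_string_enc_ext(codepoints, encoding):
    """Build the case (2) layout: tag, encoding id, 16-bit size, bytes."""
    payload = "".join(map(chr, codepoints)).encode(encoding)
    if len(payload) > 0xFFFF:
        raise ValueError("payload does not fit the 16-bit length field")
    header = struct.pack(">BBH", STRING_ENC_EXT,
                         ENCODING_IDS[encoding], len(payload))
    return header + payload
```

With 'tis-620' as the default, a list of Thai code points would fall under case (2), while a list containing a Japanese code point would fall through to case (3).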
For existing code that uses only ASCII strings, the external
encoding of strings is unchanged: (1) is always used.
For programs that combine strings containing ASCII English
text with strings in *one* other language, strings are encoded
using either (1) or (2), which are optimal external encodings
(if the default encoding is chosen carefully).
In the other cases, i.e. programs that manipulate strings in
several non-English languages, or strings that mix several
languages, strings are encoded using (1), (2) or (3), which
is a "not so bad" compromise.
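
To give a feeling for the sizes involved, here is a small Python sketch that compares payload sizes (headers excluded). The variable-length scheme discussed earlier in the thread is not reproduced in this message, so the sketch substitutes a generic 7-bits-per-byte encoding as a stand-in; the function names are made up for the example.

```python
def varbyte(n):
    """Generic 7-bits-per-byte encoding of one non-negative integer.

    A stand-in for the variable-length scheme discussed in the thread,
    not necessarily the same byte layout.
    """
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low)
            return bytes(out)


def payload_sizes(text, default_encoding):
    """Payload size in bytes of `text` under each candidate encoding."""
    sizes = {
        "utf-8": len(text.encode("utf-8")),
        "varbyte": sum(len(varbyte(ord(c))) for c in text),
    }
    try:
        sizes[default_encoding] = len(text.encode(default_encoding))
    except UnicodeEncodeError:
        pass  # case (2) does not apply to this string
    return sizes
```

For three Thai letters with 'tis-620' as the default, this gives 3 bytes for TIS-620, 6 bytes for the stand-in variable-length encoding and 9 bytes for UTF-8. A string mixing Thai and Japanese cannot use case (2) at all, and the variable-length encoding is then the smaller of the two remaining options.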
--
Romain LENGLET