Strings (was: Re: are Mnesia tables immutable?)
Romain Lenglet
rlenglet@REDACTED
Wed Jun 28 06:00:45 CEST 2006
Richard A. O'Keefe wrote:
> Thomas Lindgren <thomasl_erlang@REDACTED> wrote:
> Assuming one would want to implement a character
> datatype for Erlang, do you think Unicode is the way
> to go? Or something else (e.g., settle for UTF-8)? Or
> some sort of framework a la Common Lisp? Or wait and
> see a bit longer?
>
> UTF-8 is a fine thing, but it is not a *character* data type.
> UTF-8 is a way of encoding *strings*.
[deleted]
Thanks for your explanations.
I still see an interest in an Erlang data type for strings, or in
a way to "tag" an integer list as a string, when considering the
Erlang external encoding format (cf. term_to_binary/1).
In Erlang, strings are represented as flat lists of integers <=
255. When transfering strings or any kinds of lists, nobody
cares about encoding, and uses term_to_binary/1 (well, the
emulator's external encoding functions). term_to_binary/1
is "smart" when encoding lists: it visits the list, and if it
detects that the list is flat and all elements are integers are
<= 255, then it encodes it as a string (a special
representation!), otherwise as a list of terms.
The string external format is compact:
byte 0: STRING_EXT tag
bytes 1-2: number of bytes (16 bits)
bytes 3-: the bytes that form the string
I.e., a string of N ASCII characters is encoded into N+3 bytes.
However, the list format is much less efficient when transfering
a flat list of integers:
byte 0: LIST_EXT tag
bytes 1-4: number of terms (32 bits), not counting the tail
bytes 5-: the terms
last bytes: the tail, which is [] when the list is flat (encoded
as one byte: the NIL_EXT tag).
Integer terms are encoded into different forms depending on their
value. Since Unicode considers encoding every character into up
to 32 bits, not more, let's consider the different encoding
formats available for 32-bit integers:
(1) 0<=x<=255:
byte 0: SMALL_INTEGER_EXT tag
byte 1: x (8 bits)
(2) -134217728<=x<=134217727
byte 0: INTEGER_EXT tag
bytes 1-4: x (32 bits)
(3) all other cases:
byte 0: SMALL_BIG_EXT
byte 1: length (== 4)
byte 2: sign (8 bits!)
bytes 3-6: abs(x) (32 bits)
There is also the LARGE_BIG_EXT encoding format, for larger
integers, but we don't need to consider this for 32-bit values.
Let's go back to flat lists of integers. If a flat list of
integers contains values < 0 or > 255, then the total encoding
size is:
6 + (2 to 7 bytes for every integer)
Then, in the worst case, a list of N integers can be encoded into
6+N*7 bytes!!!
Then, we can consider how to represent strings in Erlang:
either (1) we represent a string as a list of integers, where
every integer is the Unicode character as a 32-bit value,
or (2) we represent a string as a list of integers <=255, which
is the UTF-8 encoding (or just any encoding) of the string.
The problem with (1), is that we end up with a very inefficient
external encoding of strings (as a list, see above). The real
problem, is that we have no way to specify that a list is a
string, and that it should be encoded, e.g. into UTF-8, when
externally encoded e.g. by term_to_binary/1.
In case (1), to have a more efficient representation of strings,
encoding/decoding must be done by programmers explicitly, e.g.
by doing:
Pid ! to_utf8(String)
instead of directly:
Pid ! String
But who does that? I have seen nobody caring about encoding of
strings in Erlang code.
At least, in Java Strings are distinct data type, and the APIs
force everybody to consider the encoding/decoding of strings. It
is not transparent, and it is a Good Thing (when considering
external representation).
Case (2) introduces a lot of problems: the length of a string may
no more be the number of characters, and a character may be
encoded as several terms in the list. This would surely break a
lot of existing code, and makes strings handling awkward.
IMHO the best solution is to have something like Java: represent
internally every character as one term (solution (1)), but we
should have a way to "tag" a list to specify that it is a
string, and therefore should be encoded appropriately. The
heuristic in the emulator, which considers a list as a string
iff it is flat and contains only integers <=255, is not
appropriate to handle Unicode.
In addition, the external representation of strings should
transmit for every string, for instance in 1 additional header
byte, the encoding format of the string. E.g., well-known
constants for that byte could be:
0 = "ASCII"
1 = "UTF-8"
etc.
--
Romain Lenglet
More information about the erlang-questions
mailing list