Strings (was: Re: are Mnesia tables immutable?)

Wed Jun 28 06:00:45 CEST 2006

Richard A. O'Keefe wrote:
> Thomas Lindgren <thomasl_erlang@REDACTED> wrote:
> 	Assuming one would want to implement a character
> 	datatype for Erlang, do you think Unicode is the way
> 	to go? Or something else (e.g., settle for UTF-8)? Or
> 	some sort of framework a la Common Lisp? Or wait and
> 	see a bit longer?
>
> UTF-8 is a fine thing, but it is not a *character* data type.
> UTF-8 is a way of encoding *strings*.
[deleted]

Thanks for your explanations.
I still see an interest in an Erlang data type for strings, or in 
a way to "tag" an integer list as a string, when considering the 
Erlang external encoding format (cf. term_to_binary/1).

In Erlang, strings are represented as flat lists of integers <= 
255. When transfering strings or any kinds of lists, nobody 
cares about encoding, and uses term_to_binary/1 (well, the 
emulator's external encoding functions). term_to_binary/1 
is "smart" when encoding lists: it visits the list, and if it 
detects that the list is flat and all elements are integers are 
<= 255, then it encodes it as a string (a special 
representation!), otherwise as a list of terms.

The string external format is compact:
byte 0: STRING_EXT tag
bytes 1-2: number of bytes (16 bits)
bytes 3-: the bytes that form the string

I.e., a string of N ASCII characters is encoded into N+3 bytes.

However, the list format is much less efficient when transfering 
a flat list of integers:
byte 0: LIST_EXT tag
bytes 1-4: number of terms (32 bits), not counting the tail
bytes 5-: the terms
last bytes: the tail, which is [] when the list is flat (encoded 
as one byte: the NIL_EXT tag).

Integer terms are encoded into different forms depending on their 
value. Since Unicode considers encoding every character into up 
to 32 bits, not more, let's consider the different encoding 
formats available for 32-bit integers:
(1) 0<=x<=255:
byte 0: SMALL_INTEGER_EXT tag
byte 1: x (8 bits)
(2) -134217728<=x<=134217727
byte 0: INTEGER_EXT tag
bytes 1-4: x (32 bits)
(3) all other cases:
byte 0: SMALL_BIG_EXT
byte 1: length (== 4)
byte 2: sign (8 bits!)
bytes 3-6: abs(x) (32 bits)

There is also the LARGE_BIG_EXT encoding format, for larger 
integers, but we don't need to consider this for 32-bit values.

Let's go back to flat lists of integers. If a flat list of 
integers contains values < 0 or > 255, then the total encoding 
size is:
6 + (2 to 7 bytes for every integer)
Then, in the worst case, a list of N integers can be encoded into 
6+N*7 bytes!!!

Then, we can consider how to represent strings in Erlang:
either (1) we represent a string as a list of integers, where 
every integer is the Unicode character as a 32-bit value,
or (2) we represent a string as a list of integers <=255, which 
is the UTF-8 encoding (or just any encoding) of the string.

The problem with (1), is that we end up with a very inefficient 
external encoding of strings (as a list, see above). The real 
problem, is that we have no way to specify that a list is a 
string, and that it should be encoded, e.g. into UTF-8, when 
externally encoded e.g. by term_to_binary/1.
In case (1), to have a more efficient representation of strings, 
encoding/decoding must be done by programmers explicitly, e.g. 
by doing:
Pid ! to_utf8(String)
instead of directly:
Pid ! String

But who does that? I have seen nobody caring about encoding of 
strings in Erlang code.
At least, in Java Strings are distinct data type, and the APIs 
force everybody to consider the encoding/decoding of strings. It 
is not transparent, and it is a Good Thing (when considering 
external representation).

Case (2) introduces a lot of problems: the length of a string may 
no more be the number of characters, and a character may be 
encoded as several terms in the list. This would surely break a 
lot of existing code, and makes strings handling awkward.

IMHO the best solution is to have something like Java: represent 
internally every character as one term (solution (1)), but we 
should have a way to "tag" a list to specify that it is a 
string, and therefore should be encoded appropriately. The 
heuristic in the emulator, which considers a list as a string 
iff it is flat and contains only integers <=255, is not 
appropriate to handle Unicode.
In addition, the external representation of strings should 
transmit for every string, for instance in 1 additional header 
byte, the encoding format of the string. E.g., well-known 
constants for that byte could be:
0 = "ASCII"
1 = "UTF-8"
etc.

-- 
Romain Lenglet