String representation in Erlang

Wed Sep 14 16:31:56 CEST 2005

Richard A. O'Keefe wrote:
> Thinus Pollard <thinus@REDACTED> wrote:
> 	According to the Erlang efficiency guide a string is internally
> 	represented as a list of integers, thus consuming 2 words
> 	(8 bytes on a 32bit platform) of memory *per* character.
> 	
> Unicode/ISO10646 characters require in general 21 bits.
> NOT 16 bits.  Java claims to support Unicode, but there are already
> quite a lot of Unicode characters which don't fit in 16 bits, so
> many Unicode characters require two Java "chars".  Needless to say,
> this stuffs up Java string indexing something wonderful.

Indexing into UTF-16 works fine (as does indexing into UTF-8). There's
only a problem if you assume that it's necessary to index by code point
instead of by code unit.
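
To make the code-unit point concrete, here is a minimal sketch in Python (not Erlang or Java; the language is chosen only for illustration). The helper names utf16_units and find_units are hypothetical. It assumes well-formed UTF-16 and shows that a position found by searching code units is directly usable for slicing, even though it differs from the code point index.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def find_units(haystack, needle):
    """Return the code unit index of the first match of needle, or -1."""
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1

# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP, so it takes
# two code units (a surrogate pair) in UTF-16.
text = utf16_units("a\U0001D11Eb")
pos = find_units(text, utf16_units("b"))

# pos is a code unit index (3 here), while the code point index of "b"
# is 2. The code unit index is still a perfectly good position.
tail = text[pos:]
```

Because lead surrogates, trail surrogates and ordinary code units occupy disjoint ranges, a match for a well-formed needle cannot begin or end in the middle of a surrogate pair.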

Note that in Unicode, abstract characters are represented by *sequences*
of code points (because of combining marks). So a "position" in a Unicode
string, regardless of whether it is a code point index or a code unit
index, does not in general correspond to a count of abstract characters
from the start of the string. And for most purposes, this makes no
difference; all you need is a position, not a count.
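
A small sketch of the combining-mark point, again in Python purely for illustration. Python strings are indexed by code point, so this shows that even a code point index is not a count of abstract characters. The example word is my own, not taken from the Unicode standard.

```python
import unicodedata

composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Same abstract character, different numbers of code points.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

word = "cafe\u0301s"     # "cafés" spelled with a combining accent
pos = word.index("s")    # code point index of "s"

# Abstract characters before "s", counted after composing with NFC.
chars_before = len(unicodedata.normalize("NFC", word[:pos]))

# The position still splits the string correctly without any counting.
head, tail = word[:pos], word[pos:]
```

Here pos is 5 while only four abstract characters precede the "s", yet the split at pos is exactly what a caller wants.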

From section 5.4 of Unicode 4.x (http://www.unicode.org/versions/Unicode4.1.0/):

# [...] By accessing at most two code units, a process using the UTF-16
# encoding form can therefore interpret any Unicode character. Determining
# character boundaries requires at most scanning one preceding or one
# following code unit without regard to any other context.
#
# As long as an implementation does not remove either of a pair of surrogate
# code units or incorrectly insert another character between them, the
# integrity of the data is maintained. Moreover, even if the data becomes
# corrupted, the corruption is localized, unlike with some other multibyte
# encodings such as Shift-JIS or EUC. Corrupting a single UTF-16 code unit
# affects only a single character. Because of non-overlap (see Section 2.5,
# Encoding Forms), this kind of error does not propagate throughout the rest
# of the text.
#
# UTF-16 enjoys a beneficial frequency distribution in that, for the majority
# of all text data, surrogate pairs will be very rare; non-surrogate code
# points, by contrast, will be very common.

[and furthermore, surrogate pairs are only used for characters that
aren't likely to have any special interpretation in a given format/syntax]

# Not only does this help to limit the performance penalty incurred when
# handling a variable-width encoding, but it also allows many processes
# either to take no specific action for surrogates or to handle surrogate
# pairs with existing mechanisms that are already needed to handle character
# sequences.
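
The boundary rule quoted above can be sketched as follows. This is a rough Python illustration under my own assumptions, not code from the standard: the function names are hypothetical, the input is a list of integer code units, and an unpaired surrogate is simply returned as-is rather than reported as an error.

```python
def is_lead(unit):
    """True for a lead (high) surrogate code unit."""
    return 0xD800 <= unit <= 0xDBFF

def is_trail(unit):
    """True for a trail (low) surrogate code unit."""
    return 0xDC00 <= unit <= 0xDFFF

def char_start(units, i):
    """Return the index where the character containing code unit i begins.

    Only the one preceding code unit is ever inspected.
    """
    if is_trail(units[i]) and i > 0 and is_lead(units[i - 1]):
        return i - 1
    return i

def decode_at(units, i):
    """Return (code_point, next_index) for the character starting at i.

    Only the one following code unit is ever inspected.
    """
    u = units[i]
    if is_lead(u) and i + 1 < len(units) and is_trail(units[i + 1]):
        cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
        return cp, i + 2
    return u, i + 1   # BMP code unit, or an unpaired surrogate left as-is

# "a", U+1D11E as a surrogate pair, "b"
units = [0x0061, 0xD834, 0xDD1E, 0x0062]

start = char_start(units, 2)             # index 2 is the trail surrogate
code_point, nxt = decode_at(units, start)
```

Starting from an arbitrary code unit index, one backward look and one forward look are enough to recover the whole character, with no dependence on the rest of the string.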

-- 
David Hopwood <david.nospam.hopwood@REDACTED>