[erlang-questions] correct terminology for referring to strings

Thu Aug 2 05:18:47 CEST 2012

> There is no byte sequence valid in UTF-8 that is not also
> valid in Latin-1.

This is incorrect.

Latin-1 code points are a subset of Unicode codepoints.

Codepoints are not bytes. Codepoints are indexes in character tables.
Latin-1 is a table of at most 256 characters, whereas Unicode is at this
point a table of more than 100,000 characters.  The codepoints in the
range 127-159 are not printable characters: ISO 8859-1 leaves them
unassigned, and Unicode defines them only as control characters.

When it comes to the binary representation of these codepoints, Latin-1 is
encoded as literal bytes because all of its codepoints are less than 256.
Unicode codepoints, on the other hand, can be larger than 255, so in order
to represent them as bytes they need to be encoded.

Latin-1 bytes larger than 127 are not the same character in UTF-8 because
UTF-8 uses the 8th bit to encode multi-byte sequences that represent
Unicode codepoints larger than 127. So while values in a list greater than
127 are valid Latin-1, if those values represent UTF-8 bytes, the
characters are not the same.

For instance, 233 is the codepoint for an accented e (é) in Latin-1 and
Unicode. The binary representation of that character in Latin-1 is
literally the byte <<233>>, but when the codepoint is encoded as UTF-8, it
is the bytes <<195,169>>.
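
A minimal sketch of that difference, in case it helps. The module and
function names are made up for illustration; only the bit syntax itself is
standard Erlang. The last function spells out the two-byte UTF-8 layout
(110xxxxx 10xxxxxx) by hand, which applies to codepoints 128 through 2047.

```erlang
-module(encode_demo).
-export([latin1/1, utf8/1, utf8_two_byte/1]).

%% Latin-1: the codepoint is stored directly as a single byte.
latin1(CP) when CP >= 0, CP =< 255 ->
    <<CP>>.

%% UTF-8: let the bit syntax do the encoding.
utf8(CP) ->
    <<CP/utf8>>.

%% UTF-8 by hand for the two-byte range only (128..2047):
%% the top 5 bits go into 110xxxxx, the low 6 bits into 10xxxxxx.
utf8_two_byte(CP) when CP >= 16#80, CP =< 16#7FF ->
    <<2#110:3, (CP bsr 6):5, 2#10:2, (CP band 16#3F):6>>.
```

For 233, latin1/1 gives <<233>>, while utf8/1 and utf8_two_byte/1 both give
<<195,169>>.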

The list [195,169] is never going to be an accented e in Erlang because, as
far as Erlang is concerned, that is a list of Latin-1 codepoints, which are
the characters Ã and ©.  Ever see CafÃ© on a webpage? That is because the
site told the browser that its HTML was Latin-1 when it was actually UTF-8.
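
The CafÃ© effect can be reproduced with the unicode module. This is only a
sketch, and the module name mojibake_demo is hypothetical: it encodes a
list of codepoints as UTF-8 and then deliberately decodes the bytes with
the wrong encoding.

```erlang
-module(mojibake_demo).
-export([misread/1]).

%% Encode codepoints as UTF-8, then (wrongly) read the bytes back as
%% Latin-1, so every byte becomes a codepoint of its own.
misread(Codepoints) ->
    Utf8 = unicode:characters_to_binary(Codepoints, unicode, utf8),
    unicode:characters_to_list(Utf8, latin1).
```

Calling mojibake_demo:misread([67,97,102,233]) on the codepoints for Café
returns [67,97,102,195,169], which is the five-character string CafÃ©.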

It just so happens that [195,169] is also of type chardata() because all
valid Latin-1 codepoints are also valid Unicode codepoints. In either case,
[195,169] is not an accented e. At the very least it is a list of integers
whose values represent UTF-8 encoded bytes, but until you convert those
UTF-8 bytes to Unicode codepoints it will never be chardata() with the
correct characters.
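
One possible way to do that conversion, assuming the list really does hold
UTF-8 byte values. The module name, function name, and error atom are
invented for this sketch; unicode:characters_to_list/2 and
list_to_binary/1 are the real library calls.

```erlang
-module(bytes_demo).
-export([bytes_to_codepoints/1]).

%% Treat a list of integers 0..255 as UTF-8 bytes and decode them into
%% Unicode codepoints. list_to_binary/1 raises badarg for values above 255.
bytes_to_codepoints(Bytes) when is_list(Bytes) ->
    case unicode:characters_to_list(list_to_binary(Bytes), utf8) of
        Codepoints when is_list(Codepoints) ->
            {ok, Codepoints};
        {error, _Decoded, _Rest} ->
            {error, not_utf8};
        {incomplete, _Decoded, _Rest} ->
            {error, not_utf8}
    end.
```

With this, bytes_demo:bytes_to_codepoints([195,169]) returns {ok,[233]},
which is the accented e as a codepoint.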

To summarize: Unicode is a table of codepoints.  A codepoint is an index in
the table. UTF-8 is a codec for turning codepoints to and from bytes. UTF-8
cannot be used to refer to the integers in what Erlang calls chardata():
each integer() in a chardata() list is a valid Unicode codepoint, not a
byte. UTF-8 can only refer to a sequence of bytes.

Eric.