<p dir="ltr"><br>
> There is no byte sequence valid in UTF-8 that is not also<br>
> valid in Latin-1.</p>
<p dir="ltr">This is incorrect.</p>
<p dir="ltr">Latin-1 codepoints are a subset of Unicode codepoints.</p>
<p dir="ltr">Codepoints are not bytes. Codepoints are indexes into character tables. Latin-1 is a table of at most 256 characters, whereas Unicode is at this point a table of more than 100,000 characters. The codepoints in the range 128-159 are left undefined by ISO 8859-1 and are assigned only to control characters in Unicode, so no printable Latin-1 character lives there.</p>
<p dir="ltr">When it comes to the binary representation of these codepoints, Latin-1 is encoded as literal bytes because all of its codepoints are less than 256. Unicode codepoints, on the other hand, can be larger than 255, so in order to represent them as bytes they need to be encoded.</p>
<p dir="ltr">Latin-1 bytes larger than 127 are not the same character in UTF-8, because UTF-8 uses the 8th bit to encode multi-byte sequences that represent Unicode codepoints larger than 127. So while values greater than 127 in a list are valid Latin-1, if those values represent UTF-8 bytes, the characters are not the same.</p>
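<p dir="ltr">To make the use of the 8th bit concrete, here is a minimal sketch that hand-encodes a codepoint between 128 and 2047 as two UTF-8 bytes. The module and function names (utf8_layout, encode2) are made up for illustration; in real code you would simply write <<CP/utf8>>.</p>

```erlang
-module(utf8_layout).
-export([encode2/1]).

%% Hypothetical helper, for illustration only.
%% Two-byte UTF-8 layout: 110xxxxx 10yyyyyy
%%   xxxxx  = high 5 bits of the codepoint
%%   yyyyyy = low 6 bits of the codepoint
encode2(CP) when is_integer(CP), CP >= 128, CP =< 2047 ->
    <<2#110:3, (CP bsr 6):5, 2#10:2, (CP band 63):6>>.
```

<p dir="ltr">For codepoint 233 (2#11101001) the high bits are 2#00011 and the low bits are 2#101001, so utf8_layout:encode2(233) gives <<195,169>>, the same as <<233/utf8>>.</p>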
<p dir="ltr">For instance, 233 is the codepoint for an accented e in both Latin-1 and Unicode. The binary representation of that character in Latin-1 is literally the byte <<233>>, but when the codepoint is encoded as UTF-8, it is the bytes <<195,169>>.</p>
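<p dir="ltr">The same codepoint can be pushed through both encodings with the standard unicode module. This is only a sketch; the module name codepoint_bytes and the function demo/0 are placeholders I made up.</p>

```erlang
-module(codepoint_bytes).
-export([demo/0]).

%% Encode the single codepoint 233 (accented e) two different ways.
demo() ->
    Chars = [233],  % a list of codepoints, not bytes
    %% Latin-1: every codepoint below 256 becomes exactly one byte.
    Latin1 = unicode:characters_to_binary(Chars, unicode, latin1),
    %% UTF-8: codepoints above 127 become multi-byte sequences.
    Utf8 = unicode:characters_to_binary(Chars, unicode, utf8),
    {Latin1, Utf8}.
```

<p dir="ltr">Calling codepoint_bytes:demo() returns {<<233>>,<<195,169>>}: one codepoint, two different byte representations.</p>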
<p dir="ltr">The list [195,169] is never going to be an accented e in Erlang because, as far as Erlang is concerned, that is a list of Latin-1 codepoints, which are the characters Ã and ©. Ever see CafÃ© on a webpage? That is because they told the browser that their HTML was Latin-1 when it was actually UTF-8.</p>
<p dir="ltr">It just so happens that [195,169] is also of type chardata() because all valid Latin-1 codepoints are also valid Unicode codepoints. In either case, [195,169] is not an accented e. At the very least it is a list of integers whose values represent UTF-8 encoded bytes, but until you convert those UTF-8 bytes to Unicode codepoints it will never be chardata() with the correct characters.</p>
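<p dir="ltr">The conversion from UTF-8 bytes back to codepoints might look roughly like this. Again a sketch, with the made-up module name utf8_decode, that reads the same two integers once as Latin-1 and once as UTF-8.</p>

```erlang
-module(utf8_decode).
-export([demo/0]).

%% Interpret the same two bytes under two different encodings.
demo() ->
    Bytes = list_to_binary([195, 169]),
    %% Read as Latin-1: each byte is its own codepoint (195 and 169).
    AsLatin1 = unicode:characters_to_list(Bytes, latin1),
    %% Read as UTF-8: the two bytes decode to the single codepoint 233.
    AsUtf8 = unicode:characters_to_list(Bytes, utf8),
    {AsLatin1, AsUtf8}.
```

<p dir="ltr">utf8_decode:demo() returns {[195,169],[233]}. Only the second list holds the codepoint for the accented e.</p>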
<p dir="ltr">To summarize: Unicode is a table of codepoints. A codepoint is an index in the table. UTF-8 is a codec for turning codepoints to and from bytes. UTF-8 cannot be used to refer to what Erlang calls chardata(). chardata() is a list of integer() values, each of which is a valid Unicode codepoint. UTF-8 can only refer to a sequence of bytes. <br>
<br>
Eric.<br>
</p>