[erlang-questions] correct terminology for referring to strings

Richard O'Keefe ok@REDACTED
Thu Aug 2 06:39:58 CEST 2012


On 2/08/2012, at 3:18 PM, Eric Moritz wrote:

> 
> > There is no byte sequence valid in UTF-8 that is not also
> > valid in Latin-1.
> 
> This is incorrect.

Let's be pedantic here.
There is no sequence of bytes B such that
(1) B conforms to the rules of UTF-8 and
(2) B cannot also be decoded as Latin 1.

This is 100% correct.
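
To see the asymmetry concretely, here is a minimal Erlang sketch
(the module name and the sample bytes are mine, not from the thread):
decoding bytes as Latin 1 always succeeds, while the reverse direction
can fail.

    -module(latin1_vs_utf8).        %% illustrative name only
    -export([demo/0]).

    demo() ->
        Utf8 = <<16#C3,16#A5>>,     %% UTF-8 for U+00E5, a-ring
        %% Decoding any bytes as Latin 1 succeeds, one character per byte:
        [16#C3,16#A5] = unicode:characters_to_list(Utf8, latin1),
        %% The same bytes are also well-formed UTF-8, one character:
        [16#E5] = unicode:characters_to_list(Utf8, utf8),
        %% The converse does fail: <<16#E9,$n>> is perfectly good Latin 1
        %% (e-acute, n), but 16#E9 followed by a non-continuation byte
        %% is not well-formed UTF-8:
        {error, _, _} = unicode:characters_to_list(<<16#E9,$n>>, utf8),
        ok.
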
> 
> Latin-1 code points are a subset of Unicode codepoints.

True and totally irrelevant.  The statement in question has
nothing to say about codepoints.

> Codepoints are not bytes.

Also true and totally irrelevant.  The statement in question
has nothing to say about codepoints.

> Codepoints are indexes in character tables. Latin-1 is a table of a possible 256 characters whereas Unicode is at this point a table of more than 100,000 characters.  There are actually codepoints in the range of 127-159 which are unused and if used are technically invalid Latin-1 and Unicode.

I suppose it depends on what you mean by "Latin 1".
If you look at the code tables in
http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-094.pdf
you're right: 127 is not there.

But then TAB, CR, and LF are not there either.

If you want to talk about "Latin 1" in any sense that includes
those control characters, you have to admit the others.
The framework is specified by ECMA 43, which requires ESC and
DEL.  So byte 127 is not invalid.
If you want TAB, CR, LF, and so on, then you get them from
ECMA 48, the C0 set.  Bytes with values 128 to 159 *also* come from
ECMA 48.

So when I talk about "Latin 1" I mean all the printing characters
*and* all the ECMA C0 and C1 control characters.

It's not just me.  Look for example at
http://www.madore.org/~david/computers/unicode/cstab.html#Latin-1
which shows the control character names in red.
More importantly, look at the mapping tables produced by the
Unicode consortium, specifically 8859-1.TXT.
0x7F    0x007F  #       DELETE
0x80    0x0080  #       <control>
...
0x9F    0x009F  #       <control>
0xA0    0x00A0  #       NO-BREAK SPACE

The Unicode consortium think that 0x7F to 0x9F are Latin-1
control characters -- they use #UNDEFINED to mark characters
that are not defined at all in the source character set --
and for what it's worth, U+007F to U+009F are listed in the
Unicode character data base as *defined* characters with
class Cc, and they formerly even named the functions they
perform.
> 
> When it comes to the binary representation of these codepoints.

I specifically wrote about BYTE SEQUENCES.  Nothing else is
relevant.  I did not write about codepoints.

>  Latin-1 is encoded as literal bytes because all codepoints are less than 256.

You can encode Latin 1 in all sorts of ways.
Bytes work because it's a member of the
ECMA "8-Bit Coded Character Set" family.'

>  Unicode codepoints on the other hand can be larger than 255 so in order to represent them as bytes they need to be encoded.

That's not relevant.  It doesn't matter *what* UTF-8 encodes here,
the only point is that since a UTF-8 sequence is a byte sequence,
and since every byte sequence is a valid Latin 1 encoding, there
is no byte sequence that is a valid UTF-8 sequence but not a valid
Latin 1 sequence.

There are of course many ways to encode Unicode as sequences of bytes.
We could, to be ridiculous, represent each Unicode codepoint as a
sequence of 21 bytes, each 0x30 or 0x31, spelling out its 21 bits as
ASCII digits.  More realistically,
SCSU and BOCU have advantages.  The thing is, there is no byte
sequence that cannot be interpreted as representing a sequence of
Latin 1 characters (including control characters), so there is no way
of being certain what you have.
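
That claim is mechanical to check.  A small sketch (the variable names
are mine) pushes all 256 byte values through both decoders; the Latin 1
reading always succeeds, so a successful Latin 1 decode tells you
nothing about what the bytes were meant to be.

    %% Paste into the Erlang shell: one binary holding every byte value.
    AllBytes = list_to_binary(lists:seq(0, 255)),
    %% As Latin 1: 256 characters, one per byte, no way to fail.
    256 = length(unicode:characters_to_list(AllBytes, latin1)),
    %% As UTF-8: rejected, because 16#80 is a continuation byte with
    %% no lead byte in front of it.
    {error, _AsciiPrefix, _Rest} = unicode:characters_to_list(AllBytes, utf8).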

Of course an XML document must start with zero or more white space
characters followed by a left angle bracket.  A higher level protocol
like that _may_ impose constraints that let you figure out what you
have.  Similarly an Erlang module must start with zero or more
white space characters or % comments followed by a hyphen-minus character.
That is enough to allow XML-style discrimination between big- and
little-endian 4-byte and 2-byte representations, some flavour of
EBCDIC, and some extension of ASCII, but not to discriminate between
Latin 1 and UTF-8.
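
A deliberately simplified sketch of that kind of discrimination (the
module, function, and atom names are mine, and it skips the leading
white space and % comments the real rule allows): it classifies a
module's opening bytes by where the first hyphen-minus lands, which
separates the encoding families but, as the last clauses show, cannot
separate Latin 1 from UTF-8.

    -module(guess_encoding).        %% illustrative name only
    -export([family/1]).

    %% Hyphen-minus is 16#2D in ASCII and every extension of it,
    %% and 16#60 in EBCDIC.
    family(<<0,0,0,$-, _/binary>>) -> four_byte_big_endian;
    family(<<$-,0,0,0, _/binary>>) -> four_byte_little_endian;
    family(<<0,$-,     _/binary>>) -> two_byte_big_endian;
    family(<<$-,0,     _/binary>>) -> two_byte_little_endian;
    family(<<16#60,    _/binary>>) -> some_flavour_of_ebcdic;
    family(<<$-,       _/binary>>) -> some_extension_of_ascii;  %% Latin 1? UTF-8? no way to tell
    family(_)                      -> unknown.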

I've deleted the rest of the message as also beside the point.



