[erlang-questions] correct terminology for referring to strings
Eric Moritz
eric@REDACTED
Thu Aug 2 06:54:50 CEST 2012
Sorry. I took your statement out of context.
On Aug 2, 2012 12:40 AM, "Richard O'Keefe" <ok@REDACTED> wrote:
>
> On 2/08/2012, at 3:18 PM, Eric Moritz wrote:
>
> >
> > > There is no byte sequence valid in UTF-8 that is not also
> > > valid in Latin-1.
> >
> > This is incorrect.
>
> Let's be pedantic here.
> There is no sequence of bytes B such that
> (1) B conforms to the rules of UTF-8 and
> (2) B cannot also be decoded as Latin 1.
>
> This is 100% correct.
> >
> > Latin-1 code points are a subset of Unicode codepoints.
>
> True and totally irrelevant. The statement in question has
> nothing to say about codepoints.
>
> > Codepoints are not bytes.
>
> Also true and totally irrelevant. The statement in question
> has nothing to say about codepoints.
>
> > Codepoints are indexes in character tables. Latin-1 is a table of a
> > possible 256 characters, whereas Unicode is at this point a table of more
> > than 100,000 characters. There are actually codepoints in the range of
> > 127-159 which are unused and if used are technically invalid Latin-1 and
> > Unicode.
>
> I suppose it depends on what you mean by "Latin 1".
> If you look at the code tables in
> http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-094.pdf
> you're right: 127 is not there.
>
> But then neither are TAB, CR, or LF there.
>
> If you want to talk about "Latin 1" in any sense that includes
> those control characters, you have to admit the others.
> The framework is specified by ECMA 43, which requires ESC and
> DEL. So byte 127 is not invalid.
> If you want TAB, CR, LF, and so on, then you get them from
> ECMA 48, the C0 set. Bytes with values 128 to 159 *also* come from
> ECMA 48.
>
> So when I talk about "Latin 1" I mean all the printing characters
> *and* all the ECMA C0 and C1 control characters.
>
> It's not just me. Look for example at
> http://www.madore.org/~david/computers/unicode/cstab.html#Latin-1
> which shows the control character names in red.
> More importantly, look at the mapping tables produced by the
> Unicode consortium, specifically 8859-1.TXT.
> 0x7F 0x007F # DELETE
> 0x80 0x0080 # <control>
> ...
> 0x9F 0x009F # <control>
> 0xA0 0x00A0 # NO-BREAK SPACE
>
> The Unicode consortium think that 0x7F to 0x9F are Latin-1
> control characters -- they use #UNDEFINED to mark characters
> that are not defined at all in the source character set --
> and for what it's worth, U+007F to U+009F are listed in the
> Unicode character data base as *defined* characters with
> class Cc, and they formerly even named the functions they
> perform.
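
A quick way to check the point about 0x7F to 0x9F is to decode those bytes and ask the Unicode character database for their general category. This is only a sketch, and it assumes that Python's "latin-1" codec follows the same byte-to-codepoint mapping as 8859-1.TXT; the variable names are illustrative.

```python
import unicodedata

# Decode every byte from 0x7F to 0x9F as Latin 1 and inspect the result.
for byte in range(0x7F, 0xA0):
    ch = bytes([byte]).decode("latin-1")
    # The code point equals the byte value, as in the 8859-1.TXT mapping.
    assert ord(ch) == byte
    # General category Cc marks a control character.
    assert unicodedata.category(ch) == "Cc"

# 0xA0 is the first byte after the C1 control range.
nbsp = bytes([0xA0]).decode("latin-1")
nbsp_name = unicodedata.name(nbsp)
```

None of the bytes in that range raises a decoding error, which matches the reading that they are defined control characters rather than holes in the table.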
> >
> > When it comes to the binary representation of these codepoints.
>
> I specifically wrote about BYTE SEQUENCES. Nothing else is
> relevant. I did not write about codepoints.
>
> > Latin-1 is encoded as literal bytes because all codepoints are less
> > than 256.
>
> You can encode Latin 1 in all sorts of ways.
> Bytes work because it's a member of the
> ECMA "8-Bit Coded Character Set" family.
>
> > Unicode codepoints on the other hand can be larger than 255 so in order
> > to represent them as bytes they need to be encoded.
>
> That's not relevant. It doesn't matter *what* UTF-8 encodes here,
> the only point is that since a UTF-8 sequence is a byte sequence,
> and since every byte sequence is a valid Latin 1 encoding, there
> is no byte sequence that is a valid UTF-8 sequence but not a valid
> Latin 1 sequence.
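
The asymmetry can be illustrated with a minimal sketch. It assumes "Latin 1" in the sense used above (all 256 byte values, C0 and C1 controls included), which is how Python's "latin-1" codec behaves; the helper name is_valid_utf8 is made up for this example.

```python
# Every one of the 256 byte values decodes under Latin 1.
all_bytes = bytes(range(256))
as_latin1 = all_bytes.decode("latin-1")  # never raises
round_trip = as_latin1.encode("latin-1") == all_bytes

# A valid UTF-8 sequence is just bytes, so Latin 1 accepts it too.
utf8_bytes = "\u00e9\u20ac".encode("utf-8")  # bytes C3 A9 E2 82 AC
misread = utf8_bytes.decode("latin-1")       # five Latin 1 characters


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data conforms to the rules of UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

The converse does not hold: is_valid_utf8(b"\xe9") is false, although that single byte is a perfectly good Latin 1 character.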
>
> There are of course many ways to encode Unicode as sequences of bytes.
> We could, to be ridiculous, represent each Unicode codepoint as a
> sequence of 21 bytes each with value 0x30 or 0x31. More realistically,
> SCSU and BOCU have advantages. The thing is, there is no byte
> sequence that cannot be interpreted as representing a sequence of
> Latin 1 characters (including control characters), so there is no way
> of being certain what you have.
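
The deliberately ridiculous encoding is easy to sketch, and it makes the point that the output of any such scheme is still a byte sequence that reads as Latin 1. The names BITS, encode_bits and decode_bits are invented for this illustration; 21 bits suffice because the highest code point, U+10FFFF, is below 2^21.

```python
BITS = 21  # enough for every code point up to U+10FFFF


def encode_bits(text: str) -> bytes:
    """Encode each code point as 21 bytes, each 0x30 ('0') or 0x31 ('1')."""
    return "".join(format(ord(ch), "021b") for ch in text).encode("ascii")


def decode_bits(data: bytes) -> str:
    """Invert encode_bits; raises ValueError on malformed input."""
    if len(data) % BITS != 0:
        raise ValueError("length is not a multiple of 21")
    digits = data.decode("ascii")
    return "".join(
        chr(int(digits[i:i + BITS], 2)) for i in range(0, len(digits), BITS)
    )


sample = "a\u20ac\U0001F600"
encoded = encode_bits(sample)
# The encoded form is also a valid sequence of Latin 1 characters.
as_latin1 = encoded.decode("latin-1")
```

Nothing in the bytes themselves says which interpretation was intended.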
>
> Of course an XML document must start with zero or more white space
> characters followed by a left angle bracket. A higher level protocol
> like that _may_ impose constraints that let you figure out what you
> have. Similarly, an Erlang module must start with zero or more
> white space characters or % comments followed by a hyphen-minus character.
> That is enough to allow XML-style discrimination between big- and
> little-endian 4-byte and 2-byte representations, some flavour of
> EBCDIC, and some extension of ASCII, but not to discriminate between
> Latin 1 and UTF-8.
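
A rough sketch of that kind of discrimination follows. It assumes the first character of the file is in the ASCII range (white space, % or hyphen-minus) and uses the position of zero bytes in the first four bytes to guess the code unit width and byte order. The function name is hypothetical, and EBCDIC detection is left out to keep the example short.

```python
def sniff_width_and_order(prefix: bytes) -> str:
    """Guess code-unit width and byte order from the first four bytes.

    Assumes the first character is in the ASCII range.
    """
    b = prefix[:4]
    if len(b) == 4 and b[:3] == b"\x00\x00\x00" and b[3] != 0:
        return "4-byte big-endian"
    if len(b) == 4 and b[0] != 0 and b[1:] == b"\x00\x00\x00":
        return "4-byte little-endian"
    if len(b) >= 2 and b[0] == 0 and b[1] != 0:
        return "2-byte big-endian"
    if len(b) >= 2 and b[0] != 0 and b[1] == 0:
        return "2-byte little-endian"
    # No zero bytes to go on: some extension of ASCII, nothing more.
    return "1-byte units (Latin 1 or UTF-8: cannot tell)"


source = "-module(m)."
guesses = {
    name: sniff_width_and_order(source.encode(name))
    for name in ("utf-32-be", "utf-32-le", "utf-16-be", "utf-16-le",
                 "latin-1", "utf-8")
}
```

The last two entries of guesses come out identical, which is exactly the case the leading-character trick cannot resolve.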
>
> I've deleted the rest of the message as also beside the point.
>
>