[erlang-questions] correct terminology for referring to strings

Eric Moritz eric@REDACTED
Thu Aug 2 06:54:50 CEST 2012


Sorry. I took your statement out of context.
On Aug 2, 2012 12:40 AM, "Richard O'Keefe" <ok@REDACTED> wrote:

>
> On 2/08/2012, at 3:18 PM, Eric Moritz wrote:
>
> >
> > > There is no byte sequence valid in UTF-8 that is not also
> > > valid in Latin-1.
> >
> > This is incorrect.
>
> Let's be pedantic here.
> There is no sequence of bytes B such that
> (1) B conforms to the rules of UTF-8 and
> (2) B cannot also be decoded as Latin 1
>
> This is 100% correct.
> >
> > Latin-1 code points are a subset of Unicode codepoints.
>
> True and totally irrelevant.  The statement in question has
> nothing to say about codepoints.
>
> > Codepoints are not bytes.
>
> Also true and totally irrelevant.  The statement in question
> has nothing to say about codepoints.
>
> > Codepoints are indexes in character tables. Latin-1 is a table of a
> > possible 256 characters whereas Unicode is at this point a table of more
> > than 100,000 characters.  There are actually codepoints in the range of
> > 127-159 which are unused and if used are technically invalid Latin-1 and
> > Unicode.
>
> I suppose it depends on what you mean by "Latin 1".
> If you look at the code tables in
> http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-094.pdf
> you're right: 127 is not there.
>
> But then neither are TAB, CR, or LF there.
>
> If you want to talk about "Latin 1" in any sense that includes
> those control characters, you have to admit the others.
> The framework is specified by ECMA 43, which requires ESC and
> DEL.  So byte 127 is not invalid.
> If you want TAB, CR, LF, and so on, then you get them from
> ECMA 48, the C0 set.  Bytes with values 128 to 159 *also* come from
> ECMA 48.
>
> So when I talk about "Latin 1" I mean all the printing characters
> *and* all the ECMA C0 and C1 control characters.
>
> It's not just me.  Look for example at
> http://www.madore.org/~david/computers/unicode/cstab.html#Latin-1
> which shows the control character names in red.
> More importantly, look at the mapping tables produced by the
> Unicode consortium, specifically 8859-1.TXT.
> 0x7F    0x007F  #       DELETE
> 0x80    0x0080  #       <control>
> ...
> 0x9F    0x009F  #       <control>
> 0xA0    0x00A0  #       NO-BREAK SPACE
>
> The Unicode consortium think that 0x7F to 0x9F are Latin-1
> control characters -- they use #UNDEFINED to mark characters
> that are not defined at all in the source character set --
> and for what it's worth, U+007F to U+009F are listed in the
> Unicode character data base as *defined* characters with
> class Cc, and they formerly even named the functions they
> perform.
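
The mapping described above can be checked with a short Python sketch. It is not part of the original thread, and it assumes that Python's built-in `latin-1` codec and `unicodedata` module agree with the Unicode consortium's 8859-1.TXT table; the variable names are illustrative only.

```python
import unicodedata

# Bytes 0x7F..0x9F: DEL plus the C1 control range discussed above.
control_bytes = bytes(range(0x7F, 0xA0))
decoded = control_bytes.decode("latin-1")

# Each byte should map to the code point with the same numeric value.
same_value = all(ord(ch) == b for ch, b in zip(decoded, control_bytes))

# Collect the Unicode general categories that occur in that range.
categories = {unicodedata.category(ch) for ch in decoded}

# 0xA0 is the first printing character after the C1 block.
nbsp_name = unicodedata.name(bytes([0xA0]).decode("latin-1"))

print(same_value, categories, nbsp_name)
```

If the assumption holds, every byte in that range decodes without error and reports the control category Cc.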
> >
> > When it comes to the binary representation of these codepoints.
>
> I specifically wrote about BYTE SEQUENCES.  Nothing else is
> relevant.  I did not write about codepoints.
>
> >  Latin-1 is encoded as literal bytes because all codepoints are less
> > than 256.
>
> You can encode Latin 1 in all sorts of ways.
> Bytes work because it's a member of the
> ECMA "8-Bit Coded Character Set" family.
>
> >  Unicode codepoints on the other hand can be larger than 255 so in order
> > to represent them as bytes they need to be encoded.
>
> That's not relevant.  It doesn't matter *what* UTF-8 encodes here,
> the only point is that since a UTF-8 sequence is a byte sequence,
> and since every byte sequence is a valid Latin 1 encoding, there
> is no byte sequence that is a valid UTF-8 sequence but not a valid
> Latin 1 sequence.
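
The asymmetry in that argument can be illustrated with a minimal Python sketch. It is not from the original thread, and it takes "Latin 1" in the sense used above (printing characters plus the C0 and C1 controls), which is how Python's `latin-1` codec behaves. The helper names `is_valid_utf8` and `decode_latin1` are hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes conform to the rules of UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def decode_latin1(data: bytes) -> str:
    """Decode bytes as Latin 1; every byte value 0..255 is accepted."""
    return data.decode("latin-1")


# A UTF-8 byte sequence containing multi-byte characters.
utf8_bytes = "na\u00efve \u20ac".encode("utf-8")

# The same bytes also decode as Latin 1, one character per byte.
as_latin1 = decode_latin1(utf8_bytes)

# The converse fails: a lone 0xE9 is fine in Latin 1 but is not valid UTF-8.
lone_e_acute = b"\xe9"
print(is_valid_utf8(utf8_bytes), is_valid_utf8(lone_e_acute))
```

Decoding the UTF-8 bytes as Latin 1 gives mojibake rather than an error, which is why validity alone cannot tell the two apart.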
>
> There are of course many ways to encode Unicode as sequences of bytes.
> We could, to be ridiculous, represent each Unicode codepoint as a
> sequence of 21 bytes each with value 0x30 or 0x31.  More realistically,
> SCSU and BOCU have advantages.  The thing is, there is no byte
> sequence that cannot be interpreted as representing a sequence of
> Latin 1 characters (including control characters), so there is no way
> of being certain what you have.
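
The deliberately ridiculous encoding mentioned above is easy to sketch in Python. This is an illustration only, not part of the original thread; it assumes most-significant-bit-first order, which the text does not specify, and `encode_bits` and `decode_bits` are hypothetical names.

```python
BITS = 21  # enough for the largest code point, U+10FFFF


def encode_bits(text: str) -> bytes:
    """Encode each code point as 21 bytes, each 0x30 ('0') or 0x31 ('1')."""
    return "".join(format(ord(ch), "021b") for ch in text).encode("ascii")


def decode_bits(data: bytes) -> str:
    """Inverse of encode_bits; rejects input that is not a multiple of 21 bytes."""
    if len(data) % BITS != 0:
        raise ValueError("length is not a multiple of 21 bytes")
    chunks = [data[i:i + BITS] for i in range(0, len(data), BITS)]
    return "".join(chr(int(chunk.decode("ascii"), 2)) for chunk in chunks)


sample = "A\u20ac\U0010ffff"
encoded = encode_bits(sample)

# The encoded form is itself a perfectly good Latin 1 (and ASCII) byte sequence.
print(len(encoded), encoded.decode("latin-1")[:BITS])
```

The output bytes are all ordinary digits, so nothing in the byte sequence itself says whether it is this encoding or plain Latin 1 text.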
>
> Of course an XML document must start with zero or more white space
> characters followed by a left angle bracket.  A higher level protocol
> like that _may_ impose constraints that let you figure out what you
> have.  Similarly an Erlang module must start with zero or more
> white space characters or % comments followed by a hyphen-minus character.
> That is enough to allow XML-style discrimination between big- and
> little-endian 4-byte and 2-byte representations, some flavour of
> EBCDIC, and some extension of ASCII, but not to discriminate between
> Latin 1 and UTF-8.
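
A rough Python sketch of that kind of first-character discrimination follows. It is not from the original thread and it simplifies: it ignores leading white space, comments, and byte order marks, and it uses Python's `cp037` codec as one stand-in for "some flavour of EBCDIC". The function name `sniff_family` and its labels are hypothetical.

```python
def sniff_family(prefix: bytes, first_char: str = "<") -> str:
    """Guess the encoding family from how a known first character is stored."""
    c = first_char.encode("ascii")   # the character in an ASCII extension
    e = first_char.encode("cp037")   # the same character in one EBCDIC flavour
    p = prefix[:4]
    if p == b"\x00\x00\x00" + c:
        return "4-byte big-endian"
    if p == c + b"\x00\x00\x00":
        return "4-byte little-endian"
    if p[:2] == b"\x00" + c:
        return "2-byte big-endian"
    if p[:2] == c + b"\x00":
        return "2-byte little-endian"
    if p[:1] == e:
        return "EBCDIC flavour"
    if p[:1] == c:
        return "ASCII extension"
    return "unknown"


doc = "<a>\u00e9</a>"
# Latin 1 and UTF-8 store the leading '<' identically, so both land in
# the same family and the sniffer cannot separate them.
print(sniff_family(doc.encode("latin-1")), sniff_family(doc.encode("utf-8")))
```

The same function with `first_char="-"` covers the Erlang module case, with the same limitation.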
>
> I've deleted the rest of the message as also beside the point.
>
>