Language change proposal

Richard A. O'Keefe ok@REDACTED
Wed Nov 5 03:19:31 CET 2003


I proposed
    -erlang(Encoding, Version).

"Michael Hobbs" <michael@REDACTED> came up with the obvious
question:
	This presents a chicken-or-egg problem in that how is an XML processor to
	process an encoding declaration before it knows what the encoding is?
	
The XML specification spells this out in as much detail as one could
possibly wish.

    document ::= prolog element Misc*
    prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
    XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'

Either an XML document has an XML declaration or it doesn't.
If it doesn't, the encoding must be UTF-8 (or, maybe UTF-16 with
a Byte Order Mark).
If it does have an XML declaration, then the first 5 characters
must literally be '<?xml'; no white space is permitted before
the XML declaration.

Appendix F of the XML specification is non-normative, but it explains
how you can automatically detect the encoding.  Not perfectly.  But
well enough to read the encoding declaration.  You see, if there is
an XML declaration, then the first character MUST be '<', and whatever
the document begins with must be an encoding of that character.
So it's quite easy to distinguish between
UCS-4 (big-endian)    UCS-4 (little-endian)  UCS-4 (nuxi order)
UCS-2 (big-endian)    UCS-2 (little-endian)
some version or extension of ISO 646 (ASCII family)
some version of EBCDIC

The only characters that may appear in an XML declaration are
< ? > ' " =
space, tab, cr, lf,
a-z A-Z 0-9
- _ . :
and these are all in the invariant part of ISO 646.  I'm not sure whether
"_" has the same encoding in every version of EBCDIC, but anything you
find in an encoding name which is _not_ a letter, digit, hyphen, dot, or colon
may be assumed to be an underscore.
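
The underscore rule can be sketched as follows. This assumes the encoding
name has already been provisionally decoded into a list of character
codes, with unrecognised bytes left as whatever codes they happened to
map to; the module name enc_name and the function repair/1 are
hypothetical.

```erlang
-module(enc_name).
-export([repair/1]).

%% Replace every character that is not a letter, digit, hyphen, dot,
%% or colon with an underscore, since underscore is the only other
%% character an encoding name may contain.
repair(Name) ->
    [case is_known_char(C) of true -> C; false -> $_ end || C <- Name].

is_known_char(C) when C >= $a, C =< $z -> true;
is_known_char(C) when C >= $A, C =< $Z -> true;
is_known_char(C) when C >= $0, C =< $9 -> true;
is_known_char($-) -> true;
is_known_char($.) -> true;
is_known_char($:) -> true;
is_known_char(_) -> false.
```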

The following encoding names are defined by XML:
UTF-8
UTF-16
ISO-10646-UCS-2
ISO-10646-UCS-4
ISO-8859-1 ... ISO-8859-9 (presumably this should go up to ISO-8859-15)
ISO-2022-JP
Shift_JIS
EUC-JP
with a recommendation that other encoding names be taken from the
IANA registry, and that matching should ignore case.

Since an Erlang source file would either literally begin with an -erlang
declaration or else not have one at all, we could pull exactly the same
kind of auto-detection trick, looking for a "-" instead of "<".  To
better fit Erlang syntax, we'd convert the XML/IANA names to lower case
and replace '-' by '_', so
-erlang(iso_8859_1, [10,3,1]).
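
A minimal sketch of both halves of that idea: sniffing the family from
the leading "-e" of an -erlang declaration, and turning an XML/IANA name
into the corresponding atom. The module name erl_sniff, its functions,
and the result atoms are hypothetical. The EBCDIC byte values assume a
code page such as 037, and the unusual UCS-4 byte orders and Byte Order
Marks are not handled.

```erlang
-module(erl_sniff).
-export([family/1, erlang_name/1]).

%% Guess the encoding family from the bytes that would encode the
%% leading "-e" of an -erlang declaration.
family(<<0,0,0,$-, _/binary>>) -> ucs4_big_endian;
family(<<$-,0,0,0, _/binary>>) -> ucs4_little_endian;
family(<<0,$-,0,$e, _/binary>>) -> ucs2_big_endian;
family(<<$-,0,$e,0, _/binary>>) -> ucs2_little_endian;
family(<<$-,$e, _/binary>>) -> ascii_family;
%% Assumption: "-" is 16#60 and "e" is 16#85, as in EBCDIC code page 037.
family(<<16#60,16#85, _/binary>>) -> ebcdic;
%% No recognisable declaration at the very start of the file.
family(_) -> undetected.

%% Convert an XML/IANA encoding name (a string) to an Erlang atom:
%% lower-case the ASCII letters and replace "-" by "_".
erlang_name(Name) ->
    list_to_atom([to_erlang_char(C) || C <- Name]).

to_erlang_char($-) -> $_;
to_erlang_char(C) when C >= $A, C =< $Z -> C + ($a - $A);
to_erlang_char(C) -> C.
```

With these definitions, erl_sniff:erlang_name("ISO-8859-1") gives the
atom iso_8859_1, and a file whose first bytes are the ASCII text
"-erlang(" is reported as ascii_family.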

	So, to bring the wagons back around to Erlang, if there ever is
	an -erlang(Encoding, Version) declaration, it would be nice if
	it is clearly stated what encoding should be used for the
	"-erlang(Encoding, Version)" text.
	
The same as the encoding used for the rest of the file, of course.
Just exactly like XML.  (People do actually _read_ the XML specification
before spouting about it, don't they?)



