[erlang-questions] utf8 in source files

Allan Wegan allanwegan@REDACTED
Mon Nov 8 21:53:22 CET 2010


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

> The question is "only" how it should work to be as useful as possible
> and still reasonably enough backward compatible.

It should first check for a BOM at the start of the file, then fall back
to UTF-8 if none is found, and finally fall back to ISO-8859-1 if UTF-8
parsing fails.
This procedure is suggested for files with unknown encoding in Chapter
"Heuristic identification of UTF-8" in the Erlang documentation at
<http://www.erlang.org/doc/apps/stdlib/unicode_usage.html#id58696>.
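
A minimal sketch of that fallback order, written in Python rather than
Erlang only to keep it short. The helper name decode_source and the BOM
table are assumptions made for illustration; nothing here is an existing
Erlang or compiler API.

```python
import codecs

# Illustrative BOM table (an assumption, not taken from any Erlang tool).
# UTF-32 is listed before UTF-16 because the UTF-32-LE BOM begins with
# the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_source(data):
    """Return (text, encoding_name) for the raw bytes of a source file."""
    # 1. A BOM, if present, decides the encoding. Decoding errors after
    #    a BOM are not caught here and propagate to the caller.
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding), encoding
    # 2. No BOM: try strict UTF-8.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # 3. Not valid UTF-8: ISO-8859-1 maps every byte to a character,
        #    so this step cannot fail.
        return data.decode("iso-8859-1"), "iso-8859-1"
```

Pure ASCII input ends up in the UTF-8 branch, which is harmless because
ASCII decodes identically under both encodings.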


Most ISO-8859-1 character combinations that are also valid UTF-8 look
weird even in non-English texts. Some examples:
Â¡ Â¢ Â£ Â¤ Â¥ Â¦ Â§ Â¨ Â© Âª Â« Â¬ Â® Â¯ Â° Â± Â² Â³ Â´ Âµ Â¶ Â· Â¸ Â¹
Âº Â» Â¼ Â½ Â¾ Â¿ Ã¡ Ã¢ Ã£ Ã¤ Ã¥ Ã¦ Ã§ Ã¨ Ã© Ãª Ã« Ã¬ Ã® Ã¯ Ã° Ã± Ã² Ã³
Ã´ Ãµ Ã¶ Ã· Ã¸ Ã¹ Ãº Ã» Ã¼ Ã½ Ã¾ Ã¿

There are in fact as many valid combinations as there are valid Unicode
code points. But the chance of misidentifying a file as valid UTF-8 when
it is in fact encoded in ISO-8859-1 is really small.
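
To illustrate why a false positive is unlikely, here is a small Python
sketch. The function name looks_like_utf8 and the sample byte strings
are made up for this example. Ordinary ISO-8859-1 text puts an accented
letter next to ASCII, which breaks the UTF-8 rules. Only an odd-looking
pair such as "Ã©" passes the check.

```python
def looks_like_utf8(data):
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "Grüße aus Köln" encoded as ISO-8859-1. The byte 0xFC (ü) can never
# appear in UTF-8, so the text is rejected.
latin1_text = b"Gr\xfc\xdfe aus K\xf6ln"

# "Ã©" encoded as ISO-8859-1 is the byte pair C3 A9, which is also the
# UTF-8 encoding of "é". This is the rare ambiguous case.
ambiguous = b"\xc3\xa9"

print(looks_like_utf8(latin1_text))  # False
print(looks_like_utf8(ambiguous))    # True
```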

- -- 
Allan Wegan
Jabber: allanwegan@REDACTED
ICQ:    209459114
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (MingW32)

iQEcBAEBAgAGBQJM2GNCAAoJENm5axHh7Acxe2EH/A4lmp4dxvButPapWREvuePk
wg+PnoRUU9PAue4sspPddYOPhdqlD5lWXVL86hUG84uIDiKNx2PSoBL0WT0XeQPj
E6g21100HpjTcyI0qcBoWpj2KKurAGVtc3TP6eFbHn21X9AsP3CAckj1xyiy2BRh
4R16kDUABhgp4SkwP0D51wgRcCjaKmZHJvHPhCZFfogk/vs1SWajZ/JBi/Z8DDkV
rlBS4ddu1+AMICFGlybV/99uqh8nc2YTQTl1HqyPf10KMz676ar1m5P75ZUw0wdd
khLXyEs+gzOD5vxeEZ3uflFPQm2UwBM6+J9N5X0hiV2McTG5pLp69ccP5375F68=
=pNEA
-----END PGP SIGNATURE-----

