Language change proposal

Mon Nov 3 02:48:38 CET 2003

I wrote that:
> One of the long term goals for Erlang is that it should support Unicode;

Note that this wasn't a _recommendation_, it was a straightforward
report of fact.  Jonas Barklund's "es_std_0.6.ps" states plainly in
section 3 that "Standard Erlang" uses Unicode.  As it happens, I DO
recommend that Erlang should support Unicode, but it wasn't me that
said it back in 1998.

Joachim Durchholz <joachim.durchholz@REDACTED> wrote:
	This is something that I'd advise against.
	...
	1) If somebody gives me software to maintain, I might hit a,
	say, Chinese glyph somewhere.  I'd have to download the proper
	font just to be able to look at the sources.

es_std_0.6.ps describes \u escapes exactly like Java/C++/C99.
This means that it is quite untrue that you would need special fonts
to look at sources.  (Not that suitable free fonts haven't been available
for several years now, ...)  Not that the result would be particularly
readable, but then, if I gave you source code full of identifiers like
"waea", "tuhituhi", "kupenga", "tiimata", and so on, you wouldn't find
that particularly readable despite it using none but ASCII letters.
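
To make the point about escapes concrete, here is a minimal sketch in
Python rather than Erlang.  It assumes Python's own \uXXXX escape form
as a stand-in for whatever the draft standard finally specifies, and
ascii_view is a hypothetical helper name, not part of any tool.

```python
def ascii_view(source: str) -> str:
    """Show every non-ASCII character as a backslash escape."""
    return source.encode("ascii", "backslashreplace").decode("ascii")


def unescape(view: str) -> str:
    """Turn the escaped, ASCII-only view back into the original text."""
    return view.encode("ascii").decode("unicode_escape")


# A source line containing two Chinese characters (U+4E2D, U+6587).
line = "\u4e2d\u6587 = 1"

view = ascii_view(line)
print(view)                      # displayable with any ASCII-only font
print(unescape(view) == line)    # the escaped form loses nothing
```

The escaped view needs no special font, and it is exactly as unreadable
as one would expect.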

You cannot realistically maintain software in a language you can't at
least read, which of course is why Chinese Erlang programmers should be
allowed to use Chinese words.

	2) There are many glyphs that look the same.  For example, that
	"a" letter might actually have an entirely different encoding
	since it's from the Russian alphabet.

True: U+0430 CYRILLIC SMALL LETTER A.  It's not clear how much of a
problem this is in practice.  In any case, since people expect to work
with XML, this is a problem Erlang *has* to live with somehow.
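
A short Python sketch of the look-alike problem, using the standard
unicodedata module.  The scripts_used helper is hypothetical, and
taking the first word of a character's name as its script is a crude
heuristic of mine, not an official Unicode property lookup.

```python
import unicodedata

latin_a = "a"            # U+0061
cyrillic_a = "\u0430"    # U+0430, drawn like Latin "a" in most fonts


def describe(ch: str) -> str:
    """Return the code point and Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def scripts_used(identifier: str) -> set:
    """Crude guess at the scripts in an identifier (first word of each name)."""
    return {unicodedata.name(ch).split()[0] for ch in identifier if ch.isalpha()}


print(describe(latin_a))
print(describe(cyrillic_a))
print(latin_a == cyrillic_a)        # they compare unequal
print(scripts_used("p\u0430y"))     # a mixed-script identifier
```

A tool that wanted to warn about such identifiers could flag any name
for which scripts_used returns more than one element.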

	Unicode also has issues with letter case.

More precisely, the world's scripts have issues with letter case.

	For one, there is no good mapping of lowercase and uppercase
	letters (and cannot be:  for example, the German <ss> has no
	uppercase equivalent, it transliterates to SS or SZ depending on
	personal whim).

Case conversion is not a simple one-to-one mapping.  That's not Unicode's
fault, that's just the way things are.  There are, for example, two
conventions for converting lower case to upper case in French (lose the
accents/keep the accents).  There's the point, spelled out in the Unicode
book itself, that the Turkish upper case equivalent of "i" is not "I" but
capital-I-with-dot-above, and the Turkish lower case equivalent of "I" is
not "i" but lower-case-dotless-i.  Since Erlang is a case sensitive language,
this is a non-problem:  you don't care about case conversion when processing
Erlang sources because you don't ever do it.  When it comes to data, it's
up to the application to decide whether to use locale-sensitive case mapping
or the case mapping tables that are available free from unicode.org.
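
For data, the difference between default and locale-sensitive case
mapping can be sketched in a few lines of Python.  The two Turkish
tables and the upper_tr/lower_tr helpers are hypothetical and cover
only the dotted and dotless i; a real application would use a proper
locale library.

```python
# Default (locale-independent) mappings built into Python strings.
sharp_s = "\u00df"                 # German sharp s
print(sharp_s.upper())             # becomes two letters, not one
print("i".upper() == "I")          # the default mapping is not the Turkish one

# Hypothetical Turkish overrides, limited to the dotted/dotless i pairs.
TURKISH_UPPER = {"i": "\u0130", "\u0131": "I"}
TURKISH_LOWER = {"I": "\u0131", "\u0130": "i"}


def upper_tr(text: str) -> str:
    """Upper-case text using the Turkish i rules, default rules otherwise."""
    return "".join(TURKISH_UPPER.get(ch, ch.upper()) for ch in text)


def lower_tr(text: str) -> str:
    """Lower-case text using the Turkish i rules, default rules otherwise."""
    return "".join(TURKISH_LOWER.get(ch, ch.lower()) for ch in text)


print(upper_tr("i") == "\u0130")   # capital I with dot above
print(lower_tr("I") == "\u0131")   # lower-case dotless i
```

Which of the two mappings is right depends on the data, which is why
the choice belongs to the application and not to the language.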

	Additionally, Unicode has /three/ lettercase categories: lower, upper, 
	and title case. (The latter information is gleaned from the Haskell 
	language report, I don't know anything further about Unicode.)

This is true.  Again, I don't see what the problem is.  If you want to find
a stick to beat Unicode with, there are stouter ones.  (Like the fact that
the encoding of a glyph is not unique, and there is a bewildering choice of
normalisation forms.)
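
Both points are easy to see with Python's standard unicodedata module.
This is only an illustration of the standard's behaviour, using
e-acute and the DZ-with-caron digraph as arbitrary examples.

```python
import unicodedata

# Three case categories: the DZ-with-caron digraph exists in all three.
for ch in ("\u01c4", "\u01c5", "\u01c6"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# Two encodings of the same visible e-acute.
composed = "\u00e9"        # one code point
decomposed = "e\u0301"     # letter e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                                 # unequal as raw strings
print(unicodedata.normalize("NFC", decomposed) == composed)   # equal after NFC
print(unicodedata.normalize("NFD", composed) == decomposed)   # equal after NFD
```

Anything that compares identifiers or atoms for equality has to pick
one normalisation form and apply it consistently.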

	(There's also a portability issue: there are still EBCDIC machines 
	around that don't support Unicode. I don't think this is relevant for 
	Erlang though *g*)

What machines are those?  Certainly not IBM ones; z/Architecture has
hardware support for Unicode.  If it comes to that, there are probably
still a few PDP-11s in service that only support ASCII.  What of it?

	My personal idea about Unicode is that it is massively overengineered 
	for simple tasks like representing source code.

It is, on the other hand, the only international widely supported large
character set standard around, and it _wasn't_ engineered just for simple
tasks.

	What are the advantages of keeping some XML data as atoms?

The same as the advantages of keeping any other data as atoms.  Atoms are
physically compact and testing for atom equality is very fast; if you want
to write a program that transforms XML to something else, you'd be mad to
do it in XSLT if you could do it in Erlang, and that means pattern matching
against XML trees is interesting.  SWI Prolog doesn't just store generic
identifiers and attribute names as atoms, it stores #PCDATA as atoms as well,
and SWI Prolog is used with very large RDF files.  (Mind you, SWI Prolog is
a multithreaded system whose atom table _is_ garbage collected.)

	About ISO Latin and Windows:  That's one of the reasons why I
	don't use umlauts in my source code, except when it comes to
	literal strings.  And I'm painfully aware that having umlauts in
	strings makes my sources nonportable; the better solution is to
	have some internationalization support.

As a matter of fact, vowels with umlauts and the sharp-s character are
no trouble at all:  ISO Latin 1, ISO Latin 9 (=8859-15), MacRoman, and
Windows all support them perfectly well.  The big problem is things like
English quotation marks, which MacRoman and Windows support, but none of
the ISO 8859 character sets.
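
This is easy to check with Python's bundled codecs.  I am assuming
here that "Windows" means code page 1252, and can_encode is a
hypothetical helper written for this sketch.

```python
CHARSETS = ["latin-1", "iso8859-15", "mac_roman", "cp1252"]


def can_encode(text: str, charset: str) -> bool:
    """Report whether every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


umlauts = "\u00e4\u00f6\u00fc\u00df"   # a, o, u with umlaut, and sharp s
quotes = "\u201c\u201d"                # English double quotation marks

for cs in CHARSETS:
    print(cs, can_encode(umlauts, cs), can_encode(quotes, cs))

# The characters exist everywhere, but not at the same byte values.
print("\u00fc".encode("latin-1"), "\u00fc".encode("mac_roman"))
```

The last line is a reminder that "supported" does not mean
"interchangeable": the bytes still differ between character sets.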

Even if we confine Erlang to 8-bit character sets, people DO have reason
to use different 8-bit character sets, and some way of indicating _which_
8-bit character set was used is going to be increasingly important.  (I
repeat my observation that ISO Latin 9 has a Euro character and ISO Latin 1
does not, so there is a strong incentive for Europeans to switch to
ISO Latin 9 as their default character set.)
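
A small Python sketch of why the label matters: the same byte is a
different character depending on which 8-bit set the reader assumes.
The byte 0xA4 is chosen because it is where ISO Latin 9 put the Euro.

```python
data = b"\xa4"   # one byte taken from some 8-bit source file

as_latin1 = data.decode("latin-1")       # U+00A4 CURRENCY SIGN
as_latin9 = data.decode("iso8859-15")    # U+20AC EURO SIGN

print(f"U+{ord(as_latin1):04X}", f"U+{ord(as_latin9):04X}")

# Going the other way, ISO Latin 1 simply has no Euro to write.
try:
    "\u20ac".encode("latin-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False

print(latin1_has_euro)
```

Without some declaration of the character set, a reader of the file
cannot tell which of the two characters the author meant.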

PS: the words are "wire", "write", "net", and "start".