Strings (was: Re: are Mnesia tables immutable?)

Richard A. O'Keefe ok@REDACTED
Wed Jun 28 04:22:57 CEST 2006


Thomas Lindgren <thomasl_erlang@REDACTED> wrote:
	Assuming one would want to implement a character
	datatype for Erlang, do you think Unicode is the way
	to go? Or something else (e.g., settle for UTF-8)? Or
	some sort of framework a la Common Lisp? Or wait and
	see a bit longer?
	
UTF-8 is a fine thing, but it is not a *character* data type.
UTF-8 is a way of encoding *strings*.
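
To make the distinction concrete: a character is a Unicode code point,
and UTF-8 spends anywhere from one to four bytes on each one.  A small
Erlang sketch (mine, purely illustrative; surrogates are not checked):

    -module(utf8_width).   %% module name is made up for this sketch
    -export([width/1]).

    %% How many bytes UTF-8 needs for one code point.
    width(C) when C >= 0,        C < 16#80      -> 1;
    width(C) when C >= 16#80,    C < 16#800     -> 2;
    width(C) when C >= 16#800,   C < 16#10000   -> 3;
    width(C) when C >= 16#10000, C =< 16#10FFFF -> 4.

    %% width($a)      => 1
    %% width(16#0101) => 2   (a with macron)
    %% width(16#4E2D) => 3   (a CJK character)

So a byte of UTF-8 is not a character, and a character is not a byte;
what UTF-8 gives you is a representation of a whole sequence of them.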

Let's consider SWI Prolog, which has four ways of representing strings:
    - as lists of integers, the traditional Prolog way.
      In this form, each Unicode code-point is represented by one integer.
    - as lists of single-character atoms, the ISO substandard way.
      In this form, each Unicode code-point is represented by one atom.
    - as packed byte strings.  The internal representation used, visible
      from C, is UTF-8, but whenever *Prolog* code looks at a string,
      it sees Unicode code-points.  sub_string/5 uses character counts,
      not byte counts, to specify a slice.
    - as atoms.  Again, the internal representation, visible from C,
      is UTF-8, but whenever *Prolog* code looks at an atom, it sees
      Unicode code-points.  sub_atom/5 uses character counts, not byte
      counts, to specify a slice.
I/O can be done with bytes (get_byte/[1,2], put_byte/[1,2]) or with
Unicode code-points represented as integers (get_code/[1,2], put_code/[1,2])
or single-character atoms (get_char/[1,2], put_char/[1,2]).
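
By way of an Erlang analogue of that last point (my sketch, not SWI
Prolog code): the same three-character text kept once as a list of
code points and once as its UTF-8 bytes.  Slicing by character count
and slicing by byte count are different operations, which is why
sub_string/5 and sub_atom/5 count characters.

    -module(slice_demo).   %% module name is made up for this sketch
    -export([demo/0]).

    demo() ->
        Chars = [$a, 16#0101, $b],             % a, a-with-macron, b: 3 characters
        Bytes = [16#61, 16#C4, 16#81, 16#62],  % the same text as UTF-8: 4 bytes
        {lists:sublist(Chars, 1, 2),   % first two *characters*: a, a-with-macron
         lists:sublist(Bytes, 1, 2)}.  % first two *bytes*: cuts the macron vowel in half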

Jan really did an *amazing* amount of work moving SWI Prolog over to Unicode,
and all things considered, it went amazingly smoothly.  Yes, problems are
still being reported in the SWI Prolog mailing list, BUT (a) funnily enough
they all seem to involve Windows, and (b) these days they only seem to
involve things like Prolog and C disagreeing about what the default encoding
is.  Prolog programs that don't do anything fancy just work.

So I would definitely suggest looking hard at SWI Prolog for a model, and
if anyone with money wants to see something happen with Unicode in Erlang
they could do worse than ask Jan if he wants to do some consulting...

I note that UTF-8 is not the only storage scheme around for Unicode.
There's another one which is in some ways better, and that's SCSU (the
Simple Compression Scheme for Unicode).  Compared with UTF-8, SCSU has
three weak points and two strong points:

    - if S1 and S2 are sequences of Unicode code points,
      lex-order(S1, S2) == lex-order(UTF-8-encode(S1), UTF-8-encode(S2)).
      This is not true for SCSU.  (A small Erlang sketch after this list
      illustrates the UTF-8 side of the property.)  Given that lexical
      order of code-points isn't particularly close to _any_ locale's
      collation order, it's not clear that this matters very much.

    - UTF-8 locales are available for most UNIX-like operating systems.
      This is not true for SCSU.  However, nothing says that the internal
      representation of text has to be the same as the external one.

    - SCSU encoding is not unique; there is a standard *decoding*
      algorithm, but no standard *encoding* algorithm, although there
      is an example one.

    + Latin-1 data _is_ SCSU data; for many alphabetic scripts SCSU
      encodes an n-character string in n+1 or n+2 bytes.  (Useful here,
      where Maori uses five vowels with macrons, upper and lower case;
      being able to store one character per byte with just one byte of
      overhead is nice.)

    + An n-character CJK text requires 2n+1 bytes (one byte says
      "switch to 16-bit"), so SCSU is a good representation for CJK.
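
Here is the promised Erlang sketch (mine, purely illustrative) of the
UTF-8 side of the first point above.  The encoder is a plain
transcription of the UTF-8 bit layout; surrogates and other invalid
code points are not checked.  Because Erlang compares lists element by
element, comparing two code-point lists and comparing their encodings
are both plain lexical comparisons, and they always agree.

    -module(utf8_order).   %% module name is made up for this sketch
    -export([same_order/2]).

    utf8(C) when C < 16#80 ->
        [C];
    utf8(C) when C < 16#800 ->
        [16#C0 bor (C bsr 6), 16#80 bor (C band 16#3F)];
    utf8(C) when C < 16#10000 ->
        [16#E0 bor (C bsr 12),
         16#80 bor ((C bsr 6) band 16#3F),
         16#80 bor (C band 16#3F)];
    utf8(C) when C =< 16#10FFFF ->
        [16#F0 bor (C bsr 18),
         16#80 bor ((C bsr 12) band 16#3F),
         16#80 bor ((C bsr 6) band 16#3F),
         16#80 bor (C band 16#3F)].

    encode(CodePoints) -> lists:flatmap(fun utf8/1, CodePoints).

    %% Always true: lex-order(S1, S2) == lex-order(encode(S1), encode(S2)).
    same_order(S1, S2) ->
        (S1 < S2) =:= (encode(S1) < encode(S2)).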

Back in the Paleolithic when computers were things you could walk inside
and programmers chipped their programs out of cardboard with flint knives,
the Burroughs B6700 MCP (= operating system) tagged each file with the
character set it was written in.  (OK, the choice was 6-bit BCL, 7-bit
ASCII, or 8-bit EBCDIC, but the _idea_ of each file being tagged with a
character encoding was there, and the _idea_ of the file's character set
being _automatically_ translated by the operating system into whichever
character encoding the program wanted without the program getting involved
was also there.)  The nastiest problem we have with Unicode is that UNIX
file systems don't do that.  (Windows attributes could.  I don't
understand the MacOS X file system well enough yet, but I'm pretty sure
it could do this.  And Solaris lets you attach properties to files; in
effect each file is also a directory with the properties as files.  Try
'man openat'.)  Just because the user's terminal is set up for UTF-8
doesn't mean that this file isn't in Windows 1252 or that file isn't in
MacRoman.  Of course there are ways of saying what encoding to use; the
problem is that there are too many.  XML declarations; MIME types; ISO 2022
announcers; environment variables.  This is where the remaining SWI Prolog
problems are, in conflicts between "OEM" and "ANSI" encodings arising because
more than one software component has to be told about more than one encoding
in more than one way.

If a Unicode file begins with a "byte order mark", it is easy for a program
to tell whether the representation used is UCS4-BE, UCS4-LE, UCS2-BE,
UCS2-LE, SCSU, or UTF-8.  Telling the difference between 8-bit variants of
ASCII is not so easy.
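
A minimal sketch of that check in Erlang, assuming the first few bytes
of the file have been read into a binary (the module name and result
atoms are mine):

    -module(bom_sniff).
    -export([bom/1]).

    %% Classify a file by its byte order mark, if it starts with one.
    %% Clause order matters: the UCS4-LE mark begins with the same two
    %% bytes as the UCS2-LE mark, so it has to be tested first.
    bom(<<16#00,16#00,16#FE,16#FF, _/binary>>) -> ucs4_be;
    bom(<<16#FF,16#FE,16#00,16#00, _/binary>>) -> ucs4_le;
    bom(<<16#FE,16#FF, _/binary>>)             -> ucs2_be;
    bom(<<16#FF,16#FE, _/binary>>)             -> ucs2_le;
    bom(<<16#0E,16#FE,16#FF, _/binary>>)       -> scsu;
    bom(<<16#EF,16#BB,16#BF, _/binary>>)       -> utf8;
    bom(_)                                     -> unknown.

If no mark is present you are back to guessing, which is exactly the
problem with the 8-bit variants of ASCII.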

What about Common Lisp?

Well, I have CLtL1, CLtL2; I keep the HyperSpec on-line; I have three
different implementations of Common Lisp on my SPARC and two on my Mac.
There are little things like 'Every character with case is in one-to-one
correspondence with some other character with the opposite case' which
are a poor fit to non-ASCII reality:
  - in Latin-1, y-umlaut and sharp-s are definitely lower case but have
    no upper-case equivalent
  - in Unicode, there are THREE cases, not two
  - in scripts such as Greek or 18th-century English the lower-case
    version of an upper-case letter is *position*-dependent (different
    forms for final-vs-non-final "s", for example)
    This issue, at least, was *certainly* understood when CLtL1 was written,
    because Interlisp-D used the XNS character set, and the XNS character
    set included position-dependent letters.
  - in Unicode, case conversion is *locale*-dependent
That is, whether you call things like CHAR-UPCASE and CHAR-DOWNCASE
part of a "framework" or not, they are *not* part of a framework that
can cope well with Unicode, because Unicode case mapping for strings is
*not* a simple matter of iterating case mapping for characters.  Much as
I love Lisp, I honestly don't see anything in Common Lisp character or
string handling that Erlang needs to learn from, and more than a little
that it would be a bad idea to copy.
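
A small Erlang sketch of why that is (mine; the mapping below is a toy
covering only ASCII and sharp-s, and real Unicode case mapping is far
larger and locale-dependent):

    -module(case_demo).   %% module name is made up for this sketch
    -export([upcase_string/1]).

    %% String case mapping has to be a string-to-string function,
    %% because one character can uppercase to two.
    upcase_string(Str) -> lists:flatmap(fun upcase_char/1, Str).

    upcase_char(16#00DF) -> [$S, $S];                  % sharp-s uppercases to "SS"
    upcase_char(C) when C >= $a, C =< $z -> [C - 32];  % plain ASCII letters
    upcase_char(C) -> [C].                             % leave everything else alone

    %% upcase_string([$s,$t,$r,$a,16#00DF,$e]) => "STRASSE"
    %% Six characters in, seven characters out, so no character-to-character
    %% CHAR-UPCASE can express it.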



