Strings (was: Re: are Mnesia tables immutable?)

Richard A. O'Keefe ok@REDACTED
Thu Jun 29 06:29:48 CEST 2006


I described a replacement for the representation of strings in Erlang's
external term form which
(1) is extremely fast to decode
(2) is tolerably space-efficient; most of the Indic scripts will fit in
    2 bytes per character, even Bopomofo.  And for that matter, Thai.

Romain Lenglet <rlenglet@REDACTED>
continues to press for a far more "heavyweight" solution.

	The most efficient is still most often to use an official 8-bit 
	encoding for strings. E.g. for Thai, TIS-620 is the most 
	efficient, for Japanese, ISO-2022 (or others) is the most 
	efficient, etc.
	
I don't see any reason to believe that ISO-2022 is the most space
efficient representation for Japanese.  For one thing, ISO-2022 doesn't
have anything to say about Japanese characters as such.  For another,
ISO-2022 is a framework for streams using multiple code-sets and using
"announcers" to switch between codes.  Those announcers are rather longer
(3 bytes) than the single byte that SCSU would take to switch to DBCS.

Perhaps Romain Lenglet has ISO-2022-JP (or -JP-1, -JP-2, or -JP-3),
which is an application of ISO 2022, in mind.  It would be interesting
to know whether those are more space efficient than SCSU or not; I
suspect not.

So we have two possible approaches here:

(A) Encode sequences of integers that could be Unicode code-points
    using variable-byte encoding (instead of one byte per number);
    a sketch in Erlang follows this list.
  + Extremely fast to encode and decode.
  + Never worse than UTF-8 and often better.
  - Not as compact as script-specific encodings.

(B) Encode sequences of integers that could be Unicode code-points
    using the Standard Compression Scheme for Unicode (SCSU; Unicode
    Technical Standard #6, http://www.unicode.org/reports/tr6/).
  + Fairly simple to decode.  (My SCSU decoder is 103 SLOC of Smalltalk,
    including tables.)
  + Gets down to 1 byte per character for the commoner alphabetic scripts,
    including Latin, Greek, Cyrillic, Arabic, and the Indic scripts,
    amongst others.
  + Gets down to 2 bytes per character for BMP characters including the
    more common CJK characters and averages less than 2 bytes per
    character for Japanese.
  + The report comes with a sample encoder, which is 75 lines of C.
  - The sample encoder doesn't use all the features of SCSU, and writing
    an encoder that does better requires careful design.
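
To make approach (A) concrete, here is one possible variable-byte
encoding, sketched in Erlang.  Take the details as assumptions rather
than a specification: a plain little-endian base-128 scheme, 7 payload
bits per byte, with the top bit set on every byte except the last, so
one byte reaches 16r7F, two bytes reach 16r3FFF (taking in the Indic
blocks, Thai, and Bopomofo), and three bytes cover the rest of Unicode,
which is never longer than UTF-8.

    %% Variable-byte (base-128) sketch, least significant byte first.
    vbyte_encode(CP) when CP >= 0, CP < 16#80 ->
        [CP];
    vbyte_encode(CP) when CP >= 16#80 ->
        [(CP band 16#7F) bor 16#80 | vbyte_encode(CP bsr 7)].

    vbyte_decode([B | Rest]) when B < 16#80 ->
        {B, Rest};
    vbyte_decode([B | Rest]) when B >= 16#80 ->
        {High, Rest1} = vbyte_decode(Rest),
        {(High bsl 7) bor (B band 16#7F), Rest1}.

    encode_string(CodePoints) ->
        lists:flatmap(fun vbyte_encode/1, CodePoints).

    decode_string([])    -> [];
    decode_string(Bytes) -> {CP, Rest} = vbyte_decode(Bytes),
                            [CP | decode_string(Rest)].

A Thai character such as U+0E01 encodes as the two bytes [16#81, 16#1C];
an ASCII character stays a single byte.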

Looking at TIS 620.2533 (http://www.langbox.com/codeset/tis620.html)
I see that it is a typical "ASCII+top half" character set, and that
the top half is simply a shifted version of the Unicode code points
U+0E01 to U+0E5B (which probably means that the Thai block in Unicode
was copied from TIS 620 in the first place).  And what _that_ means is
that this is _exactly_ the kind of alphabetic script that SCSU was
designed to support efficiently.  The U+0E00..U+0E7F half-block is
half-block 28, so to encode Thai characters using SCSU would take
<16r18 {SD0}, 16r1C {Thai}> followed by ONE BYTE PER CHARACTER.
(The "minimal encoder" in section 8.4 of UTR-6 won't do this well.  But
an encoder that buffers a few characters before deciding whether to set
a dynamic window should be able to cope easily.)

Let me repeat that.  To encode N characters of Thai text using TIS-620
would take N bytes.  To do so using SCSU would take N+2 bytes.
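
For concreteness, here is that single case sketched in Erlang.  It is
not a general SCSU encoder, only the path for text drawn from ASCII
and the Thai half-block:

    %% SCSU sketch for ASCII + Thai (U+0E00..U+0E7F) text only.
    %% SD0 (16r18) with offset byte 16r1C points dynamic window 0 at
    %% U+0E00; every character after that is a single byte.  (NUL, TAB,
    %% CR and LF would also pass through; omitted here for brevity.)
    scsu_encode_thai(CodePoints) ->
        [16#18, 16#1C | [scsu_byte(C) || C <- CodePoints]].

    scsu_byte(C) when C >= 16#0E00, C =< 16#0E7F ->
        C - 16#0E00 + 16#80;            % window byte, 16r80..16rFF
    scsu_byte(C) when C >= 16#20, C =< 16#7F ->
        C.                              % ASCII bytes stand for themselves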

That's not a lot of overhead.  Can we do anything to reduce it?
Why yes.  Simply leave the coding state unchanged between strings,
so that a sequence of Thai strings would have *one* "set dynamic window"
opcode for all of them.

	But how to know which 8-bit encoding to use?

You don't need to.  While SCSU doesn't handle _every_ possible 8-bit
character set, it _does_ handle the scripts used by several thousand
million people, with a very modest investment in tables.

	But the application usually knows, or *should* know which
	encoding to use (e.g. using the NLS configuration, etc.).
	
Wrong.  Just because my personal "NLS configuration" is set to
en_NZ.ISO8859-1 (on systems that have it) or en_AU.ISO8859-1 (on systems
that don't), that *doesn't* mean that the text I am processing is
or even *could be* encoded using ISO 8859-1.  As it happens, I have quite
a lot of text in Maori, which would need ISO 8859-4 (Latin 4), 8859-10
(Latin 6), or 8859-13 (Latin 7).

Yes, that means that those characters won't display correctly on my
terminal, but just because I am *storing*, *transmitting*, or *processing*
characters, that does NOT mean that I want to *display* them.

In fact, the whole point of Unicode is that I should be able to store,
transmit, and process a vast range of characters *without* having to be
concerned with a painfully large range of encodings.

	I think that we should find a way to identify lists of integers 
	as strings, i.e. as lists of Unicode code points, and 
	to "attach" a "preferred encoding" to such lists.
	
Why?

We've just seen that strings can be stored compactly in a revised external
term representation WITHOUT knowing anything about ANY encodings other than
Unicode and SCSU.

	Of course, the problem is that to be usable and efficient, a lot 
	of 8-bit encodings have to be known and implemented in the 
	emulator (and erl_interface), which may increase the code size 
	significantly.

But there are literally HUNDREDS of encodings.
http://www.iana.org/assignments/character-sets
lists 250 of them, and there are a lot more that aren't listed (yet).
Some of them can be handled using simple table lookup, but quite a
few cannot.
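
For the ones that *can* be handled by table lookup, the decoding side
really is trivial.  A sketch, assuming Table is a 128-element tuple
mapping the top-half bytes of some "ASCII + top half" character set to
Unicode code points (the per-encoding table is the part that has to be
known and shipped):

    %% Decode a single-byte "ASCII + top half" encoding to a list of
    %% Unicode code points, given a 128-entry tuple for 16r80..16rFF.
    decode_single_byte(Bytes, Table) ->
        [ if B < 16#80 -> B;                    % ASCII half maps to itself
             true      -> element(B - 16#80 + 1, Table)
          end
        || B <- Bytes ].

The tables are the easy part; it is the stateful and multi-byte
encodings that cannot be dealt with this way.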

	But those encoding/decoding primitives could also be made usable
	directly by programs, which would be very useful in general.
	
Well, it _would_ be useful, _if_ everyone agreed about what the wretched
things were _called_.  For example, a de facto standard part of Unix these
days is 'iconv'.  Unfortunately, the character set names are *not* standard.
And I have had a program throw thousands of error messages at me because
the character set name it used internally was not one known to the version
of iconv on my system, although the actual character set _was_ known.

This is one of the reasons why asking the 'NLS configuration' about
encodings won't work very well.  There is some text, somehow you discover
that my environment is set up for ISO8859-10, and you encode 'ISO8859-10' as
the encoding name, BUT the data get sent to a system that calls the same
character set 'l6' or 'iso-latin-6' and has never heard of 'ISO8859-10'.

Using Unicode internally _everywhere_ and using SCSU as the 'packing'
mechanism _everywhere_ means NO INCOMPATIBILITIES BETWEEN ERLANGS.
(As it happens, SCSU _is_ one of the IANA character sets.)

	I was not thinking about adding a new type, but rather new 
	conventions. After all, the concept of "string" is only a matter 
	of conventions in Erlang!

Ah.

	For instance, I propose to represent a string as a 'string' 
	record:
	
	{string, 'utf-8', [$a, $B]}
	
Yi-yi-yi-yi-yi-yi-yi-YI!

I guess the big difference between us is that you think "preferred
encoding" is a good idea, and I think it makes absolutely no sense
whatever.  What looks to you like a help (and I honestly still do
not understand why) looks to _me_ like a crushing burden.

If there were a *single* common standard for encoding names, it might
make sense, except that the point of Unicode is to save programmers
(and programs) from having to know about such things.  Did I mention
that the character set which has to be called 'ISO8859-1' in a locale
name on my system is known to 'iconv' as '8859' (on one system, that
is; on another it's 'ISO8859-1' after all)?  And of course you know
that the preferred MIME name for this is 'ISO-8859-1' and the official
IANA name for it is 'ISO_8859-1:1987'... Then there is the Macintosh
Roman character set, known as 'MacRoman' on one system and 'mac' on
another.  (The IANA registry calls it 'macintosh' as if Macs only ever
had one character set; I have tables for 25.)  If your program contains
    {string, "mac", [Florin,Micro]}
and my system calls it 'MacRoman' or doesn't have Macintosh Roman in
its iconv tables under _any_ name, what happens?  (And I have such a lot
of old Mac files...)

Amongst other things, it makes no sense to me to regard the preferred
encoding of a Unicode string as a property of the *string* rather than
of a *use* of a string.  I want the string "aB" to be the *same* value
every time.

By the way, we DO agree that it will be important for people to be
able to take data in an external coding known to their system, decode
it to Unicode, process it, and then encode the result back into some
external coding known to their system.

Dealing with Unicode is hard.
Dealing with the multiplicity of character encodings with multiple names
for each is hard.
Separation of concerns is one of our main tools for coping with complexity.


