[erlang-questions] String encoding and character set
Richard A. O'Keefe
Mon Jan 22 06:13:44 CET 2007
I pointed out that the presence of floating diacriticals makes
the ordinary word "character" ambiguous when applied to Unicode.
dda <headspin@REDACTED> wrote:
Those "floating diacritics" are handled through Normalisation.
No, they are not. Normalisation means that each combination of
base character plus diacriticals will be encoded in one and only one
way (just _which_ way depends on which of several different kinds of
normalisation you choose), but it does not and cannot eliminate any
and every floating diacritical.
For example, consider the sequence
U+0028 LEFT PARENTHESIS
U+0302 COMBINING CIRCUMFLEX ACCENT
This will *display* as a single "character" (equivalent to "\^(" in TeX)
but it is two *codepoints* and Normalisation will do NOTHING about this,
because these codepoints are already in the right order and there is no
precomposed left parenthesis with circumflex that the sequence could be normalised to.
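A quick check (sketched here in Python, using the standard `unicodedata` module) confirms this: since Unicode defines no precomposed codepoint for a left parenthesis with circumflex, NFC normalisation leaves the two-codepoint sequence untouched.

```python
import unicodedata

# U+0028 LEFT PARENTHESIS followed by U+0302 COMBINING CIRCUMFLEX ACCENT
s = "\u0028\u0302"

# NFC composes base + combining pairs only where a precomposed
# codepoint exists; there is none for "(" with circumflex.
nfc = unicodedata.normalize("NFC", s)

print(len(nfc))   # still 2 codepoints
print(nfc == s)   # normalisation changed nothing
```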
An intelligent set of string functions should be able to normalize
Indeed it should. But in those *normalised* strings some characters will
be one codepoint and many will NOT, and there is no easy way to tell which
are which without looking at each.
and extract a character correctly whether it was originally
encoded on one or two codepoints.
Originally shmoriginally. That doesn't matter. *AFTER* normalisation
some characters will require more than one codepoint, and anyone who
doesn't understand that does not yet understand Unicode.
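To illustrate (again a Python sketch with the standard `unicodedata` module): after NFC normalisation, "e" plus a combining acute composes to one codepoint, but "q" plus a combining circumflex cannot, because no precomposed form for it exists. You cannot tell which case you have without examining the codepoints.

```python
import unicodedata

one = unicodedata.normalize("NFC", "e\u0301")  # e + combining acute
two = unicodedata.normalize("NFC", "q\u0302")  # q + combining circumflex

print(len(one))  # 1: composed to U+00E9
print(len(two))  # 2: no precomposed q-with-circumflex exists
```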
Parsing only with regexes is slow
Compared with what? And be sure to compare apples with apples:
regular expressions *can* be compiled to native code and in my view
*should* be. Most regular expression implementations compile to some
sort of byte code, but that's not the fault of regular expressions as such.
even on languages that have fast-ish regexes [Erlang's alas not
in this case].
Be specific. Which languages (more precisely, which libraries) do you have
in mind? Bearing in mind that there is no reason whatever why they
*couldn't* compile to native code, which ones did you measure that do?
While I do like regexes, there's a lot that can be done faster
with dedicated string manipulation functions.
The last time I was able to enumerate all the programming languages I
knew, the list ran to about 200. I have met remarkably few with tolerable
"dedicated string manipulation functions". I still dwell with glee on
the performance comparisons I did between Xerox Quintus Prolog (using
2-word list cells and definite clause grammars for string manipulation)
and Xerox Interlisp-D (using 1 byte per character and with special helper
microcode). The general mechanism was *faster*. (In retrospect, the main
reason was probably that using DCGs for parsing instead of string functions
meant a whole LOT less copying.)
I repeat that
> The fundamental operations on strings are
> (1) decode binary to string using some encoding
> (2) encode string to binary using some encoding
> (3) compare using locale- and application-appropriate rules
===> > (4) parse, typically using regular expressions
===> > (5) unparse
There is no "one size fits all" string representation. Some make
parsing easy, some make unparsing easy.
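Operations (1), (2), (4) and (5) can be sketched in Python (the key/value format and the choice of UTF-8 are illustrative, not prescribed):

```python
import re

data = b"name=Erlang;year=1986"

# (1) decode binary to string using some encoding
text = data.decode("utf-8")

# (4) parse, here using a regular expression
pairs = dict(re.findall(r"(\w+)=(\w+)", text))
print(pairs)  # {'name': 'Erlang', 'year': '1986'}

# (5) unparse: rebuild the textual form from the parsed data
unparsed = ";".join(f"{k}={v}" for k, v in pairs.items())

# (2) encode string back to binary
out = unparsed.encode("utf-8")
print(out == data)  # True (dict preserves insertion order)
```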