Strings (was: Re: are Mnesia tables immutable?)

Tue Jun 27 06:41:43 CEST 2006

"Ryan Rawson" <ryanobjc@REDACTED> wrote:
	There is a general perception that Erlang is no good at strings.

That perception is definitely mistaken.

	Part of the issue is the whole 'lists of integers' and people
	freak on the memory requirements.

It's just like the way that the credulous swallow the Da Vinci Code.
They just don't check for themselves.  (Brown has Langdon go into
raptures about ((1+sqrt(5))/2) and gets many of his facts so far wrong
that you'd think he was a whole government department.  But he does
actually _tell_ his students to measure their own
(head-floor)/(navel-floor), and any readers who _did_ that almost
surely found their own ratios weren't even close to phi.  And for bees
he is out by thousands.)

	The other part I think is the Unicode support.
	Care to speak to that?

Ah, Unicode.
I've struggled to provide Unicode support in one language and
written a draft library proposal for another.
It's quite frighteningly hard.  (Interesting point:  the C99 standard
does _not_ provide enough information about the current locale to
implement POSIX regular expressions, and POSIX itself doesn't provide
access to the information you need either.  Guess how come I found that
out?)  I'm not even sure what "support" for Unicode language tags (the
absence of which was one of the core design features originally, so it
defies belief that they added them) would begin to look like.

Simply telling when two sequences of Unicode codepoints represent the
"same" string is the very reverse of trivial.   Not that there is a
single definition.  Unicode has four "normal forms", so there are
five different definitions of "same", counting the trivial one, and
_not_ counting differences between versions of Unicode (I see Unicode 5
is on the horizon already).  And there isn't even a normal form with
the property that is_nf(Xs) and is_nf(Ys) => is_nf(Xs ++ Ys).  Yep,
concatenating two normalised strings (under any of the four normal forms) can
give you a result that is _not_ normalised.
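
Anyone who wants to check this rather than take it on trust can do so
in a few lines.  What follows is only a rough sketch, and it rests on
assumptions that are not part of the argument above: it uses Python's
standard unicodedata module (so the Unicode version is whatever that
Python build ships with), and the helper names is_nf, same and
concat_report, along with the sample strings, are made up for the
illustration.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def is_nf(form, s):
    """True if s is already in the given normal form."""
    return unicodedata.normalize(form, s) == s

def same(a, b, form=None):
    """Compare two strings; form=None is the trivial codepoint-wise test."""
    if form is None:
        return a == b
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Sample data (illustrative choices, not from the original post):
# precomposed e-acute vs. e + combining acute accent,
# and the "fi" ligature vs. the two plain letters.
composed, decomposed = "\u00e9", "e\u0301"
ligature, letters = "\ufb01", "fi"

# q + combining acute (combining class 230), then a lone combining
# dot below (combining class 220).  There is no precomposed q-acute,
# so xs is normalised in all four forms, and so is ys on its own.
xs, ys = "q\u0301", "\u0323"

def concat_report():
    """For each form: (is_nf(xs), is_nf(ys), is_nf(xs + ys))."""
    return {f: (is_nf(f, xs), is_nf(f, ys), is_nf(f, xs + ys)) for f in FORMS}

if __name__ == "__main__":
    for form in (None,) + FORMS:
        print(form, same(composed, decomposed, form),
              same(ligature, letters, form))
    for form, flags in concat_report().items():
        print(form, flags)
```

The two e-acute spellings differ under the trivial test and agree under
all four normal forms, while the ligature matches "fi" only under the
two compatibility forms.  In the concatenation case, xs and ys are each
normalised, but canonical ordering wants the dot below to come before
the acute, so xs + ys fails the check in every form.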

Mind you, Unicode support in C and C++ is extremely weak too, unless
you use the Taligent/IBM International Components for Unicode (icu4c,
icu4j, see icu.sourceforge.net).