Strings (was: Re: are Mnesia tables immutable?)

Tue Jun 27 07:21:54 CEST 2006

On 6/26/06, Richard A. O'Keefe <ok@REDACTED> wrote:
> "Ryan Rawson" <ryanobjc@REDACTED> wrote:
>         There is a general perception that Erlang is no good at strings.
>
> That perception is definitely mistaken.

It's called "education", I believe.  The best part about this list is
that it is archived online, and a Google search for 'erlang
<some problem>' very often returns results from this list archive.
Meaning your replies create future knowledge for young'uns. :-)

>
>         Part of the issue is the whole 'lists of integers' and people
>         freak on the memory requirements.
>
> It's just like the way that the credulous swallow the Da Vinci Code.
> They just don't check for themselves.  (Brown has Langdon go into
> raptures about ((1+sqrt(5))/2) and gets many of his facts so far wrong
> that you'd think he was a whole government department.  But he does
> actually _tell_ his students to measure their own
> (head-floor)/(navel-floor), and any readers who _did_ that almost
> surely found their own ratios weren't even close to phi.  And for bees
> he is out by thousands.)
>
>         The other part I think is the Unicode support.
>         Care to speak to that?
>
> Ah, Unicode.
> I've struggled to provide Unicode support in one language and
> written a draft library proposal for another.
> It's quite frighteningly hard.  (Interesting point:  the C99 standard
> does _not_ provide enough information about the current locale to
> implement POSIX regular expressions, and POSIX itself doesn't provide
> access to the information you need either.  Guess how come I found that
> out?)  I'm not even sure what "support" for Unicode language tags (the
> absence of which was one of the core design features originally, so it
> defies belief that they added them) would begin to look like.
>
> Simply telling when two sequences of Unicode codepoints represent the
> "same" string is the very reverse of trivial.   Not that there is a
> single definition.  Unicode has four "normal forms", so there are
> five different definitions of "same", counting the trivial one, and
> _not_ counting differences between versions of Unicode (I see Unicode 5
> is on the horizon already).  And there isn't even a normal form with
> the property that is_nf(Xs) and is_nf(Ys) => is_nf(Xs ++ Ys).  Yep,
> concatenating two normalised strings (any of the four definitions) can
> give you a result that is _not_ normalised.
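
For anyone who finds this in the archive later, here is a minimal sketch
of the concatenation point above.  It is in Python rather than Erlang,
only because Python's standard library ships unicodedata.normalize.
is_nf mirrors the name used in the quoted text; concat_breaks and the
two sample pairs are my own illustrative choices, not anything from a
real library.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def is_nf(form, s):
    """True if s is already in the given Unicode normal form."""
    return unicodedata.normalize(form, s) == s


def concat_breaks(form, xs, ys):
    """True if xs and ys are both normalised but xs + ys is not."""
    return is_nf(form, xs) and is_nf(form, ys) and not is_nf(form, xs + ys)


# Composition case: "e" followed by COMBINING ACUTE ACCENT (U+0301).
# Each piece is normalised on its own, but the composing forms
# (NFC, NFKC) turn the concatenation into the single codepoint U+00E9.
COMPOSE = ("e", "\u0301")

# Reordering case: "a" + acute (combining class 230) followed by
# COMBINING DOT BELOW (U+0323, combining class 220).  The decomposing
# forms (NFD, NFKD) require the marks sorted by combining class, so the
# concatenation has to be reordered to a, U+0323, U+0301.
REORDER = ("a\u0301", "\u0323")

if __name__ == "__main__":
    for form in FORMS:
        for label, (xs, ys) in (("compose", COMPOSE), ("reorder", REORDER)):
            print(form, label, concat_breaks(form, xs, ys))
```

The same is_nf helper also shows why there are several notions of
"same": two strings can compare equal after normalising to one form and
still differ as raw codepoint sequences.
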
>
> Mind you, Unicode support in C and C++ is extremely weak too, unless
> you use the Taligent/IBM International Components for Unicode (icu4c,
> icu4j, see icu.sourceforge.net).

Then perhaps icu4c should be used as the basis for providing Erlang
Unicode support?  I'm not really a Unicode expert, so I don't know
what is involved, but if a library provides good support, then it can
be the core of an Erlang library, no?

-ryan