[erlang-questions] Erlang basic doubts about String, message passing and context switching overhead

Wed Jan 11 22:28:16 CET 2017

> Not to start a holy war, but Unicode is not complex, it just has a lot of
> tables.

Wrong.  Unicode really is hugely complex.

John Wheeler is quoted as saying
 "If you are not completely confused by quantum mechanics,
  you do not understand it."
Now apply this to Unicode:
  If you think you understand Unicode, you don't.

There are language tags.
(This is something that was explicitly ruled out of the Unicode
design, then came back in after all.)

There are variation selectors.

Unicode has support not just for text of varying
directionality, but for *mixed* directionality,
with the result that interpreting a Unicode string
requires maintaining a direction stack, with characters
like POP DIRECTIONAL FORMATTING.

I am not saying that you have to do this yourself;
what I am saying is that chopping up a Unicode string
is in general not a meaningful operation.
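
To make the direction stack concrete, here is a minimal sketch in Python
(my choice of language for illustration; the function name is hypothetical).
It only tracks how deeply the explicit embedding and override characters
nest, using the bidi classes reported by the standard `unicodedata` module.
It is nowhere near the full bidirectional algorithm, which also deals with
isolates, implicit levels and depth limits.

```python
import unicodedata

# Bidi classes of the explicit embedding/override characters that push a level.
PUSH = {"LRE", "RLE", "LRO", "RLO"}

def max_embedding_depth(text):
    """Return the deepest nesting of explicit directional embeddings.

    Illustration only: the real bidirectional algorithm handles far more.
    """
    depth = deepest = 0
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in PUSH:
            depth += 1
            deepest = max(deepest, depth)
        elif cls == "PDF" and depth > 0:  # POP DIRECTIONAL FORMATTING
            depth -= 1
    return deepest
```

A slice taken from the middle of such a string can easily contain a pop
without its push (or the reverse), which is one reason chopping is not
meaningful.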

Another example: there's a set of characters (the Ideographic
Description Characters) that work like this:
  take next two trees and paste them horizontally
  take next three trees and paste them horizontally
  take next two trees and paste them vertically
  take next three trees and paste them vertically
used for approximating Chinese characters not yet
supported (and yes, these things do nest).

This means that in order to move forward one "character"
in a string, it is necessary to parse a tree (no, a
regular expression cannot do this; these things are
*nested* and regular expressions can't match nested
brackets).
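
As a rough sketch of what "parse a tree" means here, the following Python
function skips over one such description. It assumes the four operators
described above are U+2FF0 and U+2FF1 (two operands) and U+2FF2 and U+2FF3
(three operands); the function name is hypothetical, and the other
description operators that Unicode defines are deliberately ignored.

```python
# Arity of the four operators described above (assumed code points):
# U+2FF0 two side by side, U+2FF2 three side by side,
# U+2FF1 two stacked,      U+2FF3 three stacked.
ARITY = {"\u2ff0": 2, "\u2ff2": 3, "\u2ff1": 2, "\u2ff3": 3}

def skip_description(s, i=0):
    """Return the index just past one description sequence starting at s[i]."""
    if i >= len(s):
        raise ValueError("truncated description sequence")
    arity = ARITY.get(s[i])
    i += 1
    if arity is not None:  # an operator: skip each operand subtree in turn
        for _ in range(arity):
            i = skip_description(s, i)
    return i
```

The recursion is the point: each operand may itself start with an operator,
so the amount of nesting is unbounded.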

Did I mention the rules for emoji?  Look at the rules
for emoji and weep.

I repeat: moving forwards or backwards ONE
user-oriented character in a Unicode string is HARD.
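
A small Python illustration of the gap between code points and
user-oriented characters (the variable names are just for this example).
The family emoji below is one thing to the user but seven code points,
joined by ZERO WIDTH JOINER; the flag is two code points.

```python
# Man, woman, girl, boy, joined by U+200D ZERO WIDTH JOINER.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
# A pair of regional indicator symbols, displayed as a single flag.
flag = "\U0001F1F3\U0001F1FF"

print(len(family))   # 7
print(len(flag))     # 2
# Taking "the first character" by code point cuts the family apart:
print(family[:1] == "\U0001F468")   # True
```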

> And pretty much any modern language has immensely better Unicode
> support built in than Erlang.

Pretty much every modern language is still a moving target
in this area; Unicode support in modern Fortran and modern
COBOL and modern C is not that great.  (ICU4C is not part
of any C standard.)

One of the things that make Unicode difficult is that things
keep *changing*.  They try very hard to keep things stable,
but characters do still change class from time to time
(what was once an upper case letter may become a sign, for
one example).

I had code for moving backwards and forwards that took into
account the difference between base characters and floating
diacriticals.  It *didn't* take variation selectors into account.
(Because when I wrote the code there weren't any.)
Nor did I handle language tags.
(Because at the time language tags had been ruled out forever.)
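
For illustration only (this is not the code described above, just a sketch
in Python with a hypothetical function name), a base-plus-diacriticals
stepper can be written with the combining classes from the standard
`unicodedata` module. It shows the same gap: a variation selector has
combining class 0, so this approach treats it as a new base character.

```python
import unicodedata

def next_boundary(s, i):
    """Index just past the base character at s[i] and any combining marks after it."""
    i += 1
    while i < len(s) and unicodedata.combining(s[i]) != 0:
        i += 1
    return i

# "e" followed by COMBINING ACUTE ACCENT is stepped over as one unit.
print(next_boundary("e\u0301x", 0))        # 2
# HEAVY BLACK HEART followed by VARIATION SELECTOR-16 is split in two.
print(next_boundary("\u2764\ufe0fx", 0))   # 1
```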