[erlang-questions] string:lexeme/s2 - an old man's rant

Thu May 9 14:28:03 CEST 2019

On Wed, 8 May 2019 10:53:25 +1200
"Richard O'Keefe" <raoknz@REDACTED> wrote:

> For what it's worth, in Unicode, Line Separator and
> Paragraph Separator are the recommended characters, with
> CR, LF, CR+LF, and of arguably NEL (U+0085) being
> "legacy".

Does that matter in a function not called `uc_lines` or
such?

> Again for what it's worth, Unicode defines an algorithm
> for breaking text into word( token)s.

"a=1&b=2&c=me+tomorrow"

"b=2" is no word, would UC call that a "token"? and if so,
would or should that matter to the user?
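
To make the question concrete, here is a minimal sketch (in Python, purely for illustration; the helper name `split_pairs` is hypothetical) of delimiter-based splitting. "b=2" comes out as a token only because the caller chose "&" as the separator, not because any word-breaking rule says so. Empty fields are dropped, which, as far as I understand it, is also how `string:lexemes/2` treats adjacent separators.

```python
def split_pairs(query, sep="&"):
    """Split on a caller-chosen separator, dropping empty fields."""
    tokens = [t for t in query.split(sep) if t]
    # partition() splits each token at its first "=" only.
    return [(k, v) for k, _, v in (t.partition("=") for t in tokens)]


print(split_pairs("a=1&b=2&c=me+tomorrow"))
```

Nothing here knows or cares whether a piece is a "word"; the separator alone defines the tokens.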

I would say that UC *is* an algorithm and no mere encoding
anymore. My impression is it has taken some wrong turns and
is now rolling down some strange hill, driven by its own
weight. It seems to push any available metadata onto each
character and then disunifies them in a way that makes
every single character reflect half of its context, without
ever asking: who is ever going to enter all of these
correctly, even if the glyphs for them in a font happen to
be distinct? In this context I often picture a Norwegian
professor at the blackboard writing in Norsk (ø, Ø) about
empty sets (∅) and average (⌀) diameters (⌀) .... And then
again it does not, as 'average' and 'diameter' are the
same ...
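
For anyone who wants to check the blackboard, a small sketch (Python, for illustration only; `describe` is a made-up helper) that prints the code point and the Unicode character name of each look-alike. Escapes are used instead of literal characters so the result does not depend on the mail encoding. I am assuming the four characters above are U+00F8, U+00D8, U+2205 and U+2300.

```python
import unicodedata

# Assumed code points for the four look-alike characters.
LOOKALIKES = ["\u00f8", "\u00d8", "\u2205", "\u2300"]


def describe(ch):
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in LOOKALIKES:
    print(describe(ch))
```

The two letters and the empty set each get their own code point, while "average" and "diameter" share U+2300.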

And combining stuff requiring "canonisation"(?) and
allowing funny things like "◌̈ø" and utter rubbish "◌̈å" ...
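
A rough sketch of what that means in practice, again in Python and only as an illustration (`code_points` is a hypothetical helper; I am assuming the term I was groping for is what Unicode calls normalization, here NFC and NFD from the standard `unicodedata` module). A base letter plus a combining mark collapses into one code point where a precomposed character exists; the funny stacks stay as sequences because no precomposed form exists for them.

```python
import unicodedata


def code_points(s):
    """List the code points of s as 'U+XXXX' strings."""
    return [f"U+{ord(c):04X}" for c in s]


def nfc(s):
    """Normalize s to NFC (canonical composition)."""
    return unicodedata.normalize("NFC", s)


# "a" + COMBINING RING ABOVE composes to the single code point U+00E5.
print(code_points(nfc("a\u030a")))
# U+00F8 has no canonical decomposition, so NFD leaves it untouched.
print(code_points(unicodedata.normalize("NFD", "\u00f8")))
# A diaeresis stacked on either letter stays a two-code-point sequence.
print(code_points(nfc("\u00f8\u0308")))
print(code_points(nfc("a\u030a\u0308")))
```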

And please tell the Maaori to replace "wh" with "f", "ng"
with "g" and, as some already do, macrons with doubled
vowels. ;-)

Simplify, simplify, simplify; things get complex more than
enough all on their own ... no, wait, things have already
got ...

~Michael

-- 

That which was said, is not that which was spoken,
but that which was understood; and none of these
comes necessarily close to that which was meant.