[erlang-questions] Erlang basic doubts about String, message passing and context switching overhead

zxq9 zxq9@REDACTED
Tue Jan 31 09:52:08 CET 2017


On Tuesday, 31 January 2017 at 07:43:05 you wrote:
> zxq9/Craig,
> 
> > While unicode libs often cover many of the latinish cases they simply ignore any instances where "case" is not a binary concept -- and those are the majority in Asia.
> 
> Since I only know languages written with Latin characters, could you just give some examples of what conversions you mean and in what circumstances they are needed?
> For example I don't see/understand why 32 and the big ３２ aren't just a font size issue. :-)
> 
> I fully share your frustration regarding protocols with header strings that are allowed to have different casing.

Hi, Ola.

32 and ３２ are totally different codepoints, not just a font size issue.

11> $3.
51
12> $2.
50
13> $３.
65299
14> $２.
65298

So "32" is [51,50] while "32" is [65299,65298]. When accepting user input or reading a page from a book or interpreting a webpage or data stored in a database or a player profile that is arriving as JSON or a chat room name or whatever -- any case where you may need to unambiguously interpret a string as an integer -- you need to know the difference. These are NOT font differences, these are actually different characters (codepoints, really) that have the same value to humans but are not interpreted as the same value by the computer.

So which one of these is "upper case"?

Neither.

There are several different sets of numeric glyphs in Unicode that all mean 0-9 to humans. Which one is the "right" one? How do I verify totally valid user input in this environment? Why should programs written to work in Russia accept text that represents numbers there regardless of what input mode the user is in, while Japanese users (among many others) are required to input only "half-width" numerals? Not only that: internationalizing any user-facing program requires not just that the input fields say "Enter your Foo Number" in the correct language, but also that the whole interface display an extra, special message that appears only when the user's locale happens to be one of the special input locales that have full-width numerals in addition to the standard ASCII values.
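
As a rough sketch of what every application ends up hand-rolling just for the two digit sets above (the names are mine, and a real validator would also have to decide what to do about Arabic-Indic, Devanagari and the rest):

%% Fold full-width digits (U+FF10..U+FF19) onto ASCII before parsing.
%% Anything outside the two accepted digit sets fails loudly.
to_integer(String) ->
    list_to_integer(lists:map(fun normalize_digit/1, String)).

normalize_digit(C) when C >= 16#FF10, C =< 16#FF19 -> C - 16#FEE0;
normalize_digit(C) when C >= $0,      C =< $9      -> C;
normalize_digit(C)                                 -> error({not_a_digit, C}).

With that, to_integer("32") and to_integer("３２") both return 32.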

See the problem?

That is just one problem. The lack of actual script casting -- as opposed to only the special case of upper() and lower() -- means that I cannot use any unicode library function to compare two exactly equivalent strings that represent a user's name in sound-spelling. There is no unicode:compare("たろう", "タロウ"), for example, even though they are directly equivalent, codepoint for codepoint. This is directly analogous to performing a case-insensitive comparison like compare("foo", "FOO") -- but it simply is not possible in Japanese without writing a library for it. There are indirect equivalencies as well, such as compare("ホンダ", "ﾎﾝﾀﾞ") -- one is a list of three codepoints, the other a list of four codepoints, but they are directly equivalent and common to receive as input from users or databases. In the exact same way, Japanese libraries typically have a romanized indirect comparison for, say, compare("honda", "ほんだ").
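
A kana-insensitive compare is mechanical in the same way a case-insensitive one is -- the katakana block sits 16#60 codepoints above the hiragana block -- but nothing stock provides it, so you end up writing something like this sketch (my names; half-width katakana and combining marks deliberately omitted):

%% Fold katakana (U+30A1..U+30F6) onto hiragana (U+3041..U+3096) so
%% that directly equivalent kana spellings compare equal.
kana_equal(A, B) ->
    fold_kana(A) =:= fold_kana(B).

fold_kana(String) ->
    [if C >= 16#30A1, C =< 16#30F6 -> C - 16#60; true -> C end
     || C <- String].

With that, kana_equal("たろう", "タロウ") returns true.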

Kanji comparisons typically exist only in heavy-duty text processing and text search libraries, and can only give a probability match, or a boolean possible-match based on a table of possible character readings -- I would never expect kanji comparisons in a simple unicode library. Also, proper names of people and places are notoriously difficult for anything but the human brain to disambiguate (and even then you only know what you already know -- our minds aren't yet equipped with wifi).

An issue we haven't touched on yet is that unicode allows characters to be composed or explicit. The glyph we think of as "だ", for example, may be the single codepoint "だ" (which is the most common to encounter) or the character "た" followed by the voiced sound mark "゛" as a combining codepoint that tells the system reading it that the two belong together. Once again, they are directly equivalent, are legal input from users or non-human data systems, and SHOULD compare as equal, but they are totally different in representation -- just because. And both are legal unicode.
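
This is what Unicode calls NFC ("composed") normalization. A minimal hand-rolled sketch covering only the た/だ pair (a real version needs the full composition tables):

%% Compose base character + combining voiced sound mark (U+3099)
%% before comparing. Only た (U+305F) -> だ (U+3060) is handled here.
compose([16#305F, 16#3099 | Rest]) -> [16#3060 | compose(Rest)];
compose([C | Rest])                -> [C | compose(Rest)];
compose([])                        -> [].

After composing both sides, the two spellings of "だ" finally compare equal.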

This difference is prevalent in intermediate Chinese user input, because several stroke (radical) based input systems exist alongside complete-glyph systems -- the partials are typically resolved to single characters before arriving at your program, but not always. In Korean, characters can be composed or explicit as well, and since the entire hangul writing system is based on composition, and the keyboard is basically split down the middle between the vowel-ish keys and the consonant-ish keys, providing partial-input search is predicated on interpreting those partials.
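
The hangul arithmetic, at least, is fully specified: every precomposed syllable in U+AC00..U+D7A3 factors into lead/vowel/tail jamo, which is what makes partial-input matching possible at all. A sketch of the standard decomposition (the function name is mine):

%% Algorithmic hangul decomposition per the Unicode standard.
%% 588 = 21 vowels * 28 tail positions (including "no tail").
decompose_hangul(S) when S >= 16#AC00, S =< 16#D7A3 ->
    I = S - 16#AC00,
    L = 16#1100 + I div 588,
    V = 16#1161 + (I rem 588) div 28,
    case I rem 28 of
        0 -> [L, V];
        T -> [L, V, 16#11A7 + T]
    end.

For example, decompose_hangul($한) gives [4370,4449,4523] -- the lead ᄒ, vowel ᅡ, and tail ᆫ that a partial-input search needs to match against.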

In practice we can often control the situation just enough to get at least some useful work done, but that doesn't change the fact that the situation is actually a bit nightmarish and silly, and we shouldn't be reinventing this particular wheel all the time in 2017. That users have become accustomed to expecting all data systems to be totally screwed up and stupid is a powerful indictment of the art of text processing today. Japanese users simply accept that this is just how the world is -- at least the world outside of video games and closed platforms.

That's also why you don't see a lot of Japanese business software outside Japan, and rarely see any of the popular American business programs inside Japan -- and internationalization of games is hard enough that quite a few of them are intended for the domestic market or the international market, but usually not both at the same time. The mess of data issues is just a big pile of poo and often isn't worth dealing with outside of mega projects.

Contemplate that last bit for a while. This is really a stupid reason to constrain the software market, but that's just how things are -- mostly because very few string comparison or conversion libs actually do script casting and instead focus on this one special case of upper() and lower().

(Don't get me started on calendar libraries, or on alternative numeric notations that have totally unambiguous representations...)

Anyway, I understand the situation and have simply had to write libraries to deal with this in several programming environments already -- I have exactly zero hope that anything will get better in the future. The annoying part is doing it over and over, because most of the things I write wind up becoming proprietary, so the next time I go somewhere else I write the same stuff again (often there is a prohibition against using FOSS, and I am usually not very excited about writing MIT-licensed stuff for free on github between projects when I already have a gigantic personal backlog).

It should SURPRISE me when a built-in upper() or lower() function in a language operates on more than just ASCII, since those functions exist for the purpose of text protocol interpretation. If an upper() or lower() function DOES work on, say, Russian, then I would also expect it to work on Greek, and when that isn't true it is odd. When that does not carry over to, say, Hebrew print/script character conversions and comparisons, or hiragana/katakana conversions and comparisons, I'm just disappointed, because it seems the libs arbitrarily support language X with special cases [X1, X2, X3, ...] but not some other language.

Hopefully I explained that adequately.

-Craig

