[erlang-questions] Erlang basic doubts about String, message passing and context switching overhead

zxq9 zxq9@REDACTED
Tue Feb 7 11:26:18 CET 2017


On Tuesday, 7 February 2017 18:09:24 Richard A. O'Keefe wrote:
> 
> On 31/01/17 9:52 PM, zxq9 wrote:
> 
> > That is just one problem. The lack of actual script casting VS only the
> > special case of upper() and lower() means that I cannot use any unicode
> > library function to compare two exactly equivalent strings that represent
> > a user's name in sound-spelling.
>
> Can you clarify "sound-spelling" here?
> 
> Since the surname "Menzies" is, for example, pronounced something like
> "minnies" in Scotland but "menzees" in Australia, I'm not sure how far
> "sound-spelling" would take us for Anglophone names.
> (There are plenty of other examples.)
> 
> For that matter, my mother's father's surname was Covič but in this
> country everyone pronounced it as if it was "Covick" so he and his
> brother, with the same surname, ended up pronouncing it differently.
> 
> I guess my point is that it's hard enough to tell when two names with
> the *same* spelling sound the same that I am in complete awe of anyone
> who manages to do a good-enough job telling when two *differently*
> spelled names sound the same.  Do you use a massive locale-dependent
> dictionary, or what?

The specific case I am addressing above is one that arises in East Asian
languages that have 1:1 equivalence among different phonetic scripts.

For example, the letter "ka" is か in hiragana, カ in katakana, ｶ in
half-width katakana, and literally "ka" or "KA" in romaji. The possible
kanji characters that could be pronounced with this phoneme are not relevant
to the simple case of phonetic comparison. In Japanese this is one letter.
The case of "ga" gets more complex, because it is が, ガ and ｶﾞ, respectively,
and this last one is two characters representing the single letter, in the
same way that "ga" and "GA" are two letters in roman script representing this
phoneme. So the rule here is a touch more complex, but not by much.
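
To make the kana part of that rule concrete, here is a rough sketch in
Python (used purely for illustration, not Erlang). It assumes that Unicode
NFKC normalization is acceptable for the input, since NFKC also rewrites
other compatibility characters such as full-width Latin letters. The name
kana_fold is hypothetical and not taken from any library, and romaji input
is deliberately left out because it needs a transliteration table.

```python
import unicodedata

# Full-width katakana letters U+30A1..U+30F6 sit exactly 0x60 above
# their hiragana counterparts U+3041..U+3096.
KATAKANA_START = 0x30A1
KATAKANA_END = 0x30F6
KANA_OFFSET = 0x60


def kana_fold(text):
    """Hypothetical helper: fold kana to full-width hiragana.

    Assumption: NFKC is acceptable for the input. It widens half-width
    katakana and composes a base letter plus voiced sound mark into a
    single code point, but it also changes other compatibility characters.
    """
    normalized = unicodedata.normalize("NFKC", text)
    folded = []
    for ch in normalized:
        cp = ord(ch)
        if KATAKANA_START <= cp <= KATAKANA_END:
            # Shift a katakana letter down to its hiragana counterpart.
            ch = chr(cp - KANA_OFFSET)
        folded.append(ch)
    return "".join(folded)
```

With this, the hiragana, katakana and half-width spellings of "ga" all fold
to the same string, so the folded form could serve as a comparison or
storage key. Characters outside the katakana letter range, such as the long
vowel mark, pass through unchanged.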

Any competent Japanese library will be able to tell me that these are the
same, and provide a way to script-cast them to a single form for storage
in a database for string comparison so that things like names (of persons,
places, companies, etc.) can be searched phonetically without having to
know the specific kanji characters in the name. If someone tells you her
name is "Anna" you can't really know whether it is "Anna" or "アンナ" or
"あんな" or "安奈" or whatever -- so it is a universal feature of Japanese
programs that at least two input fields are provided for any name input
(and sometimes three) in the following order:
- ALWAYS there will be the phonetic spelling of the name which can be
  given in hiragana, katakana, or whatever as long as it adheres to one of
  the phonetic systems. Usually called "furigana" or "yomigana".
- USUALLY there will be a field for the canonical presentation.
- OPTIONALLY there will be a field for the canonical Romanization
  (which may differ from the rules of a transliteration system).

It is the phonetic spelling that is used for searches and requires
canonicalization. There is direct equivalence here, almost a perfect analog
to case-insensitive text comparison, and this is not at all supported by
libraries that stop at naive upper/lower casing of roman/greek/cyrillic
based scripts.

That is what I meant about upper/lower being just two special cases of
script-casting.

The kanji equivalence issue is very different and would never be expected
in a vanilla unicode library (because it is hard, language dependent, and
often regionally dependent as well).

-Craig


