[erlang-questions] Erlang basic doubts about String, message passing and context switching overhead

Richard A. O'Keefe ok@REDACTED
Tue Jan 17 23:57:56 CET 2017



On 17/01/17 8:17 PM, Michał Muskała wrote:
>
> I just checked and all: Python 3, Ruby, Java, C# (so probably F# as
> well) and JavaScript properly uppercase the character "ł" using the
> built-in functions, without any additional libraries.
>

I didn't use the word "additional".
The word *additional* is your word, and is entirely
foreign to what I was talking about.

Those *ARE* libraries that they are using.

I have the source code for the Java 1.8 libraries.
java-1.8-src/java/lang/Character.java
   public static char toUpperCase(char ch)
     starts at line 6368.  This is a library method
     in a library class.
   public static int toUpperCase(int ch)
     starts at line 6397.  This is a library method
     in a library class.
     It calls a method in CharacterData.

java-1.8-src/java/lang/CharacterData.java
   is an abstract class which dispatches to one of 7
   subclasses.

java-1.8-src/java/lang/CharacterData00.java
   handles the Basic Multilingual Plane.
   int toUpperCase(int ch)
     starts at line 243.  This is a library method
     for a library method in a library class for a
     library class.  It does some bit fiddling and
     has a fairly large switch to handle exceptions.
     If new characters are added to Unicode that
     need to be exceptions, the code will need
     rewriting and recompiling.

java-1.8-src/java/lang/String.java
   public String toUpperCase(Locale locale)
     starts at line 2721.  The code is rather
     entertaining, if you are entertained by
     H. P. Lovecraft.  For example, *every*
     time you call this method, it fetches the
     name of the locale and checks to see if it
     is "tr" or "az" or "lt".  If another language
     with case conversion quirks is added, the
     code will need rewriting and recompiling.
     The locale-dependent code keeps on calling
     out to a method in ConditionalSpecialCasing.
     Well, actually two methods get called for
     each character in general: one to get a
     single character, one to get an array of
     characters.  At any rate, this is a library
     method in a library class.
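
To see from the outside why that method cares about the locale, here is a minimal sketch. The class LocaleUpperDemo and its helper upper are made up for this illustration; the only library calls are String.toUpperCase(Locale) and Locale.forLanguageTag. Non-ASCII characters are written as \u escapes so the result does not depend on the source file encoding.

```java
import java.util.Locale;

class LocaleUpperDemo {
    // Hypothetical helper: upper-case s under the locale named by a
    // language tag such as "en", "tr", "az" or "lt".
    static String upper(String s, String languageTag) {
        return s.toUpperCase(Locale.forLanguageTag(languageTag));
    }

    public static void main(String[] args) {
        // U+0142 (l with stroke) maps to U+0141 in an ordinary locale.
        System.out.println(upper("\u0142", "en").equals("\u0141"));
        // In Turkish, "i" maps to U+0130 (capital I with dot above).
        System.out.println(upper("i", "tr").equals("\u0130"));
        // One character can become several: U+00DF (sharp s) becomes "SS".
        System.out.println(upper("\u00DF", "en").equals("SS"));
    }
}
```

The last case is presumably why there is one call that returns a single character and another that returns an array of characters.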

No, I am not going to dig into ConditionalSpecialCasing.

These aren't *additional* libraries any more than
the unicode module in Erlang is an *additional*
library in Erlang.  The Java library classes are
plain Java code compiled by the plain Java compiler,
which knows nothing about how they work.  The
Erlang modules for handling text are plain Erlang
code compiled by the plain Erlang compiler, which
knows nothing about how they work.

Do note, though, that an abstract class with 7
subclasses and another concrete class exist only
to support Character and String; users normally
should not use those supporting classes directly.

Note also that while CharacterData{00,01,02,0E,Latin1}
were generated automatically, so are presumably up to
date, ConditionalSpecialCasing, which handles Greek,
Turkish, Azeri, and Lithuanian, was *not*, and its
date is 2013.  So it may well not be up to date.
The same is true of Character (2013).  I have no
confidence that it's up to date.  And while
ConditionalSpecialCasing has data for Greek, I note
that String doesn't consider Greek to be one of the
languages that needs it...

It really is *useful* for Unicode processing to be
in library files that can be automatically regenerated
from the Unicode data base.
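
As a rough sketch of what "regenerated from the Unicode data base" can mean, the following builds a simple upper-case table from lines in the UnicodeData.txt layout: semicolon-separated fields, with the code point in field 0 and the simple upper-case mapping in field 12. The class UpperTableSketch, the method simpleUpper, and the embedded SAMPLE lines are assumptions for illustration; this is not how the JDK's CharacterData classes are generated.

```java
import java.util.HashMap;
import java.util.Map;

class UpperTableSketch {
    // A few sample lines in the UnicodeData.txt field layout.
    static final String[] SAMPLE = {
        "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
        "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
        "0142;LATIN SMALL LETTER L WITH STROKE;Ll;0;L;;;;;N;"
            + "LATIN SMALL LETTER L SLASH;;0141;;0141",
    };

    // Map each code point to its simple upper-case mapping, skipping
    // lines whose field 12 is empty (no simple mapping).
    static Map<Integer, Integer> simpleUpper(String[] lines) {
        Map<Integer, Integer> table = new HashMap<>();
        for (String line : lines) {
            // A limit of -1 keeps trailing empty fields.
            String[] f = line.split(";", -1);
            if (f.length < 15 || f[12].isEmpty()) continue;
            table.put(Integer.parseInt(f[0], 16),
                      Integer.parseInt(f[12], 16));
        }
        return table;
    }
}
```

When a new Unicode version comes out, a table built this way is refreshed by rerunning the generator on the new data file, with no hand-edited switch to maintain.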


If you want to really understand case conversion in
Unicode, spend a couple of hours figuring out
exactly how String.toUpperCase() is done in Java.
It's so hairy that they spend an extra pass over
the string trying NOT to do it.
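
The shape of that extra pass can be sketched as follows. This is a deliberately simplified illustration, not the JDK code: the class UpperScanSketch and the method toUpper are made up, and it uses only Character.toUpperCase(int), so it ignores locale-specific rules and one-to-many mappings.

```java
class UpperScanSketch {
    static String toUpper(String s) {
        final int len = s.length();
        int first = 0;
        // Pass 1: find the first code point that would change.
        while (first < len) {
            int cp = s.codePointAt(first);
            if (Character.toUpperCase(cp) != cp) break;
            first += Character.charCount(cp);
        }
        // Nothing changes: hand back the original string untouched.
        if (first == len) return s;
        // Pass 2: copy the unchanged prefix, then convert the rest.
        StringBuilder sb = new StringBuilder(len);
        sb.append(s, 0, first);
        for (int i = first; i < len; ) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(Character.toUpperCase(cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

With this sketch, toUpper("ABC 123") returns the very same String object, while toUpper("abc") builds a new string.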


