[erlang-questions] Erlang basic doubts about String, message passing and context switching overhead
Richard A. O'Keefe
ok@REDACTED
Tue Jan 17 23:57:56 CET 2017
On 17/01/17 8:17 PM, Michał Muskała wrote:
>
> I just checked and all: Python 3, Ruby, Java, C# (so probably F# as
> well) and JavaScript properly uppercase the character "ł" using the
> built-in functions, without any additional libraries.
>
I didn't use the word "additional".
The word *additional* is your word, and is entirely
foreign to what I was talking about.
Those *ARE* libraries that they are using.
I have the source code for the Java 1.8 libraries.
java-1.8-src/java/lang/Character.java
public static char toUpperCase(char ch)
starts at line 6368. This is a library method
in a library class.
public static int toUpperCase(int ch)
starts at line 6397. This is a library method
in a library class.
It calls a method in CharacterData.
java-1.8-src/java/lang/CharacterData.java
is an abstract class which dispatches to one of 7
subclasses.
java-1.8-src/java/lang/CharacterData00.java
handles the Basic Multilingual Plane.
int toUpperCase(int ch)
starts at line 243. This is a library method
for a library method in a library class for a
library class. It does some bit fiddling and
has a fairly large switch to handle exceptions.
If new characters are added to Unicode that
need to be exceptions, the code will need
rewriting and recompiling.
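
For a concrete feel for what this layer does from the outside, here is a
minimal sketch that calls the public entry point, Character.toUpperCase(int).
The class name CharUpperDemo and the helper upper are made up for
illustration; only Character.toUpperCase belongs to the library.

```java
class CharUpperDemo {
    // Upper-case one code point through the library method discussed above.
    // (Hypothetical wrapper; it adds nothing beyond a shorter name.)
    static int upper(int codePoint) {
        return Character.toUpperCase(codePoint);
    }

    public static void main(String[] args) {
        // U+0142 is LATIN SMALL LETTER L WITH STROKE; U+0141 is its capital.
        System.out.printf("U+%04X%n", upper(0x0142)); // U+0141
        // U+00DF (sharp s) has no single-character upper case, so the
        // one-code-point API hands it back unchanged.
        System.out.printf("U+%04X%n", upper(0x00DF)); // U+00DF
    }
}
```

The sharp s case is why the single-code-point API is not enough on its
own: its upper case is two characters, which only a String-level method
can produce.
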
java-1.8-src/java/lang/String.java
public String toUpperCase(Locale locale)
starts at line 2721. The code is rather
entertaining, if you are entertained by
H. P. Lovecraft. For example, *every*
time you call this method, it fetches the
name of the locale and checks to see if it
is "tr" or "az" or "lt". If another language
with case conversion quirks is added, the
code will need rewriting and recompiling.
The locale-dependent code keeps on calling
out to a method in ConditionalSpecialCasing.
Well, actually two methods get called for
each character in general: one to get a
single character, one to get an array of
characters. At any rate, this is a library
method in a library class.
No, I am not going to dig into ConditionalSpecialCasing.
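
The locale dependence is easy to observe from outside without reading
any of that code. A minimal sketch, assuming nothing beyond
java.util.Locale and String.toUpperCase(Locale); the class
LocaleUpperDemo and its helper are hypothetical names:

```java
import java.util.Locale;

class LocaleUpperDemo {
    // Upper-case s under the locale named by a language tag such as "tr".
    static String upper(String s, String languageTag) {
        return s.toUpperCase(Locale.forLanguageTag(languageTag));
    }

    public static void main(String[] args) {
        // In English, "i" becomes plain "I" (U+0049).
        System.out.println(upper("i", "en").equals("I"));
        // In Turkish, "i" becomes capital I with dot above (U+0130).
        System.out.println(upper("i", "tr").equals("\u0130"));
        // One character in, two out: sharp s becomes "SS".
        System.out.println(upper("stra\u00DFe", "de"));
    }
}
```

The last line is the one-to-many case, presumably the reason a method
returning an array of characters is needed alongside the one returning
a single character.
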
These aren't *additional* libraries any more than
the unicode module in Erlang is an *additional*
library in Erlang. The Java library classes are
plain Java code compiled by the plain Java compiler,
which knows nothing about how they work. The
Erlang modules for handling text are plain Erlang
code compiled by the plain Erlang compiler, which
knows nothing about how they work.
Do note, though, that an abstract class with 7
subclasses and another concrete class exist only
to support Character and String; users normally
should not use them directly.
Note also that while CharacterData{00,01,02,0E,Latin1}
were generated automatically, so are presumably up to
date, ConditionalSpecialCasing, which handles Greek,
Turkish, Azeri, and Lithuanian, was *not*, and its
date is 2013. So it may well not be up to date.
The same is true of Character (2013). I have no
confidence that it's up to date. And while
ConditionalSpecialCasing has data for Greek, I note
that String doesn't consider Greek to be one of the
languages that needs it...
It really is *useful* for Unicode processing to be
in library files that can be automatically regenerated
from the Unicode data base.
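
To illustrate what regenerating from the data base might look like,
here is a rough sketch that builds a simple upper-case table from lines
in the UnicodeData.txt format (semicolon-separated fields, code point in
field 0, simple upper-case mapping in field 12). The class and method
names are invented, and this is not how the JDK or OTP generators
actually work.

```java
import java.util.HashMap;
import java.util.Map;

class UpperTableSketch {
    // A few sample lines in UnicodeData.txt format; a real generator
    // would read the whole file instead.
    static final String[] SAMPLE = {
        "0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;",
        "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
        "0142;LATIN SMALL LETTER L WITH STROKE;Ll;0;L;;;;;N;"
            + "LATIN SMALL LETTER L SLASH;;0141;;0141",
    };

    // Map each code point that has a simple upper-case mapping to it.
    static Map<Integer, Integer> build(String[] lines) {
        Map<Integer, Integer> table = new HashMap<>();
        for (String line : lines) {
            // Limit -1 keeps trailing empty fields, so indices stay fixed.
            String[] f = line.split(";", -1);
            if (f.length < 15 || f[12].isEmpty()) continue;
            table.put(Integer.parseInt(f[0], 16), Integer.parseInt(f[12], 16));
        }
        return table;
    }

    // Code points without an entry map to themselves.
    static int upper(Map<Integer, Integer> table, int codePoint) {
        return table.getOrDefault(codePoint, codePoint);
    }

    public static void main(String[] args) {
        Map<Integer, Integer> table = build(SAMPLE);
        System.out.printf("U+%04X%n", upper(table, 0x0142)); // U+0141
    }
}
```

Only the one-to-one mappings live in UnicodeData.txt. The one-to-many
and conditional ones are in SpecialCasing.txt, which a real generator
would also have to read.
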
If you want to really understand case conversion in
Unicode, spend a couple of hours figuring out
exactly how String.toUpperCase() is done in Java.
It's so hairy that they spend an extra pass over
the string trying NOT to do it.
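
The shape of that extra pass can be sketched as follows. This is not the
JDK's code: the class UpperFastPath, the helper mightChange, and its
test for which code points could change are assumptions made for
illustration, deliberately erring on the side of taking the slow path.

```java
import java.util.Locale;

class UpperFastPath {
    // Conservative guess: anything with a simple upper-case mapping, or
    // that is lower case or title case, might change when upper-cased.
    static boolean mightChange(int codePoint) {
        return Character.toUpperCase(codePoint) != codePoint
            || Character.isLowerCase(codePoint)
            || Character.isTitleCase(codePoint);
    }

    // First pass: scan for a code point that might change. If there is
    // none, return the very same String and skip the real work.
    static String upper(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (mightChange(cp)) {
                return s.toUpperCase(Locale.ROOT); // the hairy part
            }
            i += Character.charCount(cp);
        }
        return s;
    }

    public static void main(String[] args) {
        String shout = "ABC 123";
        System.out.println(upper(shout) == shout);  // true
        System.out.println(upper("stra\u00DFe"));   // STRASSE
    }
}
```

In this sketch the slow path just delegates back to the library, so the
only thing it demonstrates is the early exit that avoids allocating a
new string when nothing needs converting.
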