[erlang-questions] unicode in string literals

Richard O'Keefe ok@REDACTED
Wed Aug 1 06:14:00 CEST 2012


On 31/07/2012, at 9:36 PM, CGS wrote:

> There are many pros and cons for switching from Latin-1 to UTF-8 (or whatever else would pretty much nullify the notion of a one-byte character). On one hand, lists:reverse/1 really messes up the characters in the list

Yes, and that's not all it messes up by any means.

- If you have a sequence of lines represented as a string with network
  line terminators (CR+LF), then the reversal of that list is NOT a
  sequence of lines with network line terminators (this already applies
  to plain ASCII; see the example after this list)

- If you use Unicode language tags, then the reversal of a language
  tag is a language tag for a different language and applies to the
  wrong characters

- The reversal of a Unicode string including variant selectors (or
  other character shaping codes like ZWNJ or ZWJ) is a Unicode
  string including variant selectors &c applied to the wrong characters

- The reversal of a Unicode string including a directional command
  and a POP DIRECTIONAL FORMATTING code is a string in which there
  is a POP before anything has been pushed.

...
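
To make the first point concrete, here is a quick look in the Erlang
shell, using nothing but plain ASCII; the reversed text ends up with
LF+CR where the CR+LF line terminators used to be:

    1> lists:reverse("one\r\ntwo\r\n").
    "\n\rowt\n\reno"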

So simply forming code points into [base,diacritical...] packets,
reversing the packets, and then flattening *still* isn't nearly
enough to make sense of a reversed string.  Indeed, I am not sure
that there *is* any way to make sense of the notion of reversing
a Unicode string.
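
For what it is worth, here is a toy sketch of that packet idea (a
throwaway module, not a real grapheme segmenter: it only treats
U+0300..U+036F as combining marks).  Even after grouping and
reversing the packets, a directional pair such as RLE (U+202B) ...
PDF (U+202C) comes out with the pop before the push:

    -module(rev_sketch).
    -export([reverse_packets/1]).

    %% Group each base code point with any immediately following
    %% combining marks, reverse the groups, and flatten again.
    reverse_packets(Str) ->
        lists:append(lists:reverse(packets(Str))).

    packets([]) -> [];
    packets([C | Rest]) ->
        {Marks, Rest1} = lists:splitwith(fun is_combining/1, Rest),
        [[C | Marks] | packets(Rest1)].

    %% Only the basic combining diacritical block, for illustration.
    is_combining(C) -> C >= 16#0300 andalso C =< 16#036F.

    1> c(rev_sketch).
    {ok,rev_sketch}
    2> rev_sketch:reverse_packets([16#202B, $a, $b, 16#202C]).
    [8236,98,97,8235]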

So I do not take 'lists:reverse/1 will not reverse a string of
Unicode code points correctly' as a criticism of representing strings
as lists of Unicode code points.  NOTHING will.  I don't think there
is any such thing as "correctly" reversing such a string.

There are other operations you can easily do with a list that
don't make sense for Unicode strings either.  Take just one
example: splitting a string at an arbitrary position.  That can
separate a directional override from its pop.  And having a
distinct data type is no protection against that problem:  Java
and Javascript both have opaque string datatypes, but both
allow slicing a well formed string into pieces that are not
well formed.
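
The same thing is easy to demonstrate with plain lists:split/2
(U+202D is LEFT-TO-RIGHT OVERRIDE, U+202C is POP DIRECTIONAL
FORMATTING); the split leaves an unmatched push in one half and an
unmatched pop in the other:

    1> lists:split(2, [16#202D, $a, 16#202C, $b]).
    {[8237,97],[8236,98]}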
  
> (to follow the first example, the output of "a∞b" in Latin-1 is totally different from the output of lists:reverse("b∞a") in Latin-1 - the default now). On the other hand, having, for example, Polish characters like "Ą Ę Ć" or French "Ç Î" or German "Ö ß" or Turkish "Ş" and so on (things become more complicated if we add languages based on different alphabet/symbols) in the code would require your editor to have support for those languages or else you will see really strange characters there.

Well, yes.  But now you are asking whether the editor supports Unicode.
There are now plenty of editors that do.  Right now I am composing mail
in an unbelievably crude text editor (the Mail program on Mac OS X) and
it displays these characters just fine.
> 
> "-encoding()" can make quite a mess in a file. Think of an open source project in which devs from different countries append their own code. You will see a lot of "-encoding()" directives in a single file.

Nobody is suggesting that there should be an -encoding directive anywhere
but the first line of a file (or possibly the second).  In fact it is
precisely the existence of -encoding directives that would make it possible
to *avoid* the mess you are describing.

Here's what you do.

(1) Write a tiny little program.  Here is a first draft.

#!/usr/bin/awk -f
# Usage:   epaste.awk file1.erl... >pasted.erl
# Purpose: paste files in various encodings giving one file in UTF-8.

BEGIN {
    print "-encoding(utf_8)."
    for (i = 1; i < ARGC; i++) {
        input = ARGV[i]
        if ((getline x < input) <= 0) continue    # skip empty/unreadable files
        if (x ~ /^[ \t]*-[ \t]*encoding\([ \t']*[a-zA-Z0-9_]*[ \t']*\)/) {
            # Turn the declared encoding into an iconv name; the
            # -encoding line itself is dropped, because the pasted
            # output already carries the utf_8 directive printed above.
            sub(/^[ \t]*-[ \t]*encoding\([ \t']*/, "", x)
            sub(/[ \t']*\).*$/, "", x)
            x = toupper(x)
            gsub(/_/, "-", x)
            cmd = "iconv -f " x " -t UTF-8"
        } else {
            # No directive: assume Latin-1 and keep the first line.
            cmd = "iconv -f ISO-8859-1 -t UTF-8"
            print x | cmd
        }
        while ((getline x < input) > 0) print x | cmd
        close(cmd)
        close(input)
    }
}

(2) Instead of pasting together several files by doing
    cat foo.erl ugh.erl bar.erl >fub.erl
    just do
    epaste.awk foo.erl ugh.erl bar.erl >fub.erl

What makes this *possible* is the existence of the -encoding lines.
Without it you are FUBAR.

> I might be wrong, but, switching to default UTF-8, wouldn't that force the compiler to use 2-byte (at least) per character?

Yes, you are wrong.  Unicode is a 21-bit character set.
There are currently (Unicode 6.1) more than 100,000 defined
characters, so 2 bytes is definitely not enough.

But UTF-8 is an *external* format.
What the compiler uses is entirely up to itself.
What the run-time system uses is something different again.
Atom names, for example, could be stored in some compressed format.
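
As a quick illustration of the "external format" point, UTF-8 is a
variable-width encoding: below, $a takes one byte, U+221E (the
infinity sign) three, and U+1F600 (a code point outside the BMP) four:

    1> unicode:characters_to_binary([$a, 16#221E, 16#1F600], unicode, utf8).
    <<97,226,136,158,240,159,152,128>>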

> If so, for example, what about the databases based on Erlang for projects using strict Latin-1?

What about them?  Do not make the mistake of confusing a
particular set of characters with a way of encoding them.
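
A quick way to see the difference: the character Ö is U+00D6 no matter
how it is encoded; Latin-1 stores it as one byte, UTF-8 as two.

    %% In the Erlang shell:
    unicode:characters_to_binary([16#D6], unicode, latin1).  % <<214>>
    unicode:characters_to_binary([16#D6], unicode, utf8).    % <<195,150>>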



