[erlang-questions] The importance of Basic Unicode Understanding in Erlang

Wed Sep 28 23:14:39 CEST 2011

On 09/27/2011 05:37 PM, Frédéric Trottier-Hébert wrote:
> I've recently done some work where, due to circumstances, unicode
> woes were had by everyone. It kind of got me by surprise, and I
> figure that if it hasn't bitten you yet, it might sooner or later. As
> such, I published a blog post on the issue yesterday:
> http://ferd.ca/will-the-real-unicode-wrangler-please-stand-up.html

While I definitely agree with the sentiment that standard library 
support for Unicode string normalisation, collation, etc., is needed, 
there are some things in your blog post I want to point out:

- The "good old length and comparison functions" are not broken; they
just answer much simpler questions than the ones you're asking.
length(S) tells you how many code points are in string S, no more, no
less. Not glyphs, not graphemes, not abstract characters. Code points.
The same goes for comparisons. And for some applications, this is all
you need. It's only when you want to apply "human" ideas of order and
visual appearance that you need to use special library functions - and
if you do this, you should _know_ that this is what you're doing, not
hope that a primitive function like length(S) will guess what kind of
information you want it to compute.
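
To make that concrete, here is a small sketch for the erl shell. The two
spellings of "é" are my own illustration, not taken from your post: one
is the precomposed code point U+00E9, the other is "e" followed by
U+0301.

```erlang
%% Paste into the erl shell. Both lists render as "é", but they are
%% different sequences of code points.
Precomposed = [16#00E9].          % U+00E9 LATIN SMALL LETTER E WITH ACUTE
Decomposed  = [$e, 16#0301].      % "e" followed by U+0301 COMBINING ACUTE ACCENT

length(Precomposed).              % 1 code point
length(Decomposed).               % 2 code points
Precomposed =:= Decomposed.       % false: compared code point by code point
```

Neither answer is wrong; they are just answers about code points, not
about what a reader sees.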

- There's nothing strange about having to use ~ts instead of ~s in
format strings: similar changes have to be made in C code to handle wide
characters and multibyte encodings. Backwards compatibility with the
existing codebase is simply necessary. Yes, you have to update your
source code if you want it to work with Unicode.
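
A rough sketch of the difference, using io_lib:format/2 rather than
io:format/2 so the result does not depend on how the terminal is set up.
The two-element string is just an assumed sample input.

```erlang
%% Paste into the erl shell.
Str = [1055, 33].                              % U+041F followed by "!"

lists:flatten(io_lib:format("~ts", [Str])).    % the same two code points
catch io_lib:format("~s", [Str]).              % badarg: 1055 does not fit in a byte
```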

- If you're taking a Unicode string that starts with a combining
character and appending it to a base character, as in your [$a, U1, U2]
example, you're doing something odd (though not illegal), and I don't
see - at least not after just a brief glance - that the Unicode
specification suggests a standard way of handling this. I guess that one
could insert a zero-width space base character each time one
concatenates two strings, but that assumes that you _didn't_ actually
want the effect that you got, and who is to know that? Or you could
sanitize your input so that you don't have strings that start with
combining characters.
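
As a very rough sketch of the "sanitize your input" option, one could
test the first code point before concatenating. StartsWithCombining is a
hypothetical helper of my own, and it only looks at the Combining
Diacritical Marks block (U+0300..U+036F); other combining ranges exist
and are not covered, so treat it as an illustration only.

```erlang
%% Paste into the erl shell.
%% Hypothetical helper: true if the string starts with a code point from
%% the Combining Diacritical Marks block (U+0300..U+036F).
StartsWithCombining = fun([C | _]) when C >= 16#0300, C =< 16#036F -> true;
                         (_) -> false
                      end.

Tail = [16#0301, $x].                 % begins with U+0301 COMBINING ACUTE ACCENT
StartsWithCombining(Tail).            % true
StartsWithCombining("abc").           % false
length([$a] ++ Tail).                 % 3: ++ just joins the code points
```

What to do when the check returns true (reject, or prepend some base
character) is still a decision only the application can make.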

- Your example io:format("~ts~n",[binary_to_list(<<208,...,33>>)]) 
doesn't work because the input to ~ts is expected to be of type 
unicode:chardata(), which is an iolist-like mixture of lists of integers 
and binary fragments, where integers are expected to be _Unicode code 
points_ while binaries are interpreted as being in UTF-8. Hence, your 
first example that passed the binary directly works fine, but when you 
convert it to a list of bytes using binary_to_list, that list is not 
what ~ts expects, because it's still in UTF-8. If you do 
io:format("~ts~n",[unicode:characters_to_list(<<208,...,33>>)]) instead, 
it works fine, and is equivalent to 
io:format("~ts~n",[[1055,1088,1080,1074,1077,1090,33]]).
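
The same point as a shell sketch. To avoid guessing at the bytes elided
above, the binary here is built from the code point list instead, which
is an assumption about what your original binary contained.

```erlang
%% Paste into the erl shell.
CodePoints = [1055,1088,1080,1074,1077,1090,33].
Utf8 = unicode:characters_to_binary(CodePoints).   % UTF-8 encoded binary

Bytes = binary_to_list(Utf8).                      % list of bytes, not code points
length(Bytes).                                     % 13: six 2-byte characters plus "!"
unicode:characters_to_list(Utf8) =:= CodePoints.   % true: decoded back to code points
```

Bytes is a perfectly good list of integers, but each integer is a UTF-8
byte, so handing it to ~ts means treating every byte as a code point of
its own.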

- list_to_binary crashes on input that is outside the range 0..255 
precisely so that you may catch errors early. If you suddenly start to 
feed Unicode strings to an existing program, and it silently does 
something like truncating to bytes upon writing to file, you could end 
up with a lot more damage. It's better to be told right away that your 
code requires 8-bit data at that point, so you have a chance of figuring 
out who should do the encoding and where.
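
A short sketch of that failure and of one explicit alternative. The
sample string is assumed, and choosing UTF-8 as the encoding is a
decision the program has to make, not something list_to_binary could
guess.

```erlang
%% Paste into the erl shell.
Wide = [1055, 33].                         % U+041F followed by "!"

catch list_to_binary(Wide).                % badarg: 1055 is not a byte
unicode:characters_to_binary(Wide).        % <<208,159,33>>: explicit UTF-8 encoding
```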

     /Richard