[erlang-questions] The importance of Basic Unicode Understanding in Erlang
Richard Carlsson
carlsson.richard@REDACTED
Wed Sep 28 23:14:39 CEST 2011
On 09/27/2011 05:37 PM, Frédéric Trottier-Hébert wrote:
> I've recently done some work where, due to circumstances, unicode
> woes were had by everyone. It kind of got me by surprise, and I
> figure that if it hasn't bitten you yet, it might sooner or later. As
> such, I published a blog post on the issue yesterday:
> http://ferd.ca/will-the-real-unicode-wrangler-please-stand-up.html
While I definitely agree with the sentiment that standard library
support for unicode string normalisation, collation, etc., is needed,
there are some things in your blog post I want to point out:
- The "good old length and comparison functions" are not broken; they
just answer much simpler questions than what you're asking. length(S)
tells you how many code points are in string S, no more, no less. Not
glyphs, not graphemes, not abstract characters. Code points. The same
goes for comparisons. And for some applications, this is all you need. It's only
when you want to apply "human" ideas of order and visual appearance that
you need to use special library functions - and if you do this, you
should _know_ that this is what you're doing; not hope that a primitive
function like length(S) will guess what kind of information you want it
to compute.
- There's nothing strange about having to use ~ts instead of ~s in
format strings: similar changes have to be made in C code to handle wide
characters and multibyte encodings. Backwards compatibility with the
existing codebase is simply necessary. Yes, you have to update
your source code if you want to make it work with Unicode.
- If you're taking a Unicode string that starts with a combining
character, and then append this onto a base character, as in your [$a,
U1, U2] example, you're doing something odd (though not illegal), and I
don't see - at least not after just a brief glance - that the Unicode
specification suggests a way of handling this in a standard way. I guess
that one could insert a zero-width space base character each time one
concatenates two strings, but that assumes that you _didn't_ actually
want the effect that you got, and who is to know that? Or you could
sanitize your input so that you don't have strings that start with
combining characters.
- Your example io:format("~ts~n",[binary_to_list(<<208,...,33>>)])
doesn't work because the input to ~ts is expected to be of type
unicode:chardata(), which is an iolist-like mixture of lists of integers
and binary fragments, where integers are expected to be _Unicode code
points_ while binaries are interpreted as being in UTF-8. Hence, your
first example that passed the binary directly works fine, but when you
convert it to a list of bytes using binary_to_list, that list is not
what ~ts expects, because it's still in UTF-8. If you do
io:format("~ts~n",[unicode:characters_to_list(<<208,...,33>>)]) instead,
it works fine, and is equivalent to
io:format("~ts~n",[[1055,1088,1080,1074,1077,1090,33]]).
- list_to_binary crashes on input that is outside the range 0..255
precisely so that you may catch errors early. If you suddenly start to
feed Unicode strings to an existing program, and it silently does
something like truncating to bytes upon writing to file, you could end
up with a lot more damage. It's better to be told right away that your
code requires 8-bit data at that point, so you have a chance of figuring
out who should do the encoding and where.
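
The points above about length, comparison, chardata and list_to_binary
can be checked with a few pattern matches. What follows is only a
minimal sketch: the module name unicode_notes and the function demo/0
are made up for illustration, and the UTF-8 binary is derived from the
code point list quoted above rather than typed out by hand. Every match
is an assertion, so demo/0 returns ok only if all of them hold.

```erlang
-module(unicode_notes).
-export([demo/0]).

%% Hypothetical module and function names, for illustration only.

demo() ->
    %% The same visible character written two ways: one precomposed
    %% code point, or a base letter followed by a combining accent.
    Precomposed = [16#00E9],
    Decomposed  = [$e, 16#0301],

    %% length/1 counts list elements, i.e. code points, not graphemes.
    1 = length(Precomposed),
    2 = length(Decomposed),

    %% Comparison is element by element, so the two forms differ.
    false = (Precomposed =:= Decomposed),

    %% The code point list from the io:format example above.
    CodePoints = [1055,1088,1080,1074,1077,1090,33],

    %% Encode to UTF-8: six 2-byte Cyrillic letters plus one ASCII byte.
    Utf8 = unicode:characters_to_binary(CodePoints),
    13 = byte_size(Utf8),

    %% binary_to_list/1 hands back the UTF-8 bytes, not code points,
    %% which is why the resulting list is not what ~ts expects.
    Bytes = binary_to_list(Utf8),
    13 = length(Bytes),
    false = (Bytes =:= CodePoints),

    %% unicode:characters_to_list/1 decodes the UTF-8 and recovers
    %% the original code points.
    CodePoints = unicode:characters_to_list(Utf8),

    %% list_to_binary/1 refuses integers above 255 with badarg
    %% instead of silently truncating them.
    {'EXIT', {badarg, _}} = (catch list_to_binary(CodePoints)),
    ok.
```

Compile it with c(unicode_notes). and call unicode_notes:demo(). from
the shell. Nothing is printed on purpose, so the result does not depend
on how your terminal is set up.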
/Richard