[erlang-questions] The importance of Basic Unicode Understanding in Erlang

Frédéric Trottier-Hébert fred.hebert@REDACTED
Thu Sep 29 13:20:01 CEST 2011


On 2011-09-28, at 17:14, Richard Carlsson wrote:

> On 09/27/2011 05:37 PM, Frédéric Trottier-Hébert wrote:
>> I've recently done some work where, due to circumstances, unicode
>> woes were had by everyone. It kind of got me by surprise, and I
>> figure that if it hasn't bitten you yet, it might sooner or later. As
>> such, I published a blog post on the issue yesterday:
>> http://ferd.ca/will-the-real-unicode-wrangler-please-stand-up.html
> 
> While I definitely agree with the sentiment that standard library support for unicode string normalisation, collation, etc., is needed, there are some things in your blog post I want to point out:
> 
> - The "good old length and comparison functions" are not broken, they just answer much simpler questions than what you're asking. length(S) tells you how many code points are in string S, no more, no less. Not glyphs, not graphemes, not abstract characters. Code points. Similar for comparisons. And for some applications, this is all you need. It's only when you want to apply "human" ideas of order and visual appearance that you need to use special library functions - and if you do this, you should _know_ that this is what you're doing; not hope that a primitive function like length(S) will guess what kind of information you want it to compute.

That is a good point.
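
For anyone following along, here's a quick shell illustration of that point (a sketch; how the strings render will depend on your terminal and shell settings):

    1> Composed = [16#00E9].        %% precomposed U+00E9 (é), one code point
    2> Decomposed = [$e, 16#0301].  %% 'e' followed by a combining acute accent
    3> length(Composed).
    1
    4> length(Decomposed).
    2

Both render as the same glyph, but length/1 counts code points, so it answers a simpler question than "how many characters does a human see?"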
> 
> - There's nothing strange about having to use ~ts instead of ~s in format strings: similar changes have to be made in C code to handle wide characters and multibyte encodings. Backwards compatibility with the existing codebase is simply a necessary thing. Yes, you have to update your source code if you want to make it work on Unicode.

It is an annoyance when you compare it to languages where strings carry information about their own encoding: you print them and that's it, no questions asked. You're right that it isn't strange, though, and I should have worded it better.
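
For the archives, this is roughly what the difference looks like in the shell (assuming a UTF-8 terminal; the exact garbled output of ~s will vary):

    1> Bin = <<208,159,209,128,208,184,208,178,208,181,209,130,33>>.  %% "Привет!" in UTF-8
    2> io:format("~s~n", [Bin]).   %% bytes treated as latin-1 characters: mojibake
    3> io:format("~ts~n", [Bin]).  %% bytes interpreted as UTF-8: Привет!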
> 
> - If you're taking a Unicode string that starts with a combining character, and then append this onto a base character, as in your [$a, U1, U2] example, you're doing something odd (though not illegal), and I don't see - at least not after just a brief glance - that the Unicode specification suggests a way of handling this in a standard way. I guess that one could insert a zero-width space base character each time one concatenates two strings, but that assumes that you _didn't_ actually want the effect that you got, and who is to know that? Or you could sanitize your input so that you don't have strings that start with combining characters.

When you're writing libraries or certain kinds of applications, you sadly don't get to choose what input you receive. We've had issues in socket.io-erlang because of that. Part of the problem is that the lengths embedded in the encoded strings we get are based on whatever JavaScript counts as the string's length, not on code points. On top of that, they use a rather odd framing of the form "~Length~~m~Message~Length~~m~OtherMessage". As a result, when a control character is submitted at the beginning or the end of a message, the length you compute is different from the length that was counted earlier. You have to be extra careful when splitting your strings into independent fragments before extracting messages. This means you can't just take the length and split away (even with Unicode-compliant functions); you have to actually look at what sits at the end of the 'Message' to do it right. As far as I know, there is no way around it.
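To make the "extra care when splitting" part concrete, here's a minimal sketch -- not the actual socket.io-erlang code, and it only knows about the Combining Diacritical Marks block -- of the idea of not cutting a combining mark away from its base character:

    -module(split_sketch).
    -export([safe_split/2]).

    %% Hypothetical helper: only covers U+0300..U+036F; real code would need
    %% the full combining-class data from the Unicode tables.
    is_combining(CP) when CP >= 16#0300, CP =< 16#036F -> true;
    is_combining(_) -> false.

    %% Split a list of code points at position N, but push the cut forward
    %% past any combining marks so a base character keeps its accents.
    safe_split(String, N) when N >= length(String) ->
        {String, []};
    safe_split(String, N) ->
        {Left, [CP | _] = Right} = lists:split(N, String),
        case is_combining(CP) of
            true  -> safe_split(String, N + 1);
            false -> {Left, Right}
        end.

    %% split_sketch:safe_split([$e, 16#0301, $x], 1) returns {[101,769],[120]}
    %% rather than stranding the combining accent at the head of the right part.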

> 
> - Your example io:format("~ts~n",[binary_to_list(<<208,...,33>>)]) doesn't work because the input to ~ts is expected to be of type unicode:chardata(), which is an iolist-like mixture of lists of integers and binary fragments, where integers are expected to be _Unicode code points_ while binaries are interpreted as being in UTF-8. Hence, your first example that passed the binary directly works fine, but when you convert it to a list of bytes using binary_to_list, that list is not what ~ts expects, because it's still in UTF-8. If you do io:format("~ts~n",[unicode:characters_to_list(<<208,...,33>>)]) instead, it works fine, and is equivalent to io:format("~ts~n",[[1055,1088,1080,1074,1077,1090,33]]).

Right, and that's what I tried to explain in the text. The string does get printed, just incorrectly. That's when I get to discuss the unicode module with some examples, and add: "if you get <<101,204,129>> and convert it to a string using binary_to_list, you'll get the list [101,204,129], which is an entirely different unicode string. Using the unicode module, you'll get the right thing back"
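
In shell terms, the point in that passage is roughly:

    1> Bin = <<101,204,129>>.        %% 'e' + U+0301 combining acute, UTF-8 encoded
    2> binary_to_list(Bin).
    [101,204,129]                    %% raw bytes, not code points -- a different string entirely
    3> unicode:characters_to_list(Bin).
    [101,769]                        %% the actual code points: $e followed by U+0301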
> 
> - list_to_binary crashes on input that is outside the range 0..255 precisely so that you may catch errors early. If you suddenly start to feed Unicode strings to an existing program, and it silently does something like truncating to bytes upon writing to file, you could end up with a lot more damage. It's better to be told right away that your code requires 8-bit data at that point, so you have a chance of figuring out who should do the encoding and where.

True. It's not a bad thing, and I agree that letting it crash is a decent choice there.
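
And for completeness, the crash versus the explicit encoding step (error output abbreviated):

    1> list_to_binary([1055,1088,1080,1074,1077,1090,33]).
    ** exception error: bad argument   %% code points > 255 are refused early
    2> unicode:characters_to_binary([1055,1088,1080,1074,1077,1090,33]).
    <<208,159,209,128,208,184,208,178,208,181,209,130,33>>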

> 
>    /Richard
