[erlang-questions] The importance of Basic Unicode Understanding in Erlang

Tue Sep 27 17:37:42 CEST 2011

Hi there everyone.

I've recently done some work where, due to circumstances, Unicode woes were had by everyone. It kind of caught me by surprise, and I figure that if it hasn't bitten you yet, it might sooner or later. As such, I published a blog post on the issue yesterday: http://ferd.ca/will-the-real-unicode-wrangler-please-stand-up.html

It doesn't go into advanced details; it only gives some very simple warnings. When dealing with strings, the binary_to_list, list_to_binary and iolist_to_binary functions are all to be avoided. The length function is no longer safe, and neither are the comparison operators. When using io:format, "~s" is no longer what we want all the time, but rather "~ts", etc. This partial support is even weirder for countries and languages that depend on some Unicode characters for everyday use: Erlang source files are always assumed to be Latin-1, although the Erlang shell is fine with Unicode.
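
To make the warnings above a bit more concrete, here is a minimal sketch that uses only BIFs and the standard unicode module. The module name unicode_sketch and the function demo/0 are made up for illustration, and the "é" is spelled out as raw UTF-8 bytes so that the example does not depend on the encoding of the source file.

```erlang
-module(unicode_sketch).   %% hypothetical module name, for illustration only
-export([demo/0]).

demo() ->
    %% "Hébert" as UTF-8: the "é" is the two bytes 16#C3, 16#A9.
    Utf8 = <<"H", 16#C3, 16#A9, "bert">>,

    %% binary_to_list/1 returns bytes, not characters: 7 items for 6 letters.
    Bytes = binary_to_list(Utf8),
    7 = length(Bytes),

    %% unicode:characters_to_list/2 decodes the bytes into code points.
    Chars = unicode:characters_to_list(Utf8, utf8),
    6 = length(Chars),
    [$H, 16#E9, $b, $e, $r, $t] = Chars,

    %% list_to_binary/1 writes each integer as one byte. Here that gives
    %% Latin-1 bytes rather than UTF-8, and it fails for anything above 255.
    <<"H", 16#E9, "bert">> = list_to_binary(Chars),
    {'EXIT', {badarg, _}} = (catch list_to_binary([16#3A9])),

    %% unicode:characters_to_binary/1 encodes the code points back to UTF-8.
    Utf8 = unicode:characters_to_binary(Chars),

    %% Even on code points, length/1 and comparisons are not enough:
    %% "e" followed by U+0301 (combining acute) displays like "é" (U+00E9).
    2 = length([$e, 16#301]),
    true = ([$e, 16#301] =/= [16#E9]),

    %% "~ts" treats the argument as Unicode characters; what actually
    %% shows up depends on the terminal, so no output is claimed here.
    io:format("~ts~n", [Chars]),
    ok.
```

Calling unicode_sketch:demo() should return ok if every match above holds. Note that the last two matches are exactly the kind of thing the unicode module does not solve on its own: it converts between encodings, but it knows nothing about normalisation or clusters.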

I'm no expert in i18n affairs, but we currently have no standard library way to do basic operations such as calculating the length of strings, splitting binaries or items by clusters, performing normalisations, converting strings to uppercase/lowercase/titlecase, comparing strings, reversing them, etc. We have to rely on external libraries. While these libraries are not bad, standard implementations are usually nicer for everyone. I'm also in no position to push people to implement the libraries we need when I'm offering no monetary incentive myself.

As such, I felt like having (yet another) discussion of the issues with Unicode, and of what we think would be the ideal way to solve the problem within Erlang. Any opinions?

--
Fred Hébert
http://www.erlang-solutions.com