[erlang-questions] The importance of Basic Unicode Understanding in Erlang

Thu Sep 29 13:35:59 CEST 2011

On 2011-09-28, at 23:14 , Richard Carlsson wrote:
> 
> - The "good old length and comparison functions" are not broken, they just answer much simpler questions than what you're asking. length(S) tells you how many code points are in string S, no more, no less. Not glyphs, not graphemes, not abstract characters. Code points. Similar for comparisons. And for some applications, this is all you need. It's only when you want to apply "human" ideas of order and visual appearance that you need to use special library functions
I'm not sure I agree about that: let's imagine you send a name to a third-party tool, and that tool happens to have very precise ideas about normalization (e.g. it's an OSX API and it *will* manipulate only NFD strings). You send an NFC UTF-8 bytestring, you get back an NFD UTF-8 bytestring, and you decode it to Unicode codepoints. The two Unicode sequences are canonically equivalent, but not equal. This has little to do with "human ideas of order and visual appearance", now does it?
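
To make the round trip above concrete, here is a minimal sketch in Python, chosen only because its standard library ships a normalizer in `unicodedata`. The function `nfd_only_tool` is a hypothetical stand-in for the third-party API, and the sample name is made up for illustration.

```python
import unicodedata

def nfd_only_tool(utf8_in: bytes) -> bytes:
    """Hypothetical stand-in for a third-party API that only deals in NFD."""
    text = utf8_in.decode("utf-8")
    return unicodedata.normalize("NFD", text).encode("utf-8")

# "André" with the e-acute as one precomposed code point (NFC).
sent = "Andr\u00e9"
# Send UTF-8 bytes out, get UTF-8 bytes back, decode to code points.
received = nfd_only_tool(sent.encode("utf-8")).decode("utf-8")

# Canonically equivalent, yet a naive comparison says they differ.
naive_equal = (sent == received)
# Normalizing both sides to the same form before comparing fixes that.
normalized_equal = (
    unicodedata.normalize("NFC", sent) == unicodedata.normalize("NFC", received)
)
```

Here `naive_equal` is false and `normalized_equal` is true, with no notion of visual order or appearance involved: it is a plain equality check on data that went through someone else's API.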

> - and if you do this, you should _know_ that this is what you're doing; not hope that a primitive function like length(S) will guess what kind of information you want it to compute.
I really don't agree. A good API should make the common and right thing *easy* (and fool-proof), and the uncommon (and usually wrong) thing harder. Most string APIs do the exact opposite where Unicode is concerned, and that's bonkers. How often do you need the length in codepoints of a Unicode sequence? The length of a UTF-encoded binary stream, yes. The number of grapheme clusters (for elision or length cutoff of some character field), yes. But the number of codepoints? I don't think I've ever had this need.
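
The three "lengths" in question can be put side by side in a small Python sketch. The helper name `lengths` is made up for illustration. The cluster count is only a rough approximation that skips combining marks. It is not the full Unicode segmentation algorithm, so it would miscount things like Hangul jamo or emoji sequences.

```python
import unicodedata

def lengths(text: str):
    """Return (utf8_bytes, code_points, approx_clusters) for text."""
    utf8_bytes = len(text.encode("utf-8"))   # size on the wire
    code_points = len(text)                  # what a naive length() reports
    # Approximation: count only code points that are not combining marks.
    approx_clusters = sum(1 for ch in text if unicodedata.combining(ch) == 0)
    return utf8_bytes, code_points, approx_clusters

decomposed = lengths("e\u0301")   # e + COMBINING ACUTE ACCENT (NFD form)
precomposed = lengths("\u00e9")   # precomposed e-acute (NFC form)
print(decomposed)    # (3, 2, 1)
print(precomposed)   # (2, 1, 1)
```

Both inputs display as a single character, yet only the cluster count agrees between them. The byte count matters for storage and transport and the cluster count for truncation, while the codepoint count depends on which normalization form the string happens to be in.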

> - There's nothing strange about having to use ~ts instead of ~s in format strings: similar changes have to be made in C code to handle wide characters and multibyte encodings. Backwards compatibility with the existing codebase is simply a necessary thing. Yes, you have to update your source code if you want to make it work on Unicode.
On the other hand, Erlang is not C: it's not like I have to do pointer arithmetic or consider collection ownership to know when to manually release my arrays.