[erlang-questions] byte() vs. char() use in documentation

Tue May 10 16:13:21 CEST 2011

On Monday, May 09, 2011, Richard O'Keefe wrote:

> For *programmers*, I don't see "one thing defined in Unicode =
> one thing manipulated in the program" as particularly confusing;
> anything else (such as grapheme clusters) would confuse *me* a
> great deal more.

That right there is the crux of my position.  Unicode may or may not be a perfect representation of all the characters, strings, phrases and i18n stuff I want to handle, but it's something I understand and can work with.  For my purposes and applications, Unicode is fine.

I've asked for an example where it does not work, where it breaks down, and the only one I've heard is if I were programming a word processor in Erlang.  While this might be a valid application for Erlang, it's not the kind of thing I myself work on or can relate to.  I would have imagined word processors as having to do all sorts of specialized stuff regarding characters and how they look.  For example, I would expect a word processor to match "naïve" when I do a search on "naive", just because "i" and "ï" look similar, and "i" is easier to type.  So I'd expect word processors to do lots of specialized string manipulation and matching that I usually don't have to be concerned with.
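
As a rough illustration of that kind of specialized matching, here is a minimal sketch.  It assumes an Erlang/OTP release that provides unicode:characters_to_nfd_list/1 (OTP 20 or later).  The module and function names (loose_match, fold/1, matches/2) are made up for this example, and dropping only the U+0300..U+036F combining marks is a simplification, not what a real word processor would do.

```erlang
-module(loose_match).
-export([fold/1, matches/2]).

%% Hypothetical helper: decompose the string to NFD, then drop every
%% code point in the Combining Diacritical Marks block (U+0300..U+036F).
%% Assumes the input is a valid list of Unicode code points.
fold(Str) ->
    [C || C <- unicode:characters_to_nfd_list(Str),
          C < 16#0300 orelse C > 16#036F].

%% True when the two strings are equal once diacritics are folded away.
matches(A, B) ->
    fold(A) =:= fold(B).
```

With this, loose_match:matches([110,97,239,118,101], "naive") returns true, because NFD splits code point 239 into "i" (105) followed by the combining diaeresis (776), which fold/1 then removes.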

Most programmers just want a way to represent foreign-looking strings.  Unicode is the standard for this sort of thing, and Unicode's basic element is the codepoint, so I guess I'd expect Erlang to represent strings as lists of Unicode codepoints.  Not grapheme clusters (sounds like something to do with buckyballs), and not some specialized encoding of Unicode, like UTF-8.  "ï" is Unicode codepoint 239 (or 16#EF), not the UTF-8 encoding of that (the two bytes 16#C3, 16#AF).
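
To make the codepoint-versus-encoding distinction concrete, here is a small sketch using the standard unicode module.  The module name codepoint_demo and its two functions are invented for illustration.  Only unicode:characters_to_binary/3 comes from OTP.

```erlang
-module(codepoint_demo).
-export([naive/0, utf8/1]).

%% "naïve" written out as a list of Unicode codepoints.
%% The third element, 239 (16#EF), is "ï".
naive() ->
    [110, 97, 239, 118, 101].

%% Encode a list of codepoints as a UTF-8 binary.
utf8(CodePoints) ->
    unicode:characters_to_binary(CodePoints, unicode, utf8).
```

Here length(codepoint_demo:naive()) is 5, one element per codepoint, while byte_size(codepoint_demo:utf8(codepoint_demo:naive())) is 6, because codepoint_demo:utf8([239]) yields the two-byte binary <<195,175>>.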

>  At least if people deal with the things that are
> defined by Unicode they can appeal to the Unicode standard itself
> for help.