[erlang-questions] byte() vs. char() use in documentation

Wed May 4 20:28:17 CEST 2011

On Wed, 4 May 2011 06:05:24 am David Mercer wrote:
> On Tuesday, May 03, 2011, Anthony Shipman wrote:
> > I think a more complete design would represent a character as a binary
> > that is a UTF8 encoding of its code points. A string would then be a deep
> > list of these binaries.
>
> How is that superior to representing a character by a single integer
> representing the Unicode codepoint, and a string by a list of characters?  You
> can always use unicode:characters_to_binary/1 to convert to a UTF-8 binary
> if you wish.

What we think of as a character, e.g. some letter on a page, can be a
combination of a base component and some combining components. (I use the
word "component" since I'm not quite sure at the moment exactly what a glyph
means. Each component is represented by a code point.) Combining components
include accents and a variety of other marks that some languages attach to
the base component.  For example, in French the "e-acute" could be represented
as a single code point (U+00E9) or as the pair of code points for "e" and
"combining acute accent" (U+0065, U+0301). The standard puts some effort into
defining canonical (normalized) representations so that it isn't a total
nightmare to tell whether two characters are the same. You have to convert
both Unicode strings to the same canonical form before you can test them for
equality.
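
To make the equality problem concrete, here is a minimal sketch. It assumes
Erlang/OTP 20 or later, where the unicode module provides
unicode:characters_to_nfc_list/1; the module name canon_eq and the function
same_text/2 are made up for illustration.

```erlang
-module(canon_eq).
-export([same_text/2, demo/0]).

%% Hypothetical helper: compare two strings (lists of code points) after
%% normalizing both to NFC, one of the canonical forms the standard defines.
%% Assumes OTP 20+ for unicode:characters_to_nfc_list/1.
same_text(A, B) ->
    unicode:characters_to_nfc_list(A) =:= unicode:characters_to_nfc_list(B).

demo() ->
    Precomposed = [16#E9],       % U+00E9 LATIN SMALL LETTER E WITH ACUTE
    Decomposed  = [$e, 16#301],  % "e" followed by U+0301 COMBINING ACUTE ACCENT
    %% The first element compares the raw code point lists, the second
    %% compares them after normalization.
    {Precomposed =:= Decomposed, same_text(Precomposed, Decomposed)}.
```

Calling canon_eq:demo() should give {false, true}: the two lists of code
points differ, yet they denote the same character once normalized.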

To fully implement the intent of Unicode we need to talk in terms of
characters, i.e. things you may insert or delete in a word processor,
each of which may itself be a sequence of code points that are kept together
(what the standard calls a grapheme cluster).
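
As a rough illustration of "characters as sequences of code points", the
sketch below leans on the string module as it exists in Erlang/OTP 20 and
later (string:to_graphemes/1 and string:length/1); the module name
grapheme_sketch and the function characters/1 are hypothetical.

```erlang
-module(grapheme_sketch).
-export([characters/1, demo/0]).

%% Hypothetical helper: split a string into user-perceived characters.
%% A character made of several code points comes back as a nested list.
%% Assumes OTP 20+ for string:to_graphemes/1.
characters(Str) ->
    string:to_graphemes(Str).

demo() ->
    %% "cafe" with the final "e" followed by U+0301 COMBINING ACUTE ACCENT
    Str = [$c, $a, $f, $e, 16#301],
    {length(Str),          % number of code points
     string:length(Str),   % number of user-perceived characters
     characters(Str)}.
```

Here grapheme_sketch:demo() should return {5, 4, [99, 97, 102, [101, 769]]}:
five code points, four characters, with the accented letter kept together as
one unit.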

-- 
Anthony Shipman                    Mamas don't let your babies 
als@REDACTED                   grow up to be outsourced.