[erlang-questions] byte() vs. char() use in documentation
Anthony Shipman
als@REDACTED
Wed May 4 20:28:17 CEST 2011
On Wed, 4 May 2011 06:05:24 am David Mercer wrote:
> On Tuesday, May 03, 2011, Anthony Shipman wrote:
> > I think a more complete design would represent a character as a binary
> > that is a UTF-8 encoding of its code points. A string would then be a
> > deep list of these binaries.
>
> How is that superior to representing a character by a single integer
> representing the Unicode code point, and a string by a list of characters?
> You can always use unicode:characters_to_binary/1 to convert to a UTF-8
> binary if you wish.
What we think of as a character, e.g. some letter on a page, can be a
combination of a base component and some combining components. (I use the
word component since I'm not quite sure at the moment exactly what a glyph
means. A component is represented by a code point.) Combining components
include accents and a variety of other marks that some languages attach to
the base component. For example, in French the "e-acute" could be represented
as a single code point (U+00E9) or as the pair of code points for "e" and
"combining acute accent" (U+0065 U+0301). The standard puts some effort into
defining canonical representations (the normalization forms, such as NFC and
NFD) so that it isn't a total nightmare to tell if two characters are the
same. You have to convert both Unicode strings to the same canonical form
before you can test them for equality.
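
A minimal sketch of that equality test, written in Python only because its
standard library ships a normalizer in the unicodedata module. The helper
name canonically_equal is made up for illustration, and NFC is an arbitrary
choice here: any normalization form works as long as both sides use the
same one.

```python
import unicodedata

# Two representations of "e-acute": one precomposed code point, and a
# base letter followed by a combining mark.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" + COMBINING ACUTE ACCENT


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Comparing code point by code point sees two different sequences.
raw_equal = precomposed == decomposed

# Comparing after normalization sees the same character.
normalized_equal = canonically_equal(precomposed, decomposed)
```

Here raw_equal is False and normalized_equal is True, which is the reason a
plain list-of-integers comparison is not enough on its own.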
To fully implement the intent of Unicode we need to talk in terms of
characters, i.e. the things you may insert or delete in a word processor,
each of which may itself be a sequence of code points that are kept together
(what the standard calls a grapheme cluster).
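
As a rough illustration of "code points which are kept together", the
following Python sketch attaches each combining mark to the base code point
before it. The function name split_characters is hypothetical, and the rule
it uses (a non-zero canonical combining class) is a deliberate
simplification: the full segmentation rules in the standard cover more
cases than this.

```python
import unicodedata
from typing import List


def split_characters(text: str) -> List[str]:
    """Group each base code point with the combining marks that follow it.

    Simplification: only code points with a non-zero canonical combining
    class are attached to the preceding base code point.
    """
    clusters: List[str] = []
    for cp in text:
        if clusters and unicodedata.combining(cp) != 0:
            # Combining mark: keep it with the previous base component.
            clusters[-1] += cp
        else:
            # Base component: start a new user-visible character.
            clusters.append(cp)
    return clusters


word = "cafe\u0301"                       # "cafe" + COMBINING ACUTE ACCENT
characters = split_characters(word)       # four user-visible characters
without_last = "".join(characters[:-1])   # delete the last character as a unit
```

The string holds five code points but only four characters in the word
processor sense, and deleting the last character removes both the "e" and
its accent, leaving "caf".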
--
Anthony Shipman Mamas don't let your babies
als@REDACTED grow up to be outsourced.