[erlang-questions] byte() vs. char() use in documentation
Anthony Shipman
als@REDACTED
Tue May 3 21:33:58 CEST 2011
On Tue, 3 May 2011 07:45:49 pm Raimo Niskanen wrote:
> The programmer should regard strings as a sequence of unicode code points.
> As such they are just that and there is no encoding to bother about.
> The code point number uniquely defines which unicode character it is.
As I recall, what the user sees as a single character can be composed of
several code points (up to 7, if I remember correctly).
To quote a text book I'm looking at now:
-------------
The trick is, again, to disabuse yourself of the idea that a one-to-one
correspondence exists between "characters" as the user is used to thinking of
them and code points (or code units) in the backing store. Unicode uses the
term "character" to mean more or less "the entity that's represented by a
single Unicode code point," but this concept doesn't always match the user's
definition of "character".
-------------
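
The mismatch the book describes is easy to see by counting at each level. This is a minimal sketch in Python rather than Erlang, chosen only because its standard unicodedata module makes the counts easy to check. The example string is an assumption picked for illustration, not something from the book.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: the user sees one character.
decomposed = "e\u0301"

code_points = [f"U+{ord(c):04X}" for c in decomposed]
utf8_bytes = decomposed.encode("utf-8")

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(code_points)       # the code points in the decomposed form
print(len(utf8_bytes))   # the number of UTF-8 code units (bytes)
print(len(composed))     # the number of code points after composition
```

The same visible character is two code points or one, depending on normalization, and three bytes in the decomposed UTF-8 form.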
I think a more complete design would represent a character as a binary that is
the UTF-8 encoding of its code points. A string would then be a deep list of
these binaries.
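
A rough sketch of that representation, in Python for illustration only (bytes objects stand in for Erlang binaries, and a flat list stands in for the deep list). The helper name to_clusters is hypothetical, and the grouping rule is a deliberate simplification: a base code point plus any combining marks that follow it. Full grapheme segmentation as specified in Unicode UAX #29 has more rules than this.

```python
import unicodedata

def to_clusters(text):
    """Split text into a list of UTF-8 encoded clusters.

    Simplifying assumption: a cluster is one base code point plus the
    combining marks that follow it. This is not full UAX #29 segmentation.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch       # attach the mark to the preceding base
        else:
            clusters.append(ch)      # start a new cluster
    return [c.encode("utf-8") for c in clusters]

# "nee" with a decomposed e-acute in the middle: four code points.
s = to_clusters("ne\u0301e")
print(s)
print(len(s))
```

With this layout the length of the list is the number of user-visible characters, at the cost of elements that vary in size.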
--
Anthony Shipman                    Mamas don't let your babies
als@REDACTED                       grow up to be outsourced.