[erlang-questions] byte() vs. char() use in documentation

Anthony Shipman als@REDACTED
Tue May 3 21:33:58 CEST 2011


On Tue, 3 May 2011 07:45:49 pm Raimo Niskanen wrote:
> The programmer should regard strings as a sequence of unicode code points.
> As such they are just that and there is no encoding to bother about.
> The code point number uniquely defines which unicode character it is.

As I recall, a Unicode character can be composed of up to 7 code points.
To quote a textbook I'm looking at now:
-------------
The trick is, again, to disabuse yourself of the idea that a one-to-one 
correspondence exists between "characters" as the user is used to thinking of 
them and code points (or code units) in the backing store. Unicode uses the 
term "character" to mean more or less "the entity that's represented by a 
single Unicode code point," but this concept doesn't always match the user's 
definition of "character".
-------------
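
To make the quoted point concrete, here is a small sketch in Python (chosen
only for illustration; it is not from the textbook or from Erlang). It
assumes the standard unicodedata module and uses "e" followed by U+0301
COMBINING ACUTE ACCENT as the example. The variable names are mine.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: one character to the reader.
decomposed = "e\u0301"

# Python's len() on a str counts code points, not user-perceived characters.
code_point_count = len(decomposed)

# The same text occupies three code units (bytes) when encoded as UTF-8.
utf8_byte_count = len(decomposed.encode("utf-8"))

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)
```

The reader sees one character in both cases, yet the code point count is 2
before normalization and 1 after. Normalization only helps when a precomposed
code point exists, so it does not restore a one-to-one correspondence in
general.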

I think a more complete design would represent a character as a binary that is
the UTF-8 encoding of its code points. A string would then be a deep list of
these binaries.
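
A rough sketch of that representation, written in Python for illustration
rather than Erlang. The function name to_char_list is hypothetical, and the
grouping rule is a deliberate simplification: a combining mark is attached to
the code point before it. Full grapheme cluster segmentation has more rules
than this. A flat Python list stands in for the deep list.

```python
import unicodedata

def to_char_list(text):
    """Return a list of UTF-8 byte strings, one per base code point
    together with any combining marks that directly follow it.

    Simplified grouping: this is an assumption for the sketch, not
    complete grapheme cluster segmentation.
    """
    clusters = []
    for code_point in text:
        # A non-zero combining class means the mark modifies the previous base.
        if clusters and unicodedata.combining(code_point) != 0:
            clusters[-1] += code_point
        else:
            clusters.append(code_point)
    # Each "character" becomes the UTF-8 encoding of its code points.
    return [cluster.encode("utf-8") for cluster in clusters]

# Hypothetical usage: "e" + COMBINING ACUTE ACCENT, then "a".
example = to_char_list("e\u0301a")
```

With this layout the length of the list is the number of characters as the
user counts them, and concatenating the elements gives back the plain UTF-8
encoding of the whole string.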

-- 
Anthony Shipman                    Mamas don't let your babies 
als@REDACTED                   grow up to be outsourced.


