[erlang-questions] byte() vs. char() use in documentation

Raimo Niskanen <>
Wed May 4 09:57:34 CEST 2011


On Wed, May 04, 2011 at 05:33:58AM +1000, Anthony Shipman wrote:
> On Tue, 3 May 2011 07:45:49 pm Raimo Niskanen wrote:
> > The programmer should regard strings as a sequence of unicode code points.
> > As such they are just that and there is no encoding to bother about.
> > The code point number uniquely defines which unicode character it is.
> 
> As I recall, a Unicode character can be composed of up to 7 code points.
> To quote a text book I'm looking at now:
> -------------
> The trick is, again, to disabuse yourself of the idea that a one-to-one 
> correspondence exists between "characters" as the user is used to thinking of 
> them and code points (or code units) in the backing store. Unicode uses the 
> term "character" to mean more or less "the entity that's represented by a 
> single Unicode code point," but this concept doesn't always match the user's 
> definition of "character".
> -------------
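
To make the quoted point concrete, here is a minimal sketch of one user-perceived character that takes two code points. It assumes OTP 20 or later, where string:length/1 counts grapheme clusters and unicode:characters_to_nfc_list/1 is available; the module name grapheme_demo is only illustrative.

```erlang
-module(grapheme_demo).
-export([demo/0]).

demo() ->
    Decomposed = [$e, 16#0301],   % "e" followed by COMBINING ACUTE ACCENT
    Precomposed = [16#00E9],      % LATIN SMALL LETTER E WITH ACUTE
    %% length/1 counts code points; string:length/1 counts grapheme clusters.
    2 = length(Decomposed),
    1 = string:length(Decomposed),
    %% NFC normalization composes the pair into the single code point.
    Precomposed = unicode:characters_to_nfc_list(Decomposed),
    ok.
```

Both lists display as the same character, so comparing them with =:= without normalizing first would report them as different.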

There seems to be a terminology clash here that I will remember for the future.
When I talked about "Unicode code points" I meant the character number
in the Unicode system. I did not think it was allowed to talk about "code points"
when talking about byte encoded data. There are text books that talk about
"code points (or code units) in the backing store". I find that very confusing.
I will always call it "byte encoding" or something like that.
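
A minimal sketch of the distinction, using the unicode module: the string is a list of code point numbers, and the UTF-8 binary is one possible byte encoding of it. The module name codepoints_demo and the sample text are only illustrative.

```erlang
-module(codepoints_demo).
-export([demo/0]).

demo() ->
    %% A string is a list of Unicode code points (plain integers).
    Str = [16#E5, 16#E4, 16#F6],    % a-ring, a-diaeresis, o-diaeresis
    %% One byte encoding of the same text: UTF-8.
    Utf8 = unicode:characters_to_binary(Str, unicode, utf8),
    %% Decoding the bytes gives back the same code points.
    Str = unicode:characters_to_list(Utf8, utf8),
    %% Three code points, but six bytes in this encoding.
    {length(Str), byte_size(Utf8)}.
```

Calling codepoints_demo:demo() returns {3, 6}: the code point count does not depend on the encoding, while the byte count does.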

> 
> I think a more complete design would represent a character as a binary that is 
> a UTF8 encoding of its code points. A string would then be a deep list of 
> these binaries.
> 
> -- 
> Anthony Shipman                    Mamas don't let your babies 
>                    grow up to be outsourced.

-- 

/ Raimo Niskanen, Erlang/OTP, Ericsson AB


