[erlang-questions] byte() vs. char() use in documentation

Masklinn masklinn@REDACTED
Mon May 9 08:15:02 CEST 2011


On 2011-05-09, at 05:45 , Richard O'Keefe wrote:
> On 6/05/2011, at 6:49 PM, Masklinn wrote:
>> On 2011-05-06, at 07:34 , Richard O'Keefe wrote:
>>>> To come back to the point, we have to define what we mean with the Erlang
>>>> char() type:
>>>> - if it's an individual character then it can naturally be represented as
>>>> a single integer for its code point
>>>> - if it's a logical character then it has to be a list of integers
>>> Since we cannot know what a logical character is, and since we need *some*
>>> representation of Unicode code points, I recommend that char()=code point.
>> Why pick code points rather than grapheme clusters?
> 
> For so many many reasons I haven't the patience to list them all.
> A.  Because it is the simplest thing that could possibly work.
Sure. But so would machine integers be in Erlang, yet the core team
made a different choice.

It is also, for most users, the most confusing choice by far.

> B.  Because grapheme clusters aren't any better a fit to the user's
>    perception of a "character" than code points.
My experience runs counter to that. Do you have examples of code point
sequences in which the individual code points are a better fit for a
"user character" than the corresponding grapheme cluster?
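
To make the mismatch concrete, here is a minimal sketch in Python (not
Erlang, purely for brevity). The helper approx_clusters is hypothetical
and only approximates grapheme clusters by attaching combining marks to
the preceding code point; real boundaries are defined by UAX #29 and
cover more cases.

```python
import unicodedata

def approx_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Approximation only: real grapheme cluster boundaries (UAX #29) also
    handle Hangul jamo, ZWJ sequences, regional indicators, and more.
    """
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me are the combining marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

word = "cafe\u0301"               # "café" spelled with a combining acute accent
code_points = list(word)          # what char() = code point would see
clusters = approx_clusters(word)  # closer to what a reader would count
```

Here the user sees four characters, while the code point view has five
elements, the last of which is a bare accent.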

> C.  Because Unicode properties are defined for characters, not
>    grapheme clusters.
In most cases where this matters, either the cluster maps 1:1 to a
code point or you are only interested in the properties of the base
character. Either way, this should not be an issue in 99.9% of cases
(number pulled out of nowhere), and making grapheme clusters the base
abstraction does not deny code point access to those who need it.
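
As a rough illustration (a Python sketch, under the assumption that the
first code point of a cluster is its base character), property lookups
still work when the unit you hold is a cluster:

```python
import unicodedata

cluster = "e\u0301"  # one user-perceived character, two code points
base = cluster[0]    # assumption: the first code point is the base character

# Properties are still looked up per code point; the cluster only groups them.
base_category = unicodedata.category(base)        # general category of "e"
mark_category = unicodedata.category(cluster[1])  # general category of U+0301
base_is_alpha = base.isalpha()
```

Nothing about the cluster hides the individual code points from code
that wants them.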

> D.  Because there is a finite and not *hopelessly* large set of
>    code points, but the set of grapheme clusters is unbounded
The set of grapheme clusters is no more unbounded than the set of
code points. Larger, maybe (probably), but code points are not
arbitrarily combined into clusters.

>>>> In any case, the language must provide specific functions to work on strings
>>>> and characters. For instance, a logical character comparison must take into
>>>> account the Unicode equivalence.
>>> What do you mean "THE" equivalence?
>> I would guess he means what he linked: Unicode equivalence (as per Unicode),
>> likely compatibility equivalence (in order to equate the ligature "ﬃ" with "ffi" for instance)
> 
> Yes, but there are *several* notions of equivalence in the Unicode
> standard, which was my point.  Which of them is "THE" equivalence?
> (The one with arguably the strongest claim does NOT deal with
> compatibility mappings.)
There are *two* equivalences, and one is a subset of the other.

And I covered which equivalence would be most useful for a user not versed
in the details of Unicode (and, one should assume, not interested in them):
compatibility equivalence, so that equivalence of ligatures is handled
intuitively. Ligatures are typographical details which leaked into the
standard, so I think you would agree that, for most non-typographers, the
aforementioned "ﬃ" and "ffi" are "the same thing", even though they are
not canonically equivalent.
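
A small Python sketch of the two equivalences, using the standard
unicodedata module. The function names canonically_equal and
compatibility_equal are mine, for illustration only; the comparison
simply normalizes both sides (NFD for canonical, NFKD for compatibility)
before comparing.

```python
import unicodedata

def canonically_equal(a, b):
    """Canonical equivalence: compare the NFD forms."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def compatibility_equal(a, b):
    """Compatibility equivalence: compare the NFKD forms."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

ligature = "\uFB03"     # LATIN SMALL LIGATURE FFI
plain = "ffi"           # three separate letters

composed = "\u00E9"     # precomposed e with acute
decomposed = "e\u0301"  # e followed by a combining acute accent
```

With these definitions the composed/decomposed pair is equal under both
functions, while the ligature pair is equal only under
compatibility_equal. That is the subset relation in practice.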



