[erlang-questions] byte() vs. char() use in documentation

Thu May 5 19:24:57 CEST 2011

On 2011-05-05, at 18:44 , Anthony Shipman wrote:
> On Fri, 6 May 2011 12:36:56 am Masklinn wrote:
>>>       char() :: 0..16#10ffff
>>>       string() :: [char()]
>> 
>> It's the only way, but you cannot manipulate a Unicode string as a list
>> because it's *broken*. Sure, you don't realize it if you're an
>> english-speaking developer working only with english speakers. But that
>> does not make it not-broken.
>> 
>> And what "most developers" are content with has never been very high
>> praise. You'd think a dweller of the Erlang mailing list would be the
>> first to know it: most programmers are also content using threads and
>> locks, regardless of whether that's strictly correct or not.
> 
> I imagine an API providing for:
> 	iterating over the string returning a sequence of bytes (e.g. UTF8);
An encoding API returning a byte stream would probably be a better idea (UTF-8
is a valid encoding, but not the only one).
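
To make that concrete, here is a minimal sketch of such an encoding call,
assuming the string is held as a list of code points. The module name
encode_sketch and the function encode/2 are hypothetical names of mine; the
only OTP call used is unicode:characters_to_binary/3, which takes the target
encoding as a parameter.

```erlang
-module(encode_sketch).
-export([encode/2]).

%% encode/2 is a hypothetical wrapper, not an existing OTP function.
%% Chars is a list of code points; Encoding is utf8, {utf16, big}, latin1, ...
-spec encode([char()], unicode:encoding()) -> binary().
encode(Chars, Encoding) ->
    %% 'unicode' as the input encoding means: integers in the list are
    %% code points.
    case unicode:characters_to_binary(Chars, unicode, Encoding) of
        Bin when is_binary(Bin) ->
            Bin;
        {error, _Encoded, _Rest} ->
            %% A code point cannot be represented in the target encoding.
            erlang:error(badarg, [Chars, Encoding]);
        {incomplete, _Encoded, _Rest} ->
            erlang:error(badarg, [Chars, Encoding])
    end.
```

With this, encode_sketch:encode([16#E9], utf8) yields <<195,169>>, while
encode_sketch:encode([16#E9], {utf16, big}) yields <<0,233>>: same code point,
different byte streams depending on the encoding asked for.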

> 	iterating over the string returning a sequence of code points;
> 
> 	iterating over the string returning a sequence of normalised composite
> 	characters each perhaps in the form of a binary.
A fully opaque representation of a grapheme cluster, with a relevant set of
manipulators for that cluster, would likely be a good idea. The normalization
of the backing code point sequence should not even be relevant (note: I may
be mistaken) if the entities you're manipulating are at the grapheme cluster
level.
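
A small sketch of the difference between the code point view and the grapheme
cluster view. It assumes a string module that is grapheme-aware, which as far
as I know is only the case from OTP 20 onwards (string:length/1 there counts
grapheme clusters); the module name grapheme_sketch and demo/0 are
hypothetical.

```erlang
-module(grapheme_sketch).
-export([demo/0]).

%% Returns {CodePointsDecomposed, CodePointsPrecomposed,
%%          ClustersDecomposed, ClustersPrecomposed}.
%% Assumes OTP 20 or later, where string:length/1 counts grapheme clusters.
demo() ->
    Decomposed  = [$e, 16#301],  % "e" followed by COMBINING ACUTE ACCENT
    Precomposed = [16#E9],       % LATIN SMALL LETTER E WITH ACUTE
    {length(Decomposed),         % list length = number of code points
     length(Precomposed),
     string:length(Decomposed),  % number of grapheme clusters
     string:length(Precomposed)}.
```

grapheme_sketch:demo() returns {2,1,1,1}: the two spellings differ as lists of
code points, but each is a single unit at the grapheme cluster level, which is
the level the manipulators above would work on.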