[erlang-questions] byte() vs. char() use in documentation
Masklinn
masklinn@REDACTED
Thu May 5 19:24:57 CEST 2011
On 2011-05-05, at 18:44 , Anthony Shipman wrote:
> On Fri, 6 May 2011 12:36:56 am Masklinn wrote:
>>> char() :: 0..16#10ffff
>>> string() :: [char()]
>>
>> It's the only way, but you cannot manipulate a Unicode string as a list
>> because it's *broken*. Sure, you don't realize it if you're an
>> English-speaking developer working only with English speakers. But that
>> does not make it not-broken.
>>
>> And what "most developers" are content with has never been very high
>> praise. You'd think a dweller of the Erlang mailing list would be the
>> first to know it: most programmers are also content using threads and
>> locks, regardless of whether that's strictly correct or not.
>
> I imagine an API providing for:
> iterating over the string returning a sequence of bytes (e.g. UTF8);
An encoding API returning a byte stream would probably be a better idea (UTF-8
is a valid encoding, but not the only one).
> iterating over the string returning a sequence of code points;
>
> iterating over the string returning a sequence of normalised composite
> characters each perhaps in the form of a binary.
A fully opaque representation of a grapheme cluster, with a relevant set of
manipulators for that cluster, would likely be a good idea. The normalization
of the backing code point sequence should not even be relevant (note: I may
be mistaken) if the entities you're manipulating are at the grapheme cluster
level.
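
To make the three views concrete, here is a minimal sketch. It assumes an
Erlang/OTP release in which the string module is grapheme-aware (OTP 20 or
later, so well after this thread). The module name text_views and its
function names are hypothetical, chosen only for illustration; the calls
into unicode and string are the standard library ones.

```erlang
-module(text_views).
-export([bytes/1, code_points/1, graphemes/1, demo/0]).

%% Encoding view: the text as a list of UTF-8 bytes.
%% Another encoding could be requested through the third argument.
bytes(Text) ->
    binary_to_list(unicode:characters_to_binary(Text, unicode, utf8)).

%% Code point view: a flat list of integers in 0..16#10ffff.
code_points(Text) ->
    unicode:characters_to_list(Text).

%% Grapheme cluster view: single code points stay integers,
%% multi-code-point clusters come back as nested lists.
graphemes(Text) ->
    string:to_graphemes(Text).

demo() ->
    Decomposed = [$e, 16#301],   % "e" followed by COMBINING ACUTE ACCENT
    Composed = [16#E9],          % precomposed "e with acute"
    io:format("bytes:       ~w~n", [bytes(Decomposed)]),
    io:format("code points: ~w~n", [code_points(Decomposed)]),
    io:format("graphemes:   ~w~n", [graphemes(Decomposed)]),
    %% Both spellings are one cluster, whatever the backing code points.
    1 = string:length(Decomposed),
    1 = string:length(Composed),
    true = string:equal(Decomposed, Composed, false, nfc),
    %% A plain list reversal moves the accent in front of its base letter.
    [16#301, $e] = lists:reverse(Decomposed),
    %% The grapheme-aware reversal keeps the cluster together.
    [[$e, 16#301]] = string:reverse(Decomposed),
    ok.
```

Calling text_views:demo() should return ok if every match holds. The
lists:reverse/1 line is the "broken" case from earlier in the thread: the
list operation works on code points, so the combining accent gets separated
from the letter it belongs to.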