[erlang-questions] byte() vs. char() use in documentation

David Mercer
Thu May 5 19:07:24 CEST 2011


On Thursday, May 05, 2011, Masklinn wrote:

> It's the only way, but you can not manipulate a unicode string as a list
> because it's *broken*. Sure, you don't realize it if you're an
> english-speaking developer working only with english speakers. But that
> does not make it not-broken.

That's a fair criticism.  As an English-speaker myself, I guess I would need
a concrete example of something that "breaks" Unicode.  I realize that some
functions like "length" and "reverse" may not work so well with a broken
Unicode standard, but I don't really think those come into play in
real-world scenarios.  The only times I would really need to know the length
of a string are either to compute storage requirements (in which case a
string as a list of Unicode code points is pretty much exactly what I want),
or to measure the actual physical size a string will take up on the screen
or on paper.  For that, I would also probably be feeding the Unicode code
points into the font definition files to compute advance widths and bounding boxes,
etc., so again I don't *really* need to understand what a character *really*
is, just what will work.  As for reversing a string, I'm not sure that that
has ever been useful to me.  Is there a concrete example that we
native-English-speakers could understand so we can completely understand the
issues you foreigners have?

> And what "most developers" are content with has never been very high
> praises. You'd think a dweller of the Erlang mailing list would be the
> first to know it: most programmers are also content using threads and
> locks, regardless of whether that's strictly correct or not.

That's also a fair criticism.  I guess what I really meant is that "for all
practical purposes" strings as lists of Unicode codepoints will work.
Again, I *tried* to learn French and Mandarin when I was younger, but it
didn't stick, and this was before the computer era, anyway, so I was drawing
the Chinese characters by hand (which is harder than it looks, by the way)
rather than typing them into a computer, so I really have very little
exposure to these pathological languages that break Unicode.  I don't think
French breaks the one-character-one-Unicode-codepoint rule.  I know I can
type French words like "garçon" and "forêt", with all those foreign-looking
squiggles around the letters, without a problem, and I'm pretty sure they
"reverse" and "strlen" fine, too.  So does anyone have an example I can wrap
my head around?
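
For illustration, the "garçon" claim can be checked with a minimal sketch.
It is written in Python rather than Erlang (an assumption made for
convenience: Python strings, like the lists discussed here, are sequences of
Unicode code points), and the helper name reverse_codepoints is hypothetical.
It suggests that the precomposed spelling behaves as expected, while the same
word spelled with a combining cedilla does not:

```python
import unicodedata

# "garçon" with the precomposed c-cedilla (U+00E7): one code point per letter.
composed = "gar\u00e7on"

# The same word in decomposed form (NFD): a plain "c" followed by the
# combining cedilla U+0327. It renders identically to the composed form.
decomposed = unicodedata.normalize("NFD", composed)


def reverse_codepoints(s):
    """Reverse a string code point by code point (hypothetical helper)."""
    return s[::-1]


# Length counts code points, so the two spellings of one word disagree.
print(len(composed), len(decomposed))

# Reversing the composed form gives the expected letters in reverse order.
print(reverse_codepoints(composed))

# Reversing the decomposed form moves the combining cedilla so that it now
# follows "o" instead of "c", attaching the mark to the wrong letter.
naive = reverse_codepoints(decomposed)
expected = unicodedata.normalize("NFD", reverse_codepoints(composed))
print(naive == expected)
```

Whether a given piece of French text is precomposed or decomposed depends on
where it came from (keyboard, file system, another program), so both
spellings can turn up in practice.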

Thanks, y'all.

Cheers,

DBM



