String representation in erlang

David Hopwood david.nospam.hopwood@REDACTED
Thu Sep 15 03:30:19 CEST 2005


Instead of "abstract character", I should have said "combining character sequence".
Not that it really mattered to my argument, since the point was that you have
to deal with variable-length encodings *whichever* notion of character you're
using.

Richard A. O'Keefe wrote:
> (4) The fact that there are Web standards which call for indexing in
>     terms of *Unicode characters* and other Web standards which call
>     for indexing in terms of *UTF-16 encoding units* is clearly
>     regrettable.  If memory serves me, XPath and the DOM are examples
>     of standards that disagree in this way.

Yes, it is regrettable.

However, it is not really a big problem. You just have to do a little bit more
work to implement standards that index in terms of Unicode characters (i.e.
code points). This is rarely an efficiency issue, because the kind of
standards we're talking about here tend to impose enormous overhead anyway.
(Also, using UTF-32 kills performance relative to UTF-16 or UTF-8 due to the
higher memory bandwidth needed.)

It might be a correctness issue if implementors fail to pay attention to
either the Unicode spec or the standard they're implementing, and just assume
that indexing by code point and by UTF-16 code unit are the same thing.
Frankly I don't have a great deal of sympathy for implementors who are that
careless.
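
To make the distinction concrete, here is a minimal sketch of how the two
kinds of index diverge in java.lang.String as soon as a string contains a
character outside the BMP. The class name IndexingDemo and the helper
codePointAtIndex are made up for illustration; the String methods used
(charAt, codePointAt, codePointCount, offsetByCodePoints) are the standard
ones available since J2SE 5.0.

```java
// Sketch: UTF-16 code-unit indexing vs. code-point indexing in java.lang.String.
final class IndexingDemo {
    // Hypothetical helper: returns the code point at 0-based code-point index n.
    static int codePointAtIndex(String s, int n) {
        // Translate the code-point index into a UTF-16 code-unit offset first.
        return s.codePointAt(s.offsetByCodePoints(0, n));
    }

    public static void main(String[] args) {
        // "a", then U+1D11E (stored as a surrogate pair in UTF-16), then "b".
        String s = "a\uD834\uDD1Eb";

        // Number of UTF-16 code units versus number of code points.
        System.out.println(s.length());
        System.out.println(s.codePointCount(0, s.length()));

        // Index 2 by code unit lands on the low surrogate of U+1D11E ...
        System.out.println(Integer.toHexString(s.charAt(2)));
        // ... while index 2 by code point is the letter 'b'.
        System.out.println(Integer.toHexString(codePointAtIndex(s, 2)));
    }
}
```

For strings that stay within the BMP the two indexes coincide, which is
exactly why the careless assumption goes unnoticed for so long.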

[big snip]
> *THIS* is what I was referring to when I said that Java string indexing
> was stuffed up.  If you want to interpret
> 
>     substring(., 27, 42)
> 
> in XPath, simply doing
> 
>     currentNode.stringValue.substring(27-1, 27-1+42)
>     
> will give you the right answer often enough to trick you into expecting it
> to work all the time, but it WON'T.  In fact there is NO method in the
> java.lang.String class that does this job.

String s = currentNode.stringValue;
int start = s.offsetByCodePoints(0, 27-1);
s.substring(start, s.offsetByCodePoints(start, 42));

> (The Java API is so huge these
> days that doubtless there is a suitable method *somewhere*, but the on-line
> docs don't point you to it.)

It wasn't difficult to find.
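
For completeness, the two-line answer above can be wrapped up as a helper
that also copes with out-of-range arguments. This is only a sketch under
some assumptions: the class XPathSubstring and the method
substringByCodePoints are names I have made up, the arguments are taken to
be integers (XPath proper uses doubles and rounds them), and "character"
means code point. It keeps the positions p with start <= p < start + length,
counting from 1.

```java
// Sketch: XPath-style substring(s, start, length) counted in code points.
final class XPathSubstring {
    // Hypothetical helper, not part of java.lang.String or any XPath library.
    static String substringByCodePoints(String s, int start, int length) {
        int total = s.codePointCount(0, s.length());

        // 1-based positions to keep: from (inclusive) to toExclusive.
        // long arithmetic avoids overflow in start + length.
        long from = Math.max((long) start, 1L);
        long toExclusive = Math.min((long) start + length, (long) total + 1);
        if (from >= toExclusive) {
            return "";
        }

        // Convert code-point positions into UTF-16 code-unit offsets.
        int begin = s.offsetByCodePoints(0, (int) (from - 1));
        int end = s.offsetByCodePoints(begin, (int) (toExclusive - from));
        return s.substring(begin, end);
    }

    public static void main(String[] args) {
        // Contains U+1D11E, which occupies two UTF-16 code units.
        String s = "a\uD834\uDD1Ebc";
        System.out.println(substringByCodePoints(s, 2, 2));
        System.out.println(substringByCodePoints("12345", 0, 3));
    }
}
```

The clamping is there because offsetByCodePoints throws
IndexOutOfBoundsException when asked to step past the end of the string.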

-- 
David Hopwood <david.nospam.hopwood@REDACTED>



