String representation in erlang
Richard A. O'Keefe
Thu Sep 15 06:43:55 CEST 2005
David Hopwood <> wrote:
> It might be a correctness issue if implementors fail to pay
> attention to either the Unicode spec or the standard they're
> implementing, and just assume that indexing by code point and by
> UTF-16 code unit are the same thing.
As I commented, they are *taught* that in too many Java textbooks.
(And Windows programming textbooks too.)
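That assumption breaks as soon as a string contains a character outside the Basic Multilingual Plane. A minimal sketch (plain Java 1.5+, only java.lang.String methods; U+1D11E MUSICAL SYMBOL G CLEF becomes the surrogate pair D834 DD1E in UTF-16):

```java
public class CodeUnitsVsCodePoints {
    public static void main(String[] args) {
        // "a", then U+1D11E as a surrogate pair, then "b"
        String s = "a\uD834\uDD1Eb";
        System.out.println(s.length());                      // 4 code units
        System.out.println(s.codePointCount(0, s.length())); // 3 code points
        // Index 1 by code unit: half a surrogate pair, not a character.
        System.out.println(Integer.toHexString(s.charAt(1)));      // d834
        // Index 1 by code point: the whole character.
        System.out.println(Integer.toHexString(s.codePointAt(1))); // 1d11e
    }
}
```

So a code-unit index and a code-point index agree only up to the first supplementary character.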
> Frankly I don't have a great deal of sympathy for implementors
> who are that careless.
If programmers only had to write programs of a few hundred lines, I
would be in agreement with you. But when people have to work on large
systems with dozens of badly written specifications (and if I find *one*
W3C specification which is, at least in its first published version,
well written I will set up an altar to the authors), they really do
have to take *some* of their tools on trust.
I have been reading and re-reading the Unicode specifications since
version 1.0 came out, and I still find it dauntingly complex. If a
programmer gets confused because the natural way to use the tools in
his language doesn't agree perfectly with what Unicode has become, it's
not the programmer who deserves the principal blame.
>     int start = s.offsetByCodePoints(0, 27-1);
>     s.substring(start, s.offsetByCodePoints(start, 42));
That's nice to know, but none of the three Java versions on my
Sun box (Sun JDK 1.2.something, Sun JDK 1.4.0, and gcj) nor the
Java system on my PowerMac (also 1.4) has the least knowledge of
such a method.
> > (The Java API is so huge these
> > days that doubtless there is a suitable method *somewhere*, but the on-line
> > docs don't point you to it.)
> It wasn't difficult to find.
It was *impossible* for the Java 1.4 compilers to find it.
So I don't feel too bad about not finding it in the documentation myself.
(In fact, grepping through the Javadoc-produced HTML shows that this
method name is not *in* the documentation.)
I didn't bother looking in the 1.5 documents because a method that isn't
supported in the systems I'm actually using isn't any use to me.
When I *do* look at the Java 1.5 documents for String I find complete
nonsense like
"The length is equal to the number of 16-bit Unicode characters
in the string."
What the dickens is a "16-bit Unicode character" when it's at home?
Does this mean that it parses the string looking for things which are
both 16-bit units *and* codes of Unicode characters? (Hint: some values
you can put in a Java 'char' are permanently outside the set of Unicode
character codes.) However, I see that there are now quite a few methods
which provide a better approximation to Unicode. It's still rather sad
that the data type called 'char' in Java is not big enough to hold a
character, so that .charAt() and .codePointAt() are not compatible.
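Those 1.5 methods can at least be wrapped to give code-point indexing. The helper below, codePointSubstring, is my own hypothetical wrapper, not anything in the Java API; it assumes from and to are 0-based code-point positions, half-open:

```java
public class SubstringByCodePoints {
    // Hypothetical helper: the substring covering code points [from, to).
    // offsetByCodePoints translates a code-point count into a code-unit
    // index, stepping over surrogate pairs as single characters.
    static String codePointSubstring(String s, int from, int to) {
        int start = s.offsetByCodePoints(0, from);
        int end = s.offsetByCodePoints(start, to - from);
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb";   // 3 code points, 4 code units
        String clef = codePointSubstring(s, 1, 2);
        System.out.println(Integer.toHexString(clef.codePointAt(0))); // 1d11e
    }
}
```

Note that the same call with substring and plain code-unit arithmetic would have split the surrogate pair.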
Remember, all along my aim has been to praise Erlang, not to dispraise
Java. Java's designers standardised on 16-bit "characters" back when
Unicode had less than 40,000 characters, although it's interesting that
Sun's *C* people firmly insisted that wchar_t was to be 4 bytes from the
first. *Because* Erlang has never had a 'character' data type, it has
evaded the trap of committing to a character size (whether 8 or 16 bits
doesn't matter) which would now be too small.