[erlang-questions] String encoding and character set

Richard A. O'Keefe ok@REDACTED
Wed Jan 17 07:19:36 CET 2007


I wrote:
    > So in Unicode, "access(ing) the nth character ... by its position" is
    > not only ambiguous (do you mean "characters" or "code points") but
    > essentially useless.

Jani Hakala <jahakala@REDACTED> seems to wish for an ideal world:
	Accessing the nth glyph in case of unicode string shouldn't be
	ambiguous?

I didn't say "glyph" (which is in any case a rendering concept, *NOT*
a Unicode concept), I said "character".

	If there was a glyph type in erlang it would be possible to
	have a list of glyphs and accessing the nth glyph should be possible.
	
There *is* a type in Erlang which can be (and is) used for Unicode
>>codepoints<<, and that is 'integer'.

I have no idea what you mean by 'glyph'.  I repeat the point I made before:
a *single* logical text unit as far as some user is concerned might be
*many* Unicode codepoints, and might moreover be expressed as more than
one sequence of codepoints.  I know of several Unicode characters that can
be encoded in three different ways, and if memory serves me I once found
one that could be encoded in five.
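
To make the "more than one sequence of codepoints" point concrete, here is a small sketch. It is written in Python rather than Erlang only to keep it self-contained. The choice of "Å" as the example and the names PRECOMPOSED, ANGSTROM, DECOMPOSED and nfc are illustrative assumptions, not anything from the thread.

```python
import unicodedata

# Three code point sequences that a reader would all perceive as "Å".
PRECOMPOSED = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
ANGSTROM = "\u212B"      # ANGSTROM SIGN
DECOMPOSED = "A\u030A"   # LATIN CAPITAL LETTER A + COMBINING RING ABOVE

def nfc(text):
    """Return the NFC (canonical composed) form of text."""
    return unicodedata.normalize("NFC", text)

variants = [PRECOMPOSED, ANGSTROM, DECOMPOSED]

# len() counts code points, so the "same" character has different lengths.
lengths = [len(v) for v in variants]

# Compared code point by code point, the three sequences all differ ...
all_equal_raw = len(set(variants)) == 1

# ... but after canonical normalization they collapse to one sequence.
all_equal_nfc = len({nfc(v) for v in variants}) == 1
```

So comparing or indexing the raw codepoint sequences gives three different answers for what a user would call one character, unless the text is normalized first.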

If glyph=codepoint, then a tuple of integers gives us O(1) indexing.
If glyph=base + diacriticals, there is *NO* programming language that
has a string type that gives us O(1) indexing.
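
The contrast between the two readings can be sketched as follows, again in Python purely for illustration. The helper names (code_points, units, nth_unit) are hypothetical. Treating "any code point with a non-zero canonical combining class" as a diacritical is a deliberate simplification, and real grapheme cluster segmentation has more rules than this.

```python
import unicodedata

def code_points(text):
    """Represent text as a tuple of integer code points."""
    return tuple(ord(ch) for ch in text)

def units(text):
    """Yield each base code point together with the combining marks after it.

    Simplifying assumption: a diacritical is any code point whose canonical
    combining class is non-zero.
    """
    current = ""
    for ch in text:
        if current and unicodedata.combining(ch) != 0:
            current += ch          # attach the mark to the current base
        else:
            if current:
                yield current      # the previous unit is complete
            current = ch           # start a new unit
    if current:
        yield current

def nth_unit(text, n):
    """Return the n-th (0-based) base+diacriticals unit by scanning from the start."""
    for i, unit in enumerate(units(text)):
        if i == n:
            return unit
    raise IndexError(n)

text = "e\u0301a\u030Ab"           # e + acute, a + ring, b

cps = code_points(text)
second_code_point = cps[1]         # direct index into the tuple: O(1)
second_unit = nth_unit(text, 1)    # has to walk the text: O(n)
```

Indexing the tuple lands on the combining acute accent, which is not a unit any user would recognise. Getting the second base + diacriticals unit requires a scan, because the units have no fixed width.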



