[erlang-questions] String encoding and character set
Richard A. O'Keefe
ok@REDACTED
Wed Jan 17 07:19:36 CET 2007
I wrote:
> So in Unicode, "access(ing) the nth character ... by its position" is
> not only ambiguous (do you mean "characters" or "code points") but
> essentially useless.
Jani Hakala <jahakala@REDACTED> seems to wish for an ideal world:
> Accessing the nth glyph in case of unicode string shouldn't be
> ambiguous?
I didn't say "glyph" (which is in any case a rendering concept, *NOT*
a Unicode concept), I said "character".
> If there was a glyph type in erlang it would be possible to
> have a list of glyphs and accessing the nth glyph should be possible.
There *is* a type in Erlang which can be (and is) used for Unicode
>>codepoints<<, and that is 'integer'.
I have no idea what you mean by 'glyph'. I repeat the point I made before:
a *single* logical text unit as far as some user is concerned might be
*many* Unicode codepoints, and might moreover be expressed as more than
one sequence of codepoints. I know of several Unicode characters that can
be encoded in three different ways, and if memory serves me I once found
one that could be encoded in five.
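
To make the "more than one sequence of codepoints" point concrete, here is a small sketch in Python (chosen only for illustration, using its standard unicodedata module). The character "Å" is my own choice of example, not one named above; it has three canonically equivalent spellings.

```python
import unicodedata

# Three canonically equivalent spellings of "Å" (an illustrative choice).
spellings = [
    "\u00C5",    # LATIN CAPITAL LETTER A WITH RING ABOVE
    "\u212B",    # ANGSTROM SIGN
    "A\u030A",   # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
]

# Number of codepoints in each spelling.
lengths = [len(s) for s in spellings]

# Compared codepoint by codepoint, the three spellings are all different.
distinct_raw = len(set(spellings))

# After NFC normalization they collapse to a single form.
normalized = {unicodedata.normalize("NFC", s) for s in spellings}
```

The codepoint counts differ, and a naive comparison treats the three as different strings, so "the nth character" depends on which spelling happened to arrive.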
If glyph=codepoint, then a tuple of integers gives us O(1) indexing.
If glyph=base + diacriticals, there is *NO* programming language that
has a string type that gives us O(1) indexing.
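
The contrast can be sketched as follows, again in Python purely for illustration. The function names nth_codepoint and nth_cluster are hypothetical, and grouping by "non-combining codepoint plus the combining marks that follow it" is a deliberate simplification; it is not the full grapheme cluster segmentation that Unicode defines.

```python
import unicodedata

def nth_codepoint(codepoints, n):
    """O(1): index into a tuple of integer codepoints (0-based)."""
    return codepoints[n]

def nth_cluster(text, n):
    """O(length): return the n-th base-plus-combining-marks unit (0-based).

    Simplified assumption: a unit is one non-combining codepoint followed
    by zero or more combining marks.
    """
    index = -1
    current = []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A non-combining codepoint starts a new unit.
            if index == n:
                return "".join(current)
            index += 1
            current = [ch]
        else:
            # A combining mark attaches to the unit in progress.
            current.append(ch)
    if index == n and current:
        return "".join(current)
    raise IndexError(n)

text = "e\u0301a\u030Ab"                  # e + acute, a + ring, b
codepoints = tuple(ord(c) for c in text)  # five integers
second_codepoint = nth_codepoint(codepoints, 1)
second_cluster = nth_cluster(text, 1)
```

Indexing the tuple is a single step, but it returns the combining acute accent on its own. Finding the second user-visible unit requires scanning from the start, because there is no way to know where unit n begins without looking at everything before it.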