[erlang-questions] String encoding and character set

Richard A. O'Keefe ok@REDACTED
Wed Jan 17 05:56:31 CET 2007


"Ludovic Coquelle" <lcoquelle@REDACTED> wrote:
	My guess is that with a string format you can access the nth
	character of the message by its position, which can be very
	difficult to do with a list if the encoding supports different
	sizes for different characters (and sometimes the same character
	can have different encodings depending on the previous ones:
	contextual encoding) ... 

Note that Unicode has "floating diacriticals" AND it has a number
of "precomposed characters".
Floating diacriticals means

    user_character --> base_character floating_diacritical*

(where a floating_diacritical could be "dot below" or "accent grave
above" amongst many possibilities).  Precomposed characters means that
a large number of base_character+floating_diacritical combinations
are also assigned single code points.  For example, e-acute could be
one code point (U+00E9, identical to the Latin-1 value) or two (base e,
U+0065, followed by the combining acute accent, U+0301).
There is in fact no theoretical limit to the number or kind of floating
diacritical accents that may be added to any base.
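
To make the two spellings of e-acute concrete, here is a small sketch.
It is in Python only because its standard unicodedata module makes the
point in a few lines; the helper name same_user_text is made up for
this illustration, and comparing after NFC normalisation is one
reasonable choice, not the only one.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # base 'e' followed by COMBINING ACUTE ACCENT


def same_user_text(a, b):
    """Hypothetical helper: compare two strings after NFC normalisation."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The two spellings differ as code point sequences...
print(len(precomposed), len(decomposed))
print(precomposed == decomposed)

# ...but they are canonically equivalent, so normalising makes them equal.
print(same_user_text(precomposed, decomposed))
```

A plain code-point comparison treats the two as different strings, so
anything that compares text has to decide on a normalisation first.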

So in Unicode, "access(ing) the nth character ... by its position" is
not only ambiguous (do you mean "characters" or "code points") but
essentially useless.  You might pick up an "e", but it is _really_ just
the first part of e-acute.
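
A sketch of that pitfall, again in Python for brevity.  The function
user_characters is a made-up name, and it only approximates a "user
character" as a base code point plus the combining marks that follow
it; real grapheme segmentation has more rules than this.

```python
import unicodedata


def user_characters(text):
    """Group each base code point with the combining marks after it.

    Simplifying assumption: a user character is one base code point
    followed by zero or more code points with a nonzero combining class.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the diacritical to its base
        else:
            clusters.append(ch)     # start a new user character
    return clusters


word = "cafe\u0301"                 # "cafe" + COMBINING ACUTE ACCENT

# Indexing by code point picks up a bare "e", not the e-acute the user sees.
print(repr(word[3]))

# Indexing by user character returns the base together with its accent.
print(repr(user_characters(word)[3]))
```

Note that the two views do not even agree on the length of the word:
five code points, four user characters.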

The fundamental operations on strings are

    (1) decode binary to string using some encoding
    (2) encode string to binary using some encoding
    (3) compare using locale- and application-appropriate rules
    (4) parse, typically using regular expressions
    (5) unparse
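
As a rough illustration, the five operations might look like this in
Python.  The sample data, the pattern, and the helper compare_key are
all invented for the example, and NFC plus case folding is only a
stand-in for real locale- and application-appropriate comparison.

```python
import re
import unicodedata

raw = b"caf\xc3\xa9 42"                  # sample input, UTF-8 encoded

# (1) decode binary to string using some encoding
text = raw.decode("utf-8")

# (2) encode string to binary using some (possibly different) encoding
latin1 = text.encode("latin-1")


# (3) compare; a stand-in for locale-aware collation
def compare_key(s):
    """Hypothetical key: normalise to NFC, then fold case."""
    return unicodedata.normalize("NFC", s).casefold()


equal = compare_key("CAFE\u0301") == compare_key("caf\u00e9")

# (4) parse, here with a regular expression
match = re.fullmatch(r"(\w+) (\d+)", text)
name, count = match.group(1), int(match.group(2))

# (5) unparse, then encode again for output
out = "{} {}".format(name, count).encode("utf-8")
```

None of these steps needs to fetch the nth character by position.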

The combination {Encoding_Name_Atom,Binary} can *be* a string if
you really want.
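
A sketch of that idea, with a Python tuple standing in for the Erlang
pair.  The class name EncodedString and its methods are assumptions
made for this example, not an existing library.

```python
from typing import NamedTuple


class EncodedString(NamedTuple):
    """Hypothetical string representation: an encoding name plus raw bytes."""
    encoding: str
    data: bytes

    def decode(self) -> str:
        # Operation (1): binary to string, using the stored encoding.
        return self.data.decode(self.encoding)

    def recode(self, encoding: str) -> "EncodedString":
        # Operations (1) and (2) combined: the same text in another encoding.
        return EncodedString(encoding, self.decode().encode(encoding))


a = EncodedString("utf-8", b"caf\xc3\xa9")
b = EncodedString("latin-1", b"caf\xe9")

# As pairs they differ, but they carry the same text.
print(a == b)
print(a.decode() == b.decode())
print(a.recode("latin-1") == b)
```

The pair never has to be unpacked into a list of characters unless
one of the five operations above actually needs it.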



