[erlang-questions] String encoding and character set

Wed Jan 17 15:00:52 CET 2007

Those "floating diacritics" are handled through normalisation. A
well-designed set of string functions should be able to normalise
strings and extract a character correctly whether it was originally
encoded as one code point or two. The Unicode Character Database (UCD)
provides all the information needed for this.
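
To make that concrete, here is a rough sketch in Python rather than
Erlang, chosen only because its standard unicodedata module ships the
UCD normalisation tables. The helper names nfc and same_text are made
up for this example; NFC is just one of the normalisation forms and is
picked here as an assumption, not as the only sensible choice.

```python
import unicodedata

# e-acute written two ways: one precomposed code point, or a base
# letter followed by a combining ("floating") acute accent.
PRECOMPOSED = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
DECOMPOSED = "e\u0301"   # "e" + COMBINING ACUTE ACCENT


def nfc(text):
    """Return text in Normalization Form C (composed where possible)."""
    return unicodedata.normalize("NFC", text)


def same_text(a, b):
    """Compare two strings after normalising both (hypothetical helper)."""
    return nfc(a) == nfc(b)


# The raw code point sequences differ ...
assert PRECOMPOSED != DECOMPOSED
assert len(PRECOMPOSED) == 1 and len(DECOMPOSED) == 2

# ... but after normalisation they compare equal, and indexing picks
# up the whole e-acute instead of a bare "e".
assert same_text(PRECOMPOSED, DECOMPOSED)
assert nfc(DECOMPOSED)[0] == PRECOMPOSED
```

Normalising once on input and then working on the normalised form
keeps later comparisons and extractions simple.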

Parsing only with regexes is slow, even in languages that have
fast-ish regexes [Erlang, alas, is not one of them]. While I do like
regexes, there's a lot that can be done faster with dedicated string
manipulation functions.

-- 
dda

On 1/17/07, Richard A. O'Keefe <ok@REDACTED> wrote:
> "Ludovic Coquelle" <lcoquelle@REDACTED> wrote:
>         My guess is that with a string format you can access the nth
>         character of the message by its position, which can be very
>         difficult to do with a list if the encoding supports different
>         sizes for different characters (and sometimes the same character
>         can have different encodings depending on previous ones:
>         contextual encoding) ...
>
> Note that Unicode has "floating diacriticals" AND it has a number
> of "precomposed characters".
> Floating diacriticals means
>
>     user_character --> base_character floating_diacritical*
>
> (where a floating_diacritical could be "dot below" or "accent grave
> above" amongst many possibilities).  Precomposed characters means that
> a large number of base_character+floating_diacritical combinations
> are also assigned single code points.  For example, e-acute could be
> one code point (identical to the Latin-1 value) or two (base e followed
> by the acute diacritical).
> There is in fact no theoretical limit to the number or kind of floating
> diacritical accents that may be added to any base.
>
> So in Unicode, "access(ing) the nth character ... by its position" is
> not only ambiguous (do you mean "characters" or "code points") but
> essentially useless.  You might pick up an "e", but it is _really_ just
> the first part of e-acute.
>
> The fundamental operations on strings are
>
>     (1) decode binary to string using some encoding
>     (2) encode string to binary using some encoding
>     (3) compare using locale- and application-appropriate rules
>     (4) parse, typically using regular expressions
>     (5) unparse
>
> The combination {Encoding_Name_Atom,Binary} can *be* a string if
> you really want.
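
The grammar rule and the {Encoding_Name_Atom,Binary} pair in the quoted
message can be sketched roughly as follows, again in Python for
illustration. The names decode and user_characters are hypothetical,
and grouping by the "Mark" general category is a simplification of the
full Unicode segmentation rules, so treat this as a sketch under those
assumptions rather than a complete implementation.

```python
import unicodedata


def decode(tagged):
    """Operation (1): turn an (encoding_name, bytes) pair into a string."""
    encoding, data = tagged
    return data.decode(encoding)


def user_characters(text):
    """Group code points as: base_character floating_diacritical*.

    Simplification: a code point counts as a floating diacritical when
    its general category starts with "M" (Mark).
    """
    groups = []
    for ch in text:
        if groups and unicodedata.category(ch).startswith("M"):
            groups[-1] += ch       # attach the mark to the preceding base
        else:
            groups.append(ch)      # start a new user character
    return groups


# "e" + combining acute, then "t", then precomposed e-acute, as UTF-8.
tagged = ("utf-8", b"e\xcc\x81t\xc3\xa9")
text = decode(tagged)

assert len(text) == 4                   # four code points
assert len(user_characters(text)) == 3  # three user characters
assert text[0] == "e"                   # only the first part of e-acute
assert user_characters(text)[0] == "e\u0301"
```

Indexing by code point returns the bare "e", while indexing the grouped
list returns the whole user character, which is the ambiguity described
above.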