[erlang-questions] cookbook entry #1 - unicode/UTF-8 strings
Wed Oct 19 22:14:02 CEST 2011
Saying "Unicode string" does not fully describe a string: Unicode text
can be represented in several forms.
Q: Which way did the Erlang developers choose?
A: The Erlang way is simple: represent a string as a list of
code points. This solves one problem: there are no encodings
(UTF-8, UTF-16, UTF-32) in this form, and no UTF-16BE or
UTF-16LE either (endianness on different architectures is a problem for
network systems). But it does not solve the normalization problem.
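To make the normalization problem concrete, here is a minimal sketch. The
module name norm_demo and its function names are made up for illustration.
It assumes OTP 20 or later, where unicode:characters_to_nfc_list/1 is
available; releases from the time of this post do not have it.

```erlang
-module(norm_demo).
-export([same_raw/0, same_nfc/0]).

%% "é" written two ways: one precomposed code point (U+00E9), or "e"
%% followed by U+0301 COMBINING ACUTE ACCENT.
precomposed() -> [233].
decomposed()  -> [101, 769].

%% Plain term comparison sees two different lists of code points.
same_raw() ->
    precomposed() =:= decomposed().

%% After NFC normalization both spellings compare equal.
%% Assumption: OTP 20+ (unicode:characters_to_nfc_list/1).
same_nfc() ->
    unicode:characters_to_nfc_list(precomposed()) =:=
        unicode:characters_to_nfc_list(decomposed()).
```

Both lists would be displayed as the same character, yet same_raw/0
returns false; only the normalized comparison treats them as equal.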
Q: But how do I get my strings back to UTF-8? I want to pass it to an
A: There is another representation of Unicode text: as a binary. This
form is used for storing text and for working with external programs.
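A possible way to move between the two representations is the standard
unicode module. The wrapper module below (utf8_conv, a hypothetical name,
as are the error atoms) is only a sketch: it calls
unicode:characters_to_binary/1 and unicode:characters_to_list/1, which
default to UTF-8, and maps their error tuples to simple results.

```erlang
-module(utf8_conv).
-export([to_utf8/1, from_utf8/1]).

%% List of code points (or mixed chardata) -> UTF-8 binary.
to_utf8(Chars) ->
    case unicode:characters_to_binary(Chars) of
        Bin when is_binary(Bin)         -> {ok, Bin};
        {error, _Converted, _Rest}      -> {error, invalid_input};
        {incomplete, _Converted, _Rest} -> {error, incomplete_input}
    end.

%% UTF-8 binary -> list of code points.
from_utf8(Bin) when is_binary(Bin) ->
    case unicode:characters_to_list(Bin) of
        List when is_list(List)         -> {ok, List};
        {error, _Converted, _Rest}      -> {error, invalid_input};
        {incomplete, _Converted, _Rest} -> {error, incomplete_input}
    end.
```

Checking the return value matters because input from external programs
is not guaranteed to be valid UTF-8.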
Q: Is it easy to work with a list of code points?
A: Both yes and no.
If you have an algorithm that is based on code-point processing, then
it will be easy to implement. If you only pass text from point A to
point B, I suggest keeping the string as a binary. You can also combine
UTF-8 binaries and lists to create an iolist from them.
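As a sketch of mixing binaries and lists: the module and function names
below (chardata_demo, price/1) are invented for the example. One caveat
worth stating: a list holding code points above 255 is not a valid plain
iolist, so iolist_to_binary/1 would fail with badarg on it;
unicode:characters_to_binary/1 accepts such mixed data.

```erlang
-module(chardata_demo).
-export([price/1]).

%% Builds "Price: <Amount> €" as a UTF-8 binary from three parts:
%% a UTF-8 binary, a list of digits, and a list of code points
%% (32 is a space, 8364 is the euro sign).
price(Amount) when is_integer(Amount) ->
    Parts = [<<"Price: ">>, integer_to_list(Amount), [32, 8364]],
    unicode:characters_to_binary(Parts).
```

The parts are never flattened by hand; the conversion walks the nested
structure and encodes the code points as UTF-8 on the way out.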
A code point is not a character. It is an abstract representation of
a grapheme or of a part of one. If you run `string:len(Str)', you get the
count of these code points (not graphemes) in Str. The problem with
Erlang is its poor string-processing mechanisms (in the string module). This
problem is rare in the telecom field, but it is a hot problem for Web
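The difference between the counts can be sketched as follows. The module
name len_demo and its functions are hypothetical. graphemes/1 relies on
string:length/1, which counts grapheme clusters but only exists in
OTP 20 and later, so treat that part as an assumption about your release.

```erlang
-module(len_demo).
-export([code_points/1, utf8_bytes/1, graphemes/1]).

%% Number of code points: the length of the list.
code_points(Str) when is_list(Str) ->
    length(Str).

%% Number of bytes once the string is encoded as UTF-8.
utf8_bytes(Str) ->
    byte_size(unicode:characters_to_binary(Str)).

%% Number of grapheme clusters.
%% Assumption: OTP 20+ (string:length/1).
graphemes(Str) ->
    string:length(Str).
```

For the list [101,769] ("e" plus a combining acute accent) the three
functions give three different answers for what a reader sees as a
single character.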
Q: What does "poor string-processing mechanisms" mean?
A: There are Unicode standards
(http://unicode.org/standard/standard.html) which define algorithms
for many operations on strings. These operations can also be
locale-dependent. The most popular implementation of these operations is
ICU. There are a few interfaces to this library. For example, I am
developing a NIF interface for icu4c.
Work on this library has only just begun, but you can see
the API. The library will be ready after R15 (because some calls
into this library can block the scheduler of the VM).
I think we need to describe what 'character' means.
Either a) [49,48,8364] (i.e. it's a list of three integers;
each integer is a Unicode code point)
Or b) [49,48,226,130,172] (i.e. it's a list of bytes (also
integers, or chars in C/C++) of the UTF-8 encoded string)
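The two forms can be turned into each other. The following is a sketch
only, with made-up names (forms_demo, a_to_b/1, b_to_a/1); it assumes
the input is valid and does not handle the error tuples that the unicode
functions return for malformed data.

```erlang
-module(forms_demo).
-export([a_to_b/1, b_to_a/1]).

%% Form a) list of code points -> form b) list of UTF-8 bytes.
a_to_b(CodePoints) ->
    binary_to_list(unicode:characters_to_binary(CodePoints)).

%% Form b) list of UTF-8 bytes -> form a) list of code points.
b_to_a(Bytes) ->
    unicode:characters_to_list(list_to_binary(Bytes)).
```

Nothing in a list itself says which form it is in, so the program has to
keep track of that; applying a_to_b/1 to a list that is already in form
b) would encode it twice.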