[erlang-questions] cookbook entry #1 - unicode/UTF-8 strings

Tristan Sloughter tristan.sloughter@REDACTED
Wed Oct 19 22:27:07 CEST 2011


Do you have a process for getting comments and reviews as the book is
written, besides emails like this?

I really thought the way the Real World Haskell authors did theirs was
amazing: http://book.realworldhaskell.org/read/

You can comment on, and see all comments for each section in a chapter and
get an RSS feed of changes for chapters.

I could have sworn I once found the repo where they host the code for the
entire site, but I can't find it again...

Tristan

On Wed, Oct 19, 2011 at 3:14 PM, Michael Uvarov <freeakk@REDACTED> wrote:

> When we say "Unicode string", we have not fully described the string.
> Unicode text can be represented in many forms.
>
> Q: Which way did the Erlang developers choose?
> A: The Erlang way is simple: represent a string as a list of code
> points. This solves one problem: the form has no encoding (UTF-8,
> UTF-16, UTF-32), and therefore no UTF-16BE or UTF-16LE either
> (endianness on different architectures is a problem for networked
> systems). But it does not solve the normalization problem.
>
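
A minimal sketch of the normalization point, in case it helps the entry.
The module and function names are made up for illustration, and the
second comparison assumes OTP 20 or later, where the unicode module has
normalization functions (they did not exist in R14/R15):

```erlang
-module(normalization_demo).
-export([precomposed/0, decomposed/0, equal_as_lists/0, equal_after_nfc/0]).

%% "e with acute accent" as one precomposed code point, U+00E9.
precomposed() -> [16#E9].

%% The same text as "e" (U+0065) plus a combining acute accent (U+0301).
decomposed() -> [16#65, 16#301].

%% Plain list comparison treats the two forms as different strings.
equal_as_lists() ->
    precomposed() =:= decomposed().

%% Assumption: OTP 20+. Normalizing both sides to NFC makes them comparable.
equal_after_nfc() ->
    unicode:characters_to_nfc_list(precomposed()) =:=
        unicode:characters_to_nfc_list(decomposed()).
```

Here equal_as_lists() is false and equal_after_nfc() is true: a list of
code points removes the encoding question but not the normalization one.
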
> Q: But how do I get my strings back to UTF-8? I want to pass them to
> another application.
> A: There is another representation of Unicode text: a binary. This
> form is used for storing text and for working with external programs.
>
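
One way to do that conversion is unicode:characters_to_binary/1, which
encodes to UTF-8 by default. A hedged sketch; the wrapper module, the
function name and the error atoms are my own invention, only the unicode
call is from OTP:

```erlang
-module(utf8_demo).
-export([to_utf8/1]).

%% Encode a list of code points as a UTF-8 binary.
%% unicode:characters_to_binary/1 returns either the binary or a tagged
%% tuple describing where conversion stopped, so all three cases are handled.
to_utf8(CodePoints) ->
    case unicode:characters_to_binary(CodePoints) of
        Bin when is_binary(Bin) ->
            {ok, Bin};
        {error, _Converted, _Rest} ->
            {error, invalid_input};      % hypothetical error atom
        {incomplete, _Converted, _Rest} ->
            {error, incomplete_input}    % hypothetical error atom
    end.
```

For example, utf8_demo:to_utf8([49,48,8364]) gives
{ok,<<49,48,226,130,172>>}, which is the byte sequence from example b)
further down.
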
> Q: Is it easy to work with a list of code points?
> A: Both yes and no.
> Advantages:
> If you have an algorithm that is based on code point processing, then
> it will be easy to implement. If you only pass text from point A to
> point B, I suggest keeping the string as a binary. You can also
> combine UTF-8 binaries and lists to build an iolist from them.
>
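
A small sketch of mixing binaries and lists. One caveat worth stating in
the entry: a strict iolist may only contain bytes (0..255) and binaries,
so a code point such as 8364 makes iolist_to_binary/1 fail with badarg.
The unicode module accepts the mixed form and encodes it. The module and
function names here are illustrative only:

```erlang
-module(chardata_demo).
-export([price_line/1]).

%% Build one UTF-8 binary from a mix of UTF-8 binaries and code point lists.
%% Amount is expected to be a UTF-8 binary such as <<"10">>.
price_line(Amount) when is_binary(Amount) ->
    Euro = [16#20AC],  % code point of the euro sign, 8364 in decimal
    unicode:characters_to_binary([<<"Total: ">>, Amount, Euro]).
```

chardata_demo:price_line(<<"10">>) returns the bytes of "Total: 10"
followed by 226,130,172.
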
> Disadvantages:
> A code point is not a character. It is an abstract representation of
> a grapheme or a part of one. If you run `string:len(Str)', you get the
> count of code points (not graphemes) in Str. Erlang's weakness is the
> poor string-processing support in the string module. This problem
> rarely matters in the telecom field, but it is a pressing one for Web
> applications.
>
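
To make the code point versus grapheme difference concrete, here is a
sketch with illustrative names. The grapheme count assumes OTP 20 or
later, where string:length/1 counts grapheme clusters; on R14/R15 only
the code point count is available:

```erlang
-module(length_demo).
-export([code_points/1, graphemes/1]).

%% Number of code points: just the length of the list.
code_points(Str) when is_list(Str) ->
    length(Str).

%% Assumption: OTP 20+. string:length/1 counts grapheme clusters,
%% so a base letter plus a combining mark counts as one.
graphemes(Str) when is_list(Str) ->
    string:length(Str).
```

For [101,769] ("e" followed by a combining acute accent) code_points/1
returns 2 while graphemes/1 returns 1.
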
> Q: What do poor string-processing mechanisms mean?
> A: There are Unicode standards
> (http://unicode.org/standard/standard.html) which define algorithms
> for many operations on strings. These operations can also be
> locale-dependent. The most popular implementation of these operations
> is ICU. There are a few interfaces to this library. For example, I am
> developing a NIF interface for icu4c.
> Work on this library is only beginning, but you can already see
> the API. The library will be ready after R15 (because some operations
> in this library can block the VM's scheduler).
>
> https://github.com/freeakk/i18n
>
>
> I think we need to describe what 'character' means.
>
> Fix.
>  Either a) [49,48,8364]           (i.e. it's a list of three integers;
> each integer is a Unicode code point)
>  Or     b) [49,48,226,130,172]    (i.e. it's a list of bytes (also
> integers, or chars in C/C++) of the UTF-8 encoded string)
>
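
Both list forms can be produced from the same UTF-8 binary, which might
be a useful illustration next to a) and b). A sketch with made-up module
and function names:

```erlang
-module(list_forms_demo).
-export([code_points/1, utf8_bytes/1]).

%% Form a): decode the UTF-8 binary into a list of code points.
code_points(Utf8Bin) when is_binary(Utf8Bin) ->
    unicode:characters_to_list(Utf8Bin).

%% Form b): take the raw bytes of the binary without decoding them.
utf8_bytes(Utf8Bin) when is_binary(Utf8Bin) ->
    binary_to_list(Utf8Bin).
```

With Bin = <<49,48,226,130,172>>, code_points(Bin) is [49,48,8364] and
utf8_bytes(Bin) is [49,48,226,130,172].
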
> --
> Best regards,
> Uvarov Michael