[erlang-questions] cookbook entry #1 - unicode/UTF-8 strings

Thu Oct 20 09:30:26 CEST 2011

On Wed, Oct 19, 2011 at 10:27 PM, Tristan Sloughter
<tristan.sloughter@REDACTED> wrote:
> Do you have a process for getting comments and reviews as the book is
> written, besides emails like this?

Not yet -

> I really thought the way the Real World Haskell authors did theirs was
> amazing: http://book.realworldhaskell.org/read/
> You can comment on, and see all comments for each section in a chapter and
> get an RSS feed of changes for chapters.

I just looked, and it's really nice. Any idea what the underlying
representation is?

It seems like each top-level entity (paragraph, etc.) is tagged with a
"comment point", which I guess is just an entry in a database.

> I could have sworn I found a repo that they host the entire site's code for
> all that but I can't find it once again...

I wonder if the code to host the editing/review system is open source?
- I'd like basically an empty framework
for something like this.

Any idea what the authors used? Is there a kind of "author mode"
where they can change the text, or do they
have to hack XML files (or something) in the background?

/Joe

> Tristan
> On Wed, Oct 19, 2011 at 3:14 PM, Michael Uvarov <freeakk@REDACTED> wrote:
>>
>> When we say "Unicode string", we do not provide full information about
>> this string. There are many forms of Unicode.
>>
>> Q: Which way did the Erlang developers choose?
>> A: The Erlang way is simple: represent a string as a list of
>> code points. This solves one problem: there are no encodings
>> (UTF-8, UTF-16, UTF-32) in this form, and no UTF-16BE or
>> UTF-16LE either (endianness on different architectures is a problem
>> for network systems). But it does not solve the normalization problem.
>>
>> Q: But how do I get my strings back to UTF-8? I want to pass them to
>> another application.
>> A: There is another representation of Unicode text: a binary. This
>> form is used for storing text and for working with external programs.
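
A minimal sketch of that round trip, assuming the standard `unicode` module from stdlib. The module name utf8_roundtrip and both function names are made up for illustration:

```erlang
-module(utf8_roundtrip).
-export([to_utf8/1, from_utf8/1]).

%% Encode a list of code points as a UTF-8 binary.
%% On bad input, unicode:characters_to_binary/1 returns an
%% {error, ...} or {incomplete, ...} tuple instead of a binary;
%% this sketch passes that result through unchanged.
to_utf8(CodePoints) when is_list(CodePoints) ->
    unicode:characters_to_binary(CodePoints).

%% Decode a UTF-8 binary back into a list of code points.
from_utf8(Bin) when is_binary(Bin) ->
    unicode:characters_to_list(Bin).
```

With the "10 euro" string from this thread, to_utf8([49,48,8364]) gives <<49,48,226,130,172>>, and from_utf8/1 turns that binary back into the original list.
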
>>
>> Q: Is it easy to work with a list of code points?
>> A: Both yes and no.
>> Advantages:
>> If you have an algorithm that is based on code-point processing, then
>> it will be easy to implement. If you only pass text from point A to
>> point B, I suggest keeping the string as a binary. You can also use
>> UTF-8 binaries and lists together to create an iolist from them.
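
A small sketch of mixing the two forms, with one caveat: a strict iolist may only contain bytes (0..255), so iolist_to_binary/1 rejects a list holding a code point such as 8364. The `unicode` module accepts such mixed data (it calls it chardata) and flattens it for you. The module name chardata_mix and its functions are hypothetical:

```erlang
-module(chardata_mix).
-export([flatten_to_utf8/1, example/0]).

%% Flatten mixed chardata (nested lists of code points and UTF-8
%% binaries) into a single UTF-8 binary.
flatten_to_utf8(Chardata) ->
    unicode:characters_to_binary(Chardata).

%% Build a nested structure without copying the parts, then flatten
%% it once at the end.
example() ->
    Price = [<<"10">>, [8364]],          % UTF-8 binary + code-point list
    flatten_to_utf8([<<"price: ">>, Price, " ok"]).
```

Here example() should return the bytes of "price: 10", then 226,130,172 for the euro sign, then " ok", all in one binary.
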
>>
>> Disadvantages:
>> A code point is not a character. It is an abstract representation of
>> a grapheme or part of one. If you run `string:len(Str)', you get the
>> count of these code points (not graphemes) in Str. Erlang's problem
>> is its poor string-processing mechanisms (in the string module). This
>> problem is rare in the telecom field, but it is a hot problem for Web
>> applications.
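
To make the code point versus grapheme difference concrete, here is a sketch using "é", which can be written either as one precomposed code point (233) or as "e" followed by a combining acute accent (101, 769). The module name codepoint_count and its functions are invented for this example; it uses plain length/1, which counts list elements in the same way string:len/1 does:

```erlang
-module(codepoint_count).
-export([decomposed/0, precomposed/0, sizes/1]).

%% "e" (U+0065) followed by COMBINING ACUTE ACCENT (U+0301).
decomposed() -> [101, 769].

%% The single precomposed code point U+00E9.
precomposed() -> [233].

%% Return {NumberOfCodePoints, NumberOfUtf8Bytes} for a string.
%% Neither number is the count of user-visible characters.
sizes(CodePoints) ->
    Utf8 = unicode:characters_to_binary(CodePoints),
    {length(CodePoints), byte_size(Utf8)}.
```

Both lists display as one character, yet sizes/1 gives {2,3} for the decomposed form and {1,2} for the precomposed one, and the two lists do not compare equal. That last point is the normalization problem mentioned earlier.
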
>>
>> Q: What do poor string-processing mechanisms mean?
>> A: The Unicode standards
>> (http://unicode.org/standard/standard.html) define algorithms
>> for many string operations. These operations can also be
>> locale-dependent. The most popular implementation of these operations
>> is ICU. There are a few interfaces to this library. For example, I am
>> developing a NIF interface for icu4c.
>> Work on this library is only beginning, but you can already see
>> the API. The library will be ready after R15 (because some operations
>> in this library can freeze the VM's scheduler).
>>
>> https://github.com/freeakk/i18n
>>
>>
>> I think we need to describe what 'character' means.
>>
>> Fix.
>>  Either a) [49,48,8364]           (i.e. it's a list of three integers.
>> Each integer is a Unicode code point)
>>  Or     b) [49,48,226,130,172]    (i.e. it's a list of bytes (also
>> integers, or chars from C/C++) of the UTF-8 encoded string)
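
The two forms look alike in the shell, so they are easy to confuse. Below is a sketch of converting form (b) to form (a), plus the mistake to avoid. The module name byte_list and its function names are hypothetical; the calls into `unicode` and list_to_binary/1 are standard:

```erlang
-module(byte_list).
-export([bytes_to_codepoints/1, double_encode/1]).

%% Form (b) -> form (a): pack the bytes into a binary first, then
%% decode that binary as UTF-8.
bytes_to_codepoints(Bytes) ->
    unicode:characters_to_list(list_to_binary(Bytes)).

%% The pitfall: passing form (b) straight to the unicode module treats
%% every byte as a code point and encodes it a second time.
double_encode(Bytes) ->
    unicode:characters_to_binary(Bytes).
```

bytes_to_codepoints([49,48,226,130,172]) gives [49,48,8364], while double_encode/1 on the same list gives <<49,48,195,162,194,130,194,172>>, the classic double-encoded output.
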
>>
>> --
>> Best regards,
>> Uvarov Michael
>
>