[erlang-questions] Erlang string datatype [cookbook entry #1 - unicode/UTF-8 strings]

Sun Oct 30 06:12:42 CET 2011

> I would be perfectly fine with a proposal that said "we use 4-byte
> characters, just like Linux wchar_t."
> I would also be OK with a proposal that said "we use 2-byte characters,
> just like Windows, and only support the 65535 character subset."
> Significantly better performance, slightly worse coverage of 10646.

Code points and code elements (in UTF-16) are not a big deal; their
importance is greatly exaggerated. Give me an example where "O(1) INDEXING
OF UNICODE CODE POINTS" is useful.
Here is the opinion of ICU's developers.

> Using UTF-8 strings with ICU
> As mentioned in the overview of this chapter, ICU and most other
> Unicode-supporting software uses 16-bit Unicode for internal processing.
> However, there are circumstances where UTF-8 is used instead. This is
> usually the case for software that does little or no processing of
> non-ASCII characters, and/or for APIs that predate Unicode, use byte-based
> strings, and cannot be changed or replaced for various reasons.
> A common perception is that UTF-8 has an advantage because it was designed
> for compatibility with byte-based, ASCII-based systems, although it was
> designed for string storage (of Unicode characters in Unix file names)
> rather than for processing performance.
> While ICU mostly does not natively use UTF-8 strings, there are many ways
> to work with UTF-8 strings and ICU. For more information see the newer
> UTF-8 subpage.
>

> Using UTF-32 strings with ICU
> It is even rarer to use UTF-32 for string processing than UTF-8. While
> 32-bit Unicode is convenient because it is the only fixed-width UTF, there
> are few or no legacy systems with 32-bit string processing that would
> benefit from a compatible format, and the memory bandwidth requirements of
> UTF-32 diminish the performance and handling advantage of the fixed-width
> format.
> Over time, the wchar_t type of some C/C++ compilers became a 32-bit
> integer, and some C libraries do use it for Unicode processing. However,
> application software with good Unicode support tends to have little use for
> the rudimentary Unicode and Internationalization support of the standard
> C/C++ libraries and often uses custom types (like ICU's) and UTF-16 or
> UTF-8.

From http://userguide.icu-project.org/strings
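
To make the two points above concrete, here is a minimal sketch in Python (chosen only for illustration; it is not taken from the ICU guide, and the helper name encoded_sizes is hypothetical). It shows how many bytes the same text takes in each encoding form, and that indexing by code point does not give you a user-perceived character once combining marks are involved.

```python
import unicodedata


def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# ASCII text: UTF-32 needs four times the memory of UTF-8.
ascii_sizes = encoded_sizes("hello")

# A code point outside the 16-bit range (U+1F600): UTF-16 needs a surrogate
# pair, so a "2-byte characters only" scheme cannot represent it at all.
emoji_sizes = encoded_sizes("\U0001F600")

# "cafe" followed by U+0301 COMBINING ACUTE ACCENT: five code points,
# but a reader sees four characters.
decomposed = "cafe\u0301"
fourth_code_point = decomposed[3]          # the bare letter "e", without its accent
composed = unicodedata.normalize("NFC", decomposed)

print(ascii_sizes)
print(emoji_sizes)
print(len(decomposed), len(composed))
```

Even with fixed-width code points, decomposed[3] returns only half of what the user sees as one character, so O(1) code point indexing does not by itself buy correct text processing.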

-- 
Best regards,
Uvarov Michael