[erlang-questions] Unicode question

Sat Mar 13 17:28:09 CET 2010

<< <<C/utf8>> || C <- [20320,22909,9786] >>.
<<228,189,160,229,165,189,226,152,186>>
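
The same utf8 segment type also works as a pattern, so the bit syntax can go in the other direction too. A minimal sketch, assuming well-formed UTF-8 input (the module and function names are made up for illustration, and nothing here validates the binary):

```erlang
-module(utf8_sizes).
-export([codepoints/1, sizes/1]).

%% Decode a UTF-8 binary into a list of code points, using a
%% binary generator with a utf8 segment as the pattern.
codepoints(Bin) when is_binary(Bin) ->
    [C || <<C/utf8>> <= Bin].

%% Number of bytes each code point occupies when encoded as UTF-8.
sizes(Bin) when is_binary(Bin) ->
    [byte_size(<<C/utf8>>) || <<C/utf8>> <= Bin].
```

For the binary above, utf8_sizes:codepoints/1 gives [20320,22909,9786] and utf8_sizes:sizes/1 gives [3,3,3].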

unicode:characters_to_list(<<228,189,160,229,165,189,226,152,186>>, utf8).
[20320,22909,9786]

The name of the function is a bit strange, though: unicode:characters_to_list also decodes a UTF-8 binary into code points.
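
For the truncation itself, a full round trip through a code point list may not be needed. A rough sketch of one way to do it with binary pattern matching, walking the binary one character at a time and stopping before the byte budget is exceeded (utf8_trunc, truncate/2 and take/3 are hypothetical names, not part of OTP):

```erlang
-module(utf8_trunc).
-export([truncate/2]).

%% Return the longest prefix of Bin that is at most MaxBytes long
%% and ends on a UTF-8 character boundary.
truncate(Bin, MaxBytes)
  when is_binary(Bin), is_integer(MaxBytes), MaxBytes >= 0 ->
    take(Bin, MaxBytes, <<>>).

%% Left is the remaining byte budget, Acc the prefix built so far.
take(<<C/utf8, Rest/binary>>, Left, Acc) ->
    Size = byte_size(<<C/utf8>>),
    if
        Size =< Left -> take(Rest, Left - Size, <<Acc/binary, C/utf8>>);
        true         -> Acc   % next character would exceed the budget
    end;
take(_, _, Acc) ->
    %% End of input, or bytes that are not valid UTF-8: stop here.
    Acc.
```

With the example input and a limit of 5 bytes, utf8_trunc:truncate/2 returns <<228,189,160>>, i.e. just the first character. With a limit of 6 it returns the first two characters. Stopping silently at invalid bytes is a choice made for this sketch; raising an error there may suit other uses better.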

/Tony

On 13 mar 2010, at 04.07, Brian Acton wrote:

> Hi guys,
> 
> I've tried to reduce my problem to a simple sub-problem that, hopefully,
> someone can provide some insight into.
> 
> Suppose I am given the following UTF-8 encoded input string:
> [ 16#e4, 16#bd, 16#a0, 16#e5, 16#a5, 16#bd, 16#e2, 16#98, 16#ba ]
> 
> Which translates into three Unicode code points:
> [20320,22909,9786]
> 
> Now, I would like to shorten the string intelligently by choosing a
> substring that retains character boundaries but also fits within a limited
> number of bytes.
> 
> For example, let's say my byte limit is 5 bytes.
> 
> In my example above, I can only take the first 3 bytes, yielding the first
> character. If I try to take the next character, I will have used 6 bytes
> (each character in my example uses 3 bytes in UTF-8) and gone over my
> budgeted byte allocation.
> 
> So, in order to solve this, I figured that I would need to be able to
> convert UTF-8 to code points and code points back to UTF-8.
> 
> The problem is that I can't figure out how to convert UTF-8 to code points.
> Everything that I have looked at covers code point to UTF-8 conversion, but I
> have not found the inverse function, which converts UTF-8 to code points.
> 
> Am I going about this all wrong? I've read the unicode page pretty
> extensively and I couldn't find anything. I also came across EEP10, but it
> does not look like it has been completely implemented (notably,
> unicode:utf8_to_list is missing). I am using R13B03.
> 
> Thanks in advance,
> 
> --b