[erlang-questions] Unicode question

Sat Mar 13 17:33:42 CET 2010

Hi Brian,

Here is how to use the unicode module to convert between UTF-8 bytes and Unicode code points:

1> L = [ 16#e4, 16#bd, 16#a0, 16#e5, 16#a5, 16#bd, 16#e2, 16#98, 16#ba ].
[228,189,160,229,165,189,226,152,186]
2> unicode:characters_to_list(list_to_binary(L)). 
[20320,22909,9786]
3> unicode:characters_to_binary([20320,22909,9786]).
<<228,189,160,229,165,189,226,152,186>>
4> binary_to_list(unicode:characters_to_binary([20320,22909,9786])).
[228,189,160,229,165,189,226,152,186]
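
For the truncation itself, you may not need to go through code points at all. Here is a minimal sketch, assuming unicode:characters_to_binary/1 returns {incomplete, Encoded, Rest} when its input ends in the middle of a character (that is what the unicode documentation describes). The module name utf8_trunc and the function truncate/2 are made up for illustration, and I have not tried this on R13B03:

```erlang
-module(utf8_trunc).
-export([truncate/2]).

%% truncate(Utf8, MaxBytes) -> binary()
%% Returns the longest prefix of the UTF-8 binary Utf8 that is at most
%% MaxBytes bytes long and ends on a character boundary.
%% (Hypothetical helper, not part of OTP.)
truncate(Utf8, MaxBytes)
  when is_binary(Utf8), is_integer(MaxBytes), byte_size(Utf8) =< MaxBytes ->
    %% Already within the budget: nothing to cut.
    Utf8;
truncate(Utf8, MaxBytes)
  when is_binary(Utf8), is_integer(MaxBytes), MaxBytes >= 0 ->
    %% Cut at the byte limit, possibly in the middle of a character.
    <<Prefix:MaxBytes/binary, _/binary>> = Utf8,
    case unicode:characters_to_binary(Prefix) of
        %% The cut happened to land on a character boundary.
        Bin when is_binary(Bin) -> Bin;
        %% The cut split a character: keep only the complete part.
        {incomplete, Complete, _Rest} -> Complete;
        %% The input was not valid UTF-8: keep the valid part before the error.
        {error, Complete, _Rest} -> Complete
    end.
```

With your example and a 5 byte limit, utf8_trunc:truncate(list_to_binary(L), 5) should give <<228,189,160>>, i.e. just the first character. The function only ever looks at the first MaxBytes bytes, so the cost does not depend on how long the full string is.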

Regards,

Dominic Williams
http://dominicwilliams.net

On 13 March 2010 at 04:07, Brian Acton wrote:

> Hi guys,
> 
> I've tried to reduce my problem to a simple subproblem in the hope that
> someone can provide some insight.
> 
> Suppose I am given the following UTF-8 encoded input string:
> [ 16#e4, 16#bd, 16#a0, 16#e5, 16#a5, 16#bd, 16#e2, 16#98, 16#ba ]
> 
> This translates into three Unicode code points:
> [20320,22909,9786]
> 
> Now, I would like to shorten the string intelligently by choosing a
> substring that retains character boundaries but also fits within a limited
> number of bytes.
> 
> For example, let's say my byte limit is 5 bytes.
> 
> In my example above, I can only take the first 3 bytes, yielding the first
> character. If I try to take the next character, I will have used 6 bytes
> (each character in my example uses 3 bytes in UTF-8) and gone over my
> byte budget.
> 
> So, in order to solve this, I figured that I would need to be able to
> convert UTF-8 to code points and code points back to UTF-8.
> 
> The problem is that I can't figure out how to convert UTF-8 to code points.
> Everything that I have looked at covers code point to UTF-8 conversion, but I
> have not found the inverse function, which converts UTF-8 to code points.
> 
> Am I going about this all wrong? I've read the unicode page pretty
> extensively and I couldn't find anything. I also came across EEP10 but it
> does not look like it has been completely implemented (notably
> unicode:utf8_to_list is missing). I am using R13B03.
> 
> Thanks in advance,
> 
> --b