Unicode question

Brian Acton acton@REDACTED
Sat Mar 13 04:07:03 CET 2010


Hi guys,

I've tried to reduce my problem to a simple subproblem that, hopefully,
someone can provide some insight into.

Suppose I am given the following UTF-8 encoded input string:
[ 16#e4, 16#bd, 16#a0, 16#e5, 16#a5, 16#bd, 16#e2, 16#98, 16#ba ]

Which translates into three unicode code points:
[20320,22909,9786]
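
For reference, the mapping above can be checked with a minimal sketch like
the one below. It assumes that unicode:characters_to_list/2 is available in
the release at hand; the module and function names are hypothetical and
chosen only for illustration.

```erlang
-module(utf8_decode_sketch).
-export([example/0]).

%% Decode the UTF-8 bytes from the example into a list of code points.
example() ->
    Bytes = [16#e4, 16#bd, 16#a0, 16#e5, 16#a5, 16#bd, 16#e2, 16#98, 16#ba],
    %% The second argument names the encoding of the *input* binary.
    %% On invalid or truncated input this call returns an
    %% {error, ...} or {incomplete, ...} tuple instead of a list.
    unicode:characters_to_list(list_to_binary(Bytes), utf8).
```

For the bytes above, example() should return [20320,22909,9786].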

Now, I would like to shorten the string intelligently by choosing a
substring that retains character boundaries but also fits within a limited
number of bytes.

For example, let's say my byte limit is 5 bytes.

In my example above, I can only take the first 3 bytes yielding the first
character. If I try to take the next character, I will have used 6 bytes
(each character in my example uses 3 bytes utf-8) and gone over my budgeted
byte allocation.
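
The truncation described above could be sketched roughly as follows. This is
only one possible approach, not a definitive implementation: it assumes the
/utf8 segment type of the bit syntax is available in the release at hand, and
the module and function names are hypothetical.

```erlang
-module(utf8_truncate_sketch).
-export([truncate/2]).

%% Return the longest prefix of Bin that is at most MaxBytes long and
%% ends on a UTF-8 character boundary.
truncate(Bin, MaxBytes)
  when is_binary(Bin), is_integer(MaxBytes), MaxBytes >= 0 ->
    Len = prefix_len(Bin, MaxBytes, 0),
    <<Prefix:Len/binary, _/binary>> = Bin,
    Prefix.

%% Walk the binary one character at a time, counting the bytes used so far.
prefix_len(<<Char/utf8, Rest/binary>>, MaxBytes, Used) ->
    %% Re-encode the code point to find out how many bytes it occupies.
    Size = byte_size(<<Char/utf8>>),
    case Used + Size =< MaxBytes of
        true  -> prefix_len(Rest, MaxBytes, Used + Size);
        false -> Used
    end;
%% End of input, or a byte sequence that is not valid UTF-8: stop here.
prefix_len(_EmptyOrInvalid, _MaxBytes, Used) ->
    Used.
```

With the 9-byte example and a limit of 5, truncate/2 should return the first
3 bytes, <<16#e4,16#bd,16#a0>>. With a limit of 6 it should return the first
6 bytes. Note that this sketch stops at the first invalid sequence rather
than reporting an error, which may or may not be the desired behaviour.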

So, in order to solve this, I figured that I would need to be able to
convert UTF-8 to code points and from code points back to UTF-8.

The problem is that I can't figure out how to convert utf-8 to code point.
Everything that I have looked at yields code point to utf-8 conversion but I
have not found the inverse function which converts utf-8 to code point.

Am I going about this all wrong? I've read the unicode page pretty
extensively and I couldn't find anything. I also came across EEP10 but it
does not look like it has been completely implemented (notably
unicode:utf8_to_list is missing). I am using R13B03.

Thanks in advance,

--b

