[erlang-questions] Unicode question

Sat Mar 13 20:41:23 CET 2010

On Fri, Mar 12, 2010 at 7:07 PM, Brian Acton <acton@REDACTED> wrote:
> I've tried to reduce my problem to a simple sub-problem that, hopefully,
> someone can provide some insight into.
>
> Suppose I am given the following UTF-8 encoded input string:
> [ 16#e4, 16#bd, 16#a0, 16#e5, 16#a5, 16#bd, 16#e2, 16#98, 16#ba ]
>
> which translates into three Unicode code points:
> [20320,22909,9786]
>
> Now, I would like to shorten the string intelligently by choosing a
> substring that retains character boundaries but also fits within a limited
> number of bytes.

I have a partial solution to your problem that I just now committed to
the mochiweb repository.

http://mochiweb.googlecode.com/svn/trunk/src/mochiutf8.erl

Essentially, given a binary, it returns only the bytes that form
valid UTF-8 sequences.

Using this code, you could take an arbitrary (presumed UTF-8) binary,
chop it at any point, and after going through
mochiutf8:valid_utf8_bytes/1 you'll have a valid UTF-8 binary of zero
or more characters.
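
To make the "chop, then clean up" idea concrete, here is a minimal
standalone sketch. It is not the code in mochiutf8.erl: the module name
utf8_truncate and the function truncate/2 are made up for illustration,
and it assumes the input is already valid UTF-8 (it only avoids cutting
a character in half, it does not detect or drop invalid bytes). It backs
the cut point up while the first byte after the cut is a UTF-8
continuation byte (2#10xxxxxx).

```erlang
-module(utf8_truncate).
-export([truncate/2]).

%% Hypothetical helper, not part of mochiweb or OTP.
%% Return the longest prefix of Bin that is at most MaxBytes long and
%% does not end in the middle of a UTF-8 character.
%% Assumes Bin is valid UTF-8; invalid input is not detected here.
truncate(Bin, MaxBytes)
  when is_binary(Bin), is_integer(MaxBytes), MaxBytes >= 0 ->
    case byte_size(Bin) =< MaxBytes of
        true  -> Bin;                 % already fits, nothing to cut
        false -> cut(Bin, MaxBytes)
    end.

%% Move the cut point left while the byte just after it is a
%% continuation byte, i.e. while the cut would split a character.
cut(_Bin, 0) ->
    <<>>;
cut(Bin, N) ->
    case Bin of
        <<_:N/binary, 2#10:2, _:6, _/binary>> ->
            cut(Bin, N - 1);
        <<Prefix:N/binary, _/binary>> ->
            Prefix
    end.
```

With the nine-byte string from the original question, a limit of 7 or 8
bytes keeps the first two characters (6 bytes), a limit of 3 keeps one
character, and a limit of 2 yields an empty binary.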

The reason why this doesn't use any standard library functions is
because I couldn't find any exported functionality that lets you do
anything about invalid data, so I had to write my own.

-bob