[erlang-questions] Unicode question

Sat Mar 13 19:46:13 CET 2010

Thanks everyone,

I was getting fundamentally confused about binary vs list in the input
variables. I was doing the following:

1> L = [ 16#e4, 16#bd, 16#a0, 16#e5, 16#a5, 16#bd, 16#e2, 16#98, 16#ba ].
[228,189,160,229,165,189,226,152,186]
2> unicode:characters_to_list(L).
[228,189,160,229,165,189,226,152,186]

instead of the correct version:

1> L = [ 16#e4, 16#bd, 16#a0, 16#e5, 16#a5, 16#bd, 16#e2, 16#98, 16#ba ].
[228,189,160,229,165,189,226,152,186]
2> unicode:characters_to_list(list_to_binary(L)).
[20320,22909,9786]

I think the fundamental takeaway is that binaries are encodings (e.g. UTF-8,
UTF-16, UTF-32) and that lists are code points (e.g. U+4F60, U+597D).
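
To make that concrete, here is a small sketch that encodes the same list of
code points in each of the three encodings. The module name enc_demo and the
function encodings/1 are made up for illustration; the only library call is
unicode:characters_to_binary/3, where utf16 and utf32 mean big-endian.

```erlang
-module(enc_demo).
-export([encodings/1]).

%% Hypothetical helper: encode one list of code points as UTF-8, UTF-16
%% and UTF-32 binaries. The list is the same in all three cases; only the
%% bytes of the resulting binaries differ.
encodings(CodePoints) ->
    [{Enc, unicode:characters_to_binary(CodePoints, unicode, Enc)}
     || Enc <- [utf8, utf16, utf32]].
```

For the single code point U+4F60, enc_demo:encodings([20320]) gives
<<228,189,160>> for utf8, <<79,96>> for utf16 and <<0,0,79,96>> for utf32.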

Thanks for everyone's help on this.

--b

On Sat, Mar 13, 2010 at 8:33 AM, Dominic Williams <
erlang-dated-1268930029.1d9556@REDACTED> wrote:

> Hi Brian,
>
> Below is how to use the unicode module to convert to and from Unicode
> code points:
>
> 1> L = [ 16#e4, 16#bd, 16#a0, 16#e5, 16#a5, 16#bd, 16#e2, 16#98, 16#ba ].
> [228,189,160,229,165,189,226,152,186]
> 2> unicode:characters_to_list(list_to_binary(L)).
> [20320,22909,9786]
> 3> unicode:characters_to_binary([20320,22909,9786]).
> <<228,189,160,229,165,189,226,152,186>>
> 4> binary_to_list(unicode:characters_to_binary([20320,22909,9786])).
> [228,189,160,229,165,189,226,152,186]
>
> Regards,
>
> Dominic Williams
> http://dominicwilliams.net
>
> On 13 March 2010 at 04:07, Brian Acton wrote:
>
> > Hi guys,
> >
> > I've tried to shorten my problem into a simple sub-problem that,
> > hopefully, someone can provide some insight into.
> >
> > Suppose, I am given the following utf-8 encoded input string:
> > [ 16#e4, 16#bd, 16#a0, 16#e5, 16#a5, 16#bd, 16#e2, 16#98, 16#ba ]
> >
> > Which translates into three unicode code points:
> > [20320,22909,9786]
> >
> > Now, I would like to shorten the string intelligently by choosing a
> > substring that retains character boundaries but also fits within a
> > limited number of bytes.
> >
> > For example, let's say my byte limit is 5 bytes.
> >
> > In my example above, I can only take the first 3 bytes yielding the first
> > character. If I try to take the next character, I will have used 6 bytes
> > (each character in my example uses 3 bytes in UTF-8) and gone over my
> > budgeted byte allocation.
> >
> > So, in order to solve this, I figured that I would need to be able to
> > convert utf-8 to code points and from code points to utf-8.
> >
> > The problem is that I can't figure out how to convert utf-8 to code
> > points. Everything that I have looked at yields code point to utf-8
> > conversion, but I have not found the inverse function which converts
> > utf-8 to code points.
> >
> > Am I going about this all wrong? I've read the unicode page pretty
> > extensively and I couldn't find anything. I also came across EEP10 but it
> > does not look like it has been completely implemented (notably
> > unicode:utf8_to_list is missing). I am using R13B03.
> >
> > Thanks in advance,
> >
> > --b
>
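
The byte-budget truncation described in the quoted question can be sketched
with the bit syntax alone, by walking the binary one code point at a time and
stopping before the budget is exceeded. This is only a sketch under some
assumptions: the module name utf8_trunc and the function truncate/2 are made
up for illustration, the input is assumed to be a UTF-8 binary (not a list of
bytes), and it relies on the utf8 segment type being available in the release
in use. Invalid UTF-8 simply ends the walk at the last good boundary.

```erlang
-module(utf8_trunc).
-export([truncate/2]).

%% Hypothetical helper: return the longest prefix of the UTF-8 binary Bin
%% that fits in MaxBytes bytes without splitting a character.
truncate(Bin, MaxBytes)
  when is_binary(Bin), is_integer(MaxBytes), MaxBytes >= 0 ->
    Len = prefix_len(Bin, MaxBytes, 0),
    <<Prefix:Len/binary, _/binary>> = Bin,
    Prefix.

%% Match one code point at a time; Acc is the number of bytes accepted
%% so far. The size of the matched character is the difference between
%% the sizes of the binary before and after the match.
prefix_len(<<_/utf8, Rest/binary>> = Bin, MaxBytes, Acc) ->
    Size = byte_size(Bin) - byte_size(Rest),
    case Acc + Size =< MaxBytes of
        true  -> prefix_len(Rest, MaxBytes, Acc + Size);
        false -> Acc
    end;
prefix_len(_Other, _MaxBytes, Acc) ->
    %% End of input, or bytes that are not valid UTF-8: stop here.
    Acc.
```

With the example input and a 5-byte limit,
utf8_trunc:truncate(<<228,189,160,229,165,189,226,152,186>>, 5) returns
<<228,189,160>>, and unicode:characters_to_list/1 on that result gives
[20320]. If the input arrives as a list of bytes, list_to_binary/1 has to be
applied first, as in the sessions above.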