[erlang-bugs] Re: UTF8 string handling in different erlang:*** functions

Bob Ippolito bob@REDACTED
Tue Mar 29 15:17:03 CEST 2011


On Tue, Mar 29, 2011 at 9:06 AM, Nico Kruber <kruber@REDACTED> wrote:
> is it possible that UTF8 strings are not supported by both
> erlang:md5/1 and
> erlang:list_to_binary/1 (and possibly more?)
>
> I'm getting a bad argument exception when running the following:
>
>> erlang:md5("Wàgrain (Wågrŏã)").
> ** exception error: bad argument
>     in function  erlang:md5/1
>        called as
> erlang:md5([87,224,103,114,97,105,110,32,40,87,229,103,114,335,227,
>                              41])
>
> even simpler, one can call:
>> erlang:md5([256]).
> ** exception error: bad argument
>     in function  erlang:md5/1
>        called as erlang:md5([256])
>
>
> for characters larger than 255, this exception is thrown. same for
> erlang:list_to_binary/1.
>
> Both state that the input should be an iodata() or iolist() which are defined
> as:
>
> iodata() = iolist() | binary()
> iolist() = [char() | binary() | iolist()]
> %  a binary is allowed as the tail of the list
>
> And according to
> http://www.erlang.org/doc/reference_manual/typespec.html
> a character is any valid integer between 0 and 16#10ffff and it should be this
> way since erlang strings are unicode strings.
>
> If this is correct behaviour, then how do I hash a unicode string without
> using erlang:term_to_binary/1 (which is possibly costly and should be
> unnecessary).

What you have is a list of Unicode code points, not UTF8: UTF8 is an
encoding defined over bytes (0..255). IIRC, the actual definition of
iolist should be
maybe_improper_list(byte() | binary() | iolist(), binary() | []).
Functions like erlang:list_to_binary/1 and erlang:md5/1 also only make
sense over bytes, so any integer above 255 in the list is a bad
argument.

You can convert a list of unicode code points (L) to UTF8 with
unicode:characters_to_binary(L, utf8) and then pass the resulting
binary to erlang:md5/1.
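
Something like the following should do it. This is only a sketch: the
module name utf8_md5 and the function md5_of_string/1 are made up for
illustration, and turning an encoding failure into badarg is my own
assumption about what you would want.

```erlang
-module(utf8_md5).
-export([md5_of_string/1]).

%% Hypothetical helper: encode a list of Unicode code points as UTF8
%% bytes, then hash those bytes with erlang:md5/1.
md5_of_string(Chars) ->
    case unicode:characters_to_binary(Chars, utf8) of
        Bin when is_binary(Bin) ->
            %% Every element is now a byte, so md5/1 accepts it.
            erlang:md5(Bin);
        {error, _Encoded, _Rest} ->
            %% Invalid code point in the input (assumed handling).
            erlang:error(badarg);
        {incomplete, _Encoded, _Rest} ->
            %% Truncated UTF8 binary inside the input (assumed handling).
            erlang:error(badarg)
    end.
```

With that, utf8_md5:md5_of_string([87,224,103,114,97,105,110,32,40,87,
229,103,114,335,227,41]) returns a 16-byte digest instead of raising,
because the code point 335 is encoded as two bytes before hashing.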

-bob


