[erlang-bugs] Re: UTF8 string handling in different erlang:*** functions

Tue Mar 29 16:10:28 CEST 2011

On Tuesday 29 March 2011 15:17:03 Bob Ippolito wrote:
> On Tue, Mar 29, 2011 at 9:06 AM, Nico Kruber <kruber@REDACTED> wrote:
> > is it possible that UTF8 strings are not supported by both
> > erlang:md5/1 and
> > erlang:list_to_binary/1 (and possibly more?)
> > 
> > I'm getting a bad argument exception when running the following:
> >> erlang:md5("Wàgrain (Wågrŏã)").
> > 
> > ** exception error: bad argument
> >     in function  erlang:md5/1
> >        called as
> > erlang:md5([87,224,103,114,97,105,110,32,40,87,229,103,114,335,227,
> >                              41])
> > 
> > even simpler, one can call:
> >> erlang:md5([256]).
> > 
> > ** exception error: bad argument
> >     in function  erlang:md5/1
> >        called as erlang:md5([256])
> > 
> > 
> > for characters larger than 255, this exception is thrown. same for
> > erlang:list_to_binary/1.
> > 
> > Both state that the input should be an iodata() or iolist() which are
> > defined as:
> > 
> > iodata() = iolist() | binary()
> > iolist() = [char() | binary() | iolist()]
> > %  a binary is allowed as the tail of the list
> > 
> > And according to
> > http://www.erlang.org/doc/reference_manual/typespec.html
> > a character is any valid integer between 0 and 16#10ffff and it should be
> > this way since erlang strings are unicode strings.
> > 
> > If this is correct behaviour, then how do I hash a unicode string without
> > using erlang:term_to_binary/1 (which is possibly costly and should be
> > unnecessary).
> 
> What you have is not UTF8, because UTF8 is defined over bytes
> (0..255).

Oh, right. That was misleading; I should have said "Erlang string" instead.

> IIRC, the actual definition of iolist should be
> maybe_improper_list(byte() | binary() | iolist(), binary()). Functions
> like erlang:list_to_binary/1 and erlang:md5/1 also only make sense
> over bytes.
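
To illustrate the byte restriction described above, here is a minimal sketch. The module name iolist_bytes and the helper try_md5/1 are hypothetical and not part of the original thread; only erlang:md5/1 is a real BIF.

```erlang
-module(iolist_bytes).
-export([try_md5/1]).

%% Hypothetical helper: wraps erlang:md5/1 so that the badarg raised for
%% integers outside 0..255 is returned as a value instead of an exception.
try_md5(Input) ->
    try erlang:md5(Input) of
        Digest -> {ok, Digest}
    catch
        error:badarg -> {error, badarg}
    end.
```

With this, iolist_bytes:try_md5([255]) should give {ok, Digest}, while iolist_bytes:try_md5([256]) should give {error, badarg}, matching the exception reported earlier in the thread.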

OK, makes sense, although it is rather inconvenient not being able to hash
strings directly :(

> You can convert a list of unicode code points (L) to UTF8 with
> unicode:characters_to_binary(L, utf8).
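
A minimal sketch of that suggestion, assuming the input is a list of Unicode code points. The module name unicode_hash and the function md5_utf8/1 are hypothetical; unicode:characters_to_binary/3 and erlang:md5/1 are the only library calls used.

```erlang
-module(unicode_hash).
-export([md5_utf8/1]).

%% Hypothetical helper: encode the code points as UTF-8 bytes first,
%% then hash the resulting binary.
md5_utf8(Chars) ->
    case unicode:characters_to_binary(Chars, unicode, utf8) of
        Bin when is_binary(Bin) ->
            {ok, erlang:md5(Bin)};
        {error, _Encoded, _Rest} ->
            %% The input contained something that is not a valid code point.
            {error, invalid_unicode};
        {incomplete, _Encoded, _Rest} ->
            %% The input ended in the middle of an encoded character.
            {error, incomplete_unicode}
    end.
```

One design point to keep in mind: hashing the result of erlang:term_to_binary/1 hashes the external term format rather than the UTF-8 bytes of the text, so that digest would most likely not match one computed over the same text by a non-Erlang system.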

OK, thanks for the tip. FYI, I ran a simple benchmark that executes
unicode:characters_to_binary/1 and erlang:term_to_binary/1 a million times
each on the same string, with the following results:

> 1000000 iterations of "erlang:term_to_binary/1" took 0.02946s: 33944331.2966734541/s
> 1000000 iterations of "unicode:characters_to_binary/1" took 0.667519s: 1498084.69871269591/s

-> It looks like I should choose erlang:term_to_binary/1, since at least on
my machine it is more than 20 times as fast.

Nico