[erlang-bugs] Re: UTF8 string handling in different erlang:*** functions

Tue Mar 29 16:29:38 CEST 2011

On Tue, Mar 29, 2011 at 10:10 AM, Nico Kruber <kruber@REDACTED> wrote:
> On Tuesday 29 March 2011 15:17:03 Bob Ippolito wrote:
>> On Tue, Mar 29, 2011 at 9:06 AM, Nico Kruber <kruber@REDACTED> wrote:
>> > is it possible that UTF8 strings are not supported by both
>> > erlang:md5/1 and
>> > erlang:list_to_binary/1 (and possibly more?)
>> >
>> > I'm getting a bad argument exception when running the following:
>> >> erlang:md5("Wàgrain (Wågrŏã)").
>> >
>> > ** exception error: bad argument
>> >     in function  erlang:md5/1
>> >        called as
>> > erlang:md5([87,224,103,114,97,105,110,32,40,87,229,103,114,335,227,
>> >                              41])
>> >
>> > even simpler, one can call:
>> >> erlang:md5([256]).
>> >
>> > ** exception error: bad argument
>> >     in function  erlang:md5/1
>> >        called as erlang:md5([256])
>> >
>> >
>> > for characters larger than 255, this exception is thrown. same for
>> > erlang:list_to_binary/1.
>> >
>> > Both state that the input should be an iodata() or iolist() which are
>> > defined as:
>> >
>> > iodata() = iolist() | binary()
>> > iolist() = [char() | binary() | iolist()]
>> > %  a binary is allowed as the tail of the list
>> >
>> > And according to
>> > http://www.erlang.org/doc/reference_manual/typespec.html
>> > a character is any valid integer between 0 and 16#10ffff and it should be
>> > this way since erlang strings are unicode strings.
>> >
>> > If this is correct behaviour, then how do I hash a unicode string without
>> > using erlang:term_to_binary/1 (which is possibly costly and should be
>> > unnecessary).
>>
>> What you have is not UTF8, because UTF8 is defined over bytes
>> (0..255).
>
> oh, right - this was maybe misleading, I should have rather said "erlang
> string"
>
>> IIRC, the actual definition of iolist should be
>> maybe_improper_list(byte() | binary() | iolist(), binary()). Functions
>> like erlang:list_to_binary/1 and erlang:md5/1 also only make sense
>> over bytes.
>
> ok, makes sense, although it is rather inconvenient not being able to hash
> strings :(

The real lesson here is "do not use erlang strings". Binaries in UTF8
are better for most use cases that I've come across in the past few
years. A bit uglier in the source, but the memory and performance
benefits make it worthwhile.

>> You can convert a list of unicode code points (L) to UTF8 with
>> unicode:characters_to_binary(L, utf8).
>
> ok, thanks for the tip - FYI, I ran a simple benchmark executing
> unicode:characters_to_binary/1 and erlang:term_to_binary/1 a million times
> with the same string which resulted in the following:
>
>> 1000000 iterations of "erlang:term_to_binary/1" took 0.02946s:
> 33944331.2966734541/s
>> 1000000 iterations of "unicode:characters_to_binary/1" took 0.667519s:
> 1498084.69871269591/s
>
> -> looks like I should choose erlang:term_to_binary/1 since at least on my
> machine it is more than 20 times as fast.

I guess it depends on whether you care what the result is... these
operations are completely different, and there's not even any
guarantee that erlang:term_to_binary/1 is always going to return the
same output for a given input... there is more than one possible
representation for a string in external term format, and the spec does
not guarantee that the implementation will do it any particular way.

-bob
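
A minimal sketch of the approach suggested in this thread: encode the
code-point list as UTF-8 first, then hash the resulting bytes. The module
name hash_string and the function md5_of_string/1 are hypothetical names
chosen for illustration; only unicode:characters_to_binary/3 and
erlang:md5/1 are existing OTP functions. Raising badarg on invalid input
is an assumption made here, not something the thread prescribes.

```erlang
-module(hash_string).
-export([md5_of_string/1]).

%% Hash an Erlang string (a list of Unicode code points).
%% erlang:md5/1 only accepts bytes (0..255), so the string is first
%% encoded as a UTF-8 binary.
md5_of_string(Str) when is_list(Str) ->
    case unicode:characters_to_binary(Str, unicode, utf8) of
        Bin when is_binary(Bin) ->
            erlang:md5(Bin);
        %% Assumption: treat invalid or truncated input as a bad argument.
        {error, _Encoded, _Rest} ->
            erlang:error(badarg, [Str]);
        {incomplete, _Encoded, _Rest} ->
            erlang:error(badarg, [Str])
    end.
```

With this, hash_string:md5_of_string([256]) returns a 16-byte digest
instead of raising an exception, because the code point 256 is encoded as
the two bytes <<196,128>> before hashing. Unlike erlang:term_to_binary/1,
the UTF-8 encoding of a given string is fixed, so the digest is stable
across releases.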