[erlang-bugs] Re: UTF8 string handling in different erlang:*** functions

Nico Kruber kruber@REDACTED
Wed Mar 30 12:04:58 CEST 2011


On Wednesday 30 March 2011 11:47:28 Patrik Nyblom wrote:
> Hi!
> 
> To properly measure this, one has to bear in mind that
> erlang:term_to_binary(<constant expression>) gets evaluated at compile
> time, while unicode:characters_to_binary(<constant expression>) does not.

That's what I was thinking too, but I haven't had time to work around it yet.

> Using this program:
> -------------------
> t2bfun() ->
>      fun(X) -> erlang:term_to_binary(X) end.
> c2bfun() ->
>      fun(X) -> unicode:characters_to_binary(X,unicode) end.
> 
> iter(Count, F, String, Tag) ->
>      {_,Red0} = erlang:process_info(self(),reductions),
>      F(String),
>      {_,Red1} =  erlang:process_info(self(),reductions),
>      io:format("Reductions for one call: ~w~n",[Red1 - Red0]),
>      Start = erlang:now(),
>      iter_inner(Count, F, String),
>      Stop = erlang:now(),
>      ElapsedTime = timer:now_diff(Stop, Start) / 1000000.0,
>      Frequency = Count / ElapsedTime,
>      ct:pal("~p iterations of ~p took ~ps: ~p1/s~n",
>             [Count, Tag, ElapsedTime, Frequency]),
>      ok.
> 
> iter_inner(0, _,_) ->
>      ok;
> iter_inner(N, F, String) ->
>      F(String),
>      iter_inner(N - 1, F, String).
> ------------------
> doing:
> ------------------
> 26>
> StringWUnicode="jklöfdsakföädskfdöäsaadakfdöäsakfdöäsjklöfdsakföädskfdöäsaa
> dakfdöäsakfdöäsjklöfdsakföädskfdöäsaadakfdöäsakfdöäsjklöfdsakföädskfdöäsaad
> akfdöäsakfdöäsjklöfdsakföädskfdöäsaadakfdöäsakfdöäsjklöfdsakföädskfdöäsaada
> kfdöäsakfdöäs".
> "jklöfdsakföädskfdöäsaadakfdöäsakfdöäsjklöfdsakföädskfdöäsaadakfdöäsakfdöä
> sjklöfdsakföädskfdöäsaadakfdöäsakfdöäsjklöfdsakföädskfdöäsaadakfdöäsakfdöäs
> jklöfdsakföädskfdöäsaadakfdöäsakfdöäsjklöfdsakföädskfdöäsaadakfdöäsakfdöäs"
> 27> t:iter(1000000,t:c2bfun(),StringWUnicode,c2b).
> Reductions for one call: 30
> ----------------------------------------------------
> 2011-03-30 11:36:25.808
> 1000000 iterations of c2b took 3.280548s: 304827.120346966431/s
> 
> 
> ok
> 28> t:iter(1000000,t:t2bfun(),StringWUnicode,t2b).
> Reductions for one call: 4
> ----------------------------------------------------
> 2011-03-30 11:36:38.837
> 1000000 iterations of t2b took 1.72605s: 579357.49254077231/s
> 
> 
> ok
> 29> KostisString="some medium sized string here".
> "some medium sized string here"
> 30> t:iter(1000000,t:c2bfun(),KostisString,c2b).
> Reductions for one call: 6
> ----------------------------------------------------
> 2011-03-30 11:37:34.952
> 1000000 iterations of c2b took 0.543842s: 1838769.34845046891/s
> 
> 
> ok
> 31> t:iter(1000000,t:t2bfun(),KostisString,t2b).
> Reductions for one call: 4
> ----------------------------------------------------
> 2011-03-30 11:37:41.658
> 1000000 iterations of t2b took 0.362457s: 2758947.9579646691/s
> 
> 
> ok
> -----------------------
> - You get more correct measurements, showing a 2 to 3 speedup using
> term_to_binary.
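
For reference, the quoted measurement loop can also be written with timer:tc/1 instead of erlang:now/0. This is only a minimal sketch, not the program used for the numbers in this thread; the module name bench_sketch and the function per_second/3 are made up for illustration. As in the quoted code, the function under test is passed in as a fun, so the call cannot be evaluated at compile time.

```erlang
-module(bench_sketch).
-export([per_second/3]).

%% Run F(Arg) Count times and return the number of calls per second.
%% F arrives as a fun value, so the compiler cannot fold the call
%% into a constant even when Arg is a literal at the call site.
per_second(Count, F, Arg) when is_integer(Count), Count > 0 ->
    %% timer:tc/1 returns {ElapsedMicroseconds, Result}.
    {Micros, ok} = timer:tc(fun() -> loop(Count, F, Arg) end),
    %% Guard against a 0 microsecond reading for very short runs.
    Count / (max(Micros, 1) / 1000000).

loop(0, _F, _Arg) ->
    ok;
loop(N, F, Arg) ->
    F(Arg),
    loop(N - 1, F, Arg).
```

Usage would look like bench_sketch:per_second(1000000, fun erlang:term_to_binary/1, String).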

-----------------------
Using these tests, I get a similar difference of around a factor of 2 (the
second run below calls t:t2bfun() but was left tagged c2b):
5> String2 = "qwertzuiopasdfghjklyxcvbnm" ++ 
[246,252,228,87,224,103,114,97,105,110,32,40,87,229,103,114,335,227,41].
[113,119,101,114,116,122,117,105,111,112,97,115,100,102,103,
 104,106,107,108,121,120,99,118,98,110,109,246,252,228|...]
6>  t:iter(1000000,t:c2bfun(),String2,c2b).                                                                          
Reductions for one call: 8
----------------------------------------------------
2011-03-30 11:56:05.669
1000000 iterations of c2b took 0.701959s: 1424584.6267374591/s


ok
7> 
7>  t:iter(1000000,t:t2bfun(),String2,c2b). 
Reductions for one call: 4
----------------------------------------------------
2011-03-30 11:56:14.630
1000000 iterations of c2b took 1.296981s: 771021.31796842061/s


ok
-----------------------

(I had to add a character larger than 255 manually as öäü are all below 256 
(246, 228, 252) - at least on my platform)
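
The extra character matters for the encoding: as far as I understand the external term format, term_to_binary/1 uses the compact STRING_EXT encoding (tag 107) only for lists of integers in the range 0..255, and falls back to LIST_EXT (tag 108) otherwise. That may be part of why the timings differ for String2. A minimal sketch to check this (the module name enc_demo and its functions are made up for illustration):

```erlang
-module(enc_demo).
-export([tag/1, demo/0]).

%% Return the tag byte that follows the version byte (131) in the
%% external term format produced by term_to_binary/1.
tag(Term) ->
    <<131, Tag, _/binary>> = erlang:term_to_binary(Term),
    Tag.

demo() ->
    Latin1 = [246, 228, 252],   % the code points of "öäü", all below 256
    Wide = Latin1 ++ [335],     % one code point above 255
    107 = tag(Latin1),          % STRING_EXT: one byte per element
    108 = tag(Wide),            % LIST_EXT: every element encoded as an integer
    %% characters_to_binary/2 produces UTF-8, two bytes for each of ö, ä, ü.
    <<195,182,195,164,195,188>> = unicode:characters_to_binary(Latin1, unicode),
    ok.
```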

> The reasons are many:
> 1) unicode:characters_to_binary is a well behaved bif consuming
> reductions, which also means that it has to be more elaborate when
> allocating, because it may be interrupted. This is more of a problem in
> the ancient erlang:term_to_binary bif than one in the unicode bif.
> 2) unicode:characters_to_binary does more elaborate range checking, it
> only allows *valid* unicode characters, as described in the standard.
> 3) unicode:characters_to_binary may need some optimization, but using
> gprof, I find no really low hanging fruit.
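
The range checking in point 2 can be seen with a small sketch like the following (the module name range_demo is made up for illustration). It assumes that a UTF-16 surrogate such as 16#D800 counts as an invalid code point, which is how I read the unicode module documentation; term_to_binary/1 does not care and round-trips the same list.

```erlang
-module(range_demo).
-export([demo/0]).

demo() ->
    Bad = "abc" ++ [16#D800],   % 16#D800 is a surrogate, not a valid character
    %% characters_to_binary/2 converts what it can and then reports an
    %% error together with the part of the input it could not convert.
    {error, <<"abc">>, _Rest} = unicode:characters_to_binary(Bad, unicode),
    %% term_to_binary/1 treats the list as plain integers, no validation.
    Bad = erlang:binary_to_term(erlang:term_to_binary(Bad)),
    ok.
```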

> They are both blazingly fast, so unless you plan to do huge amounts of md5
> calculations, my humble opinion is that you should use the one that suits
> your problem.

No, I'm perfectly fine with unicode:characters_to_binary (if the speedup is
only around a factor of 2).

Nico