[erlang-bugs] Re: UTF8 string handling in different erlang:*** functions
Nico Kruber
kruber@REDACTED
Wed Mar 30 12:04:58 CEST 2011
On Wednesday 30 March 2011 11:47:28 Patrik Nyblom wrote:
> Hi!
>
> To properly measure this, one has to bear in mind that
> erlang:term_to_binary(<constant expression>) gets evaluated at compile
> time, while unicode:characters_to_binary(<constant expression>) does not.
That's what I was thinking, too, but I haven't had time to work around it yet.
> Using this program:
> -------------------
> -module(t).
> -export([t2bfun/0, c2bfun/0, iter/4]).
>
> % Wrapping the calls in funs keeps the argument from being a constant
> % expression, so nothing is evaluated at compile time.
> t2bfun() ->
>     fun(X) -> erlang:term_to_binary(X) end.
>
> c2bfun() ->
>     fun(X) -> unicode:characters_to_binary(X,unicode) end.
>
> iter(Count, F, String, Tag) ->
>     % Reductions consumed by a single call
>     {_,Red0} = erlang:process_info(self(),reductions),
>     F(String),
>     {_,Red1} = erlang:process_info(self(),reductions),
>     io:format("Reductions for one call: ~w~n",[Red1 - Red0]),
>     % Wall-clock time for Count calls
>     Start = erlang:now(),
>     iter_inner(Count, F, String),
>     Stop = erlang:now(),
>     ElapsedTime = timer:now_diff(Stop, Start) / 1000000.0,
>     Frequency = Count / ElapsedTime,
>     ct:pal("~p iterations of ~p took ~ps: ~p1/s~n",
>            [Count, Tag, ElapsedTime, Frequency]),
>     ok.
>
> iter_inner(0, _,_) ->
>     ok;
> iter_inner(N, F, String) ->
>     F(String),
>     iter_inner(N - 1, F, String).
> ------------------
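The timing part of the program above relies on erlang:now(), which has been deprecated since OTP 18. On a current release the same loop can be timed with erlang:monotonic_time/1. This is only a minimal sketch: the module name t_mono and the function time_calls/3 are made up for illustration and are not part of the original program.

```erlang
-module(t_mono).
-export([time_calls/3]).

%% Call F(Arg) Count times and return {ElapsedSeconds, CallsPerSecond}.
%% Hypothetical helper, not taken from the original test module.
time_calls(Count, F, Arg) when is_integer(Count), Count > 0 ->
    Start = erlang:monotonic_time(microsecond),
    loop(Count, F, Arg),
    Stop = erlang:monotonic_time(microsecond),
    %% Clamp to at least 1 microsecond so the division below is always defined.
    Elapsed = max(Stop - Start, 1) / 1000000,
    {Elapsed, Count / Elapsed}.

loop(0, _F, _Arg) ->
    ok;
loop(N, F, Arg) ->
    F(Arg),
    loop(N - 1, F, Arg).
```

It would be called like the original, for example t_mono:time_calls(1000000, t:c2bfun(), StringWUnicode). The function is still passed in as a fun, so the compile-time evaluation issue mentioned above does not come back.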
> doing:
> ------------------
> 26>
> StringWUnicode="jklöfdsakföädskfdöäsaadakfdöäsakfdöäsjklöfdsakföädskfdöäsaa
> dakfdöäsakfdöäsjklöfdsakföädskfdöäsaadakfdöäsakfdöäsjklöfdsakföädskfdöäsaad
> akfdöäsakfdöäsjklöfdsakföädskfdöäsaadakfdöäsakfdöäsjklöfdsakföädskfdöäsaada
> kfdöäsakfdöäs".
> "jklöfdsakföädskfdöäsaadakfdöäsakfdöäsjklöfdsakföädskfdöäsaadakfdöäsakfdöä
> sjklöfdsakföädskfdöäsaadakfdöäsakfdöäsjklöfdsakföädskfdöäsaadakfdöäsakfdöäs
> jklöfdsakföädskfdöäsaadakfdöäsakfdöäsjklöfdsakföädskfdöäsaadakfdöäsakfdöäs"
> 27> t:iter(1000000,t:c2bfun(),StringWUnicode,c2b).
> Reductions for one call: 30
> ----------------------------------------------------
> 2011-03-30 11:36:25.808
> 1000000 iterations of c2b took 3.280548s: 304827.120346966431/s
>
>
> ok
> 28> t:iter(1000000,t:t2bfun(),StringWUnicode,t2b).
> Reductions for one call: 4
> ----------------------------------------------------
> 2011-03-30 11:36:38.837
> 1000000 iterations of t2b took 1.72605s: 579357.49254077231/s
>
>
> ok
> 29> KostisString="some medium sized string here".
> "some medium sized string here"
> 30> t:iter(1000000,t:c2bfun(),KostisString,c2b).
> Reductions for one call: 6
> ----------------------------------------------------
> 2011-03-30 11:37:34.952
> 1000000 iterations of c2b took 0.543842s: 1838769.34845046891/s
>
>
> ok
> 31> t:iter(1000000,t:t2bfun(),KostisString,t2b).
> Reductions for one call: 4
> ----------------------------------------------------
> 2011-03-30 11:37:41.658
> 1000000 iterations of t2b took 0.362457s: 2758947.9579646691/s
>
>
> ok
> -----------------------
> - This gives more accurate measurements, showing a speedup by a factor
> of 2 to 3 using term_to_binary.
-----------------------
Using these tests, I get a similar result, a speedup of around a factor of 2:
5> String2 = "qwertzuiopasdfghjklyxcvbnm" ++
[246,252,228,87,224,103,114,97,105,110,32,40,87,229,103,114,335,227,41].
[113,119,101,114,116,122,117,105,111,112,97,115,100,102,103,
104,106,107,108,121,120,99,118,98,110,109,246,252,228|...]
6> t:iter(1000000,t:c2bfun(),String2,c2b).
Reductions for one call: 8
----------------------------------------------------
2011-03-30 11:56:05.669
1000000 iterations of c2b took 0.701959s: 1424584.6267374591/s
ok
7>
7> t:iter(1000000,t:t2bfun(),String2,c2b).
Reductions for one call: 4
----------------------------------------------------
2011-03-30 11:56:14.630
1000000 iterations of c2b took 1.296981s: 771021.31796842061/s
ok
-----------------------
(I had to add a character larger than 255 manually, as ö, ä and ü are all below
256 (246, 228, 252), at least on my platform.)
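Whether a string contains a character above 255 also changes what the two functions produce, which can be checked directly. A minimal sketch, assuming the tags documented for the external term format (107 for STRING_EXT, 108 for LIST_EXT); the module name enc_demo and its function names are made up for illustration:

```erlang
-module(enc_demo).
-export([tag/1, utf8/1]).

%% Return the external-term-format type tag that term_to_binary/1 picks
%% for Term. The first byte (131) is the format version, the second byte
%% is the type tag.
tag(Term) ->
    <<131, Tag, _/binary>> = erlang:term_to_binary(Term),
    Tag.

%% Encode a list of Unicode code points as a UTF-8 binary.
utf8(Chars) ->
    unicode:characters_to_binary(Chars, unicode).
```

enc_demo:tag("abc" ++ [246]) returns 107, because every element still fits in one byte, while enc_demo:tag("abc" ++ [335]) returns 108, the general list encoding. enc_demo:utf8([246, 335]) returns <<195,182,197,143>>. The change of representation is one possible reason why timings shift once such a character is present; the sketch itself does not measure that.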
> The reasons are many:
> 1) unicode:characters_to_binary is a well behaved bif consuming
> reductions, which also means that it has to be more elaborate when
> allocating, because it may be interrupted. This is more of a problem in
> the ancient erlang:term_to_binary bif than one in the unicode bif.
> 2) unicode:characters_to_binary does more elaborate range checking, it
> only allows *valid* unicode characters, as described in the standard.
> 3) unicode:characters_to_binary may need some optimization, but using
> gprof, I find no really low hanging fruit.
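The range checking in point 2 is easy to observe from the return value. A minimal sketch (the module name range_demo and the function check/1 are made up for illustration) that classifies the result of unicode:characters_to_binary/2:

```erlang
-module(range_demo).
-export([check/1]).

%% Classify the result of unicode:characters_to_binary/2 for Chars.
%% Hypothetical helper for illustration only.
check(Chars) ->
    case unicode:characters_to_binary(Chars, unicode) of
        Bin when is_binary(Bin) ->
            {ok, Bin};
        {error, _Converted, _Rest} ->
            %% Input contained something that is not a valid code point.
            invalid;
        {incomplete, _Converted, _Rest} ->
            %% Input ended in the middle of an encoded character.
            incomplete
    end.
```

range_demo:check("abc") returns {ok,<<"abc">>}, while range_demo:check([16#D800]) (a UTF-16 surrogate) and range_demo:check([16#110000]) (beyond the last code point) both return invalid. erlang:term_to_binary/1 accepts the same lists without complaint, since it serializes arbitrary terms and does no Unicode validation.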
> They are both blazingly fast, so unless you plan to do huge amounts of md5
> calculations, my humble opinion is that you should use the one that suits
> your problem.
No, I'm perfectly fine with unicode:characters_to_binary (if the speedup is only
a factor of 2).
Nico