[erlang-questions] Not an Erlang fan

Thomas Lindgren thomasl_erlang@REDACTED
Mon Sep 24 09:47:30 CEST 2007


--- Caoyuan <dcaoyuan@REDACTED> wrote:

> Tim's example is not about io. Reading a whole file into a
> binary is very quick, but even simply traversing a binary
> byte by byte costs a lot of time. I wrote a simple test
> module; please take a look at test2/1, a function that
> simply traverses a binary. When the binary is read from a
> 200M file, traversing it takes about 30s.

> test2(FileName) ->
>     statistics(wall_clock),
>     {ok, Bin} = file:read_file(FileName),
>     Total = travel_bin(Bin),
>     {_, Duration} = statistics(wall_clock),
>     io:format("Duration ~pms~n Total:~B", [Duration, Total]).
> 
> travel_bin(Bin) -> travel_bin(Bin, 0).
> travel_bin(<<>>, ByteCount) -> ByteCount;
> travel_bin(<<_C:1/binary, Rest/binary>>, ByteCount) ->
>     travel_bin(Rest, ByteCount + 1).

I have two observations about travel_bin:

First, there is no need to turn C into a binary. You
can get the byte directly as

  <<C, Rest/binary>>

That seems simpler and avoids building a binary
holding the single byte C, saving a couple of words of
heap allocation per element.

Second, iterating over the binary that way is a common
mistake. When you write <<C, Rest/binary>>, you also
build a new binary Rest for every byte (which is 'just'
a couple of pointers, perhaps 3 words). But that actually
consumes _more_ memory than turning the binary into a
list (2 words per element). So simply turning the binary
into a list (all at once or in stages) and processing
that might be the faster option.
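As a rough sketch of the list-based alternative (the module name
list_walk, the function names, and the ChunkSize parameter are my own
inventions for illustration, and I have not timed this), both the
"all at once" and the "in stages" variants might look like:

```erlang
-module(list_walk).
-export([count_list/1, count_chunked/2]).

%% Convert the whole binary to a list at once, then walk the list.
count_list(Bin) when is_binary(Bin) ->
    count_elems(binary_to_list(Bin), 0).

count_elems([], Count) -> Count;
count_elems([_C | Rest], Count) -> count_elems(Rest, Count + 1).

%% Convert in stages: only ChunkSize bytes are turned into a list
%% at a time, so the list for one chunk can be collected before
%% the next chunk is converted.
count_chunked(Bin, ChunkSize)
  when is_binary(Bin), is_integer(ChunkSize), ChunkSize > 0 ->
    count_chunked(Bin, ChunkSize, 0).

count_chunked(Bin, ChunkSize, Count) ->
    case Bin of
        <<Chunk:ChunkSize/binary, Rest/binary>> ->
            count_chunked(Rest, ChunkSize,
                          count_elems(binary_to_list(Chunk), Count));
        _ ->
            %% Fewer than ChunkSize bytes remain (possibly none).
            count_elems(binary_to_list(Bin), Count)
    end.
```

Whether this actually beats the per-byte binary match would have to be
measured on the file in question.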

One way of speeding things up while keeping the same
program structure is to get more than one byte at a
time, e.g.:

loop(<<C0,C1,C2,C3,C4,C5,C6,C7, Rest/binary>>) ->
   ..., loop(Rest);
loop(Small_bin) -> ...
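To make the outline above concrete, here is one way it could be filled
in for the byte-counting case. This is only a sketch: the module name
unrolled, the function count_bytes/1 and the Count accumulator are
assumptions of mine, not part of the original test module.

```erlang
-module(unrolled).
-export([count_bytes/1]).

count_bytes(Bin) when is_binary(Bin) ->
    loop(Bin, 0).

%% Fast path: consume eight bytes per call, so a new Rest binary is
%% built once per eight bytes instead of once per byte.
loop(<<_C0,_C1,_C2,_C3,_C4,_C5,_C6,_C7, Rest/binary>>, Count) ->
    loop(Rest, Count + 8);
%% Tail: fewer than eight bytes remain, so fall back to one at a time.
loop(<<_C, Rest/binary>>, Count) ->
    loop(Rest, Count + 1);
loop(<<>>, Count) ->
    Count.
```

A real loop would do something with C0..C7 in the first clause; here
they are ignored because only the count is wanted.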

A less pleasant, but more efficient, way of accessing
each element of a binary is this:

count_bin(Bin) ->
    M = 0,
    N = size(Bin),
    count(M, N, Bin, 0).

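count(M, N, Bin, Size) ->
    case Bin of
        %% Skip M bytes, then match the byte at offset M as _C.
        <<_:M/binary, _C, _/binary>> ->
            count(M+1, N, Bin, Size+1);
        %% No byte at offset M: the whole binary has been walked.
        _ ->
            Size
    end.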

(use it like
   {ok, B} = file:read_file("o10k.txt"),
   count_bin(B).
)

On my machine (1*1.7 GHz, 0.5 GB), the above code
walks the 2 MB o10k binary in 236 ms (90 ms when
compiled with HiPE). Granted, this is a silly example,
but it shows that accessing every element of a binary
need not be extremely slow.

(NB: similar code collects the positions of all
newlines $\n in the binary in 251 ms.)
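For illustration, a newline collector in the same offset-based style
could look roughly like the following. This is a reconstruction under
my own assumptions, not the code that produced the timing above; the
module name newlines and the function newline_positions/1 are made up,
and positions are zero-based byte offsets.

```erlang
-module(newlines).
-export([newline_positions/1]).

newline_positions(Bin) when is_binary(Bin) ->
    scan(0, Bin, []).

%% M is the current byte offset; Acc holds the offsets found so far,
%% most recent first.
scan(M, Bin, Acc) ->
    case Bin of
        %% The byte at offset M is a newline: record its position.
        <<_:M/binary, $\n, _/binary>> ->
            scan(M + 1, Bin, [M | Acc]);
        %% Any other byte: skip it.
        <<_:M/binary, _C, _/binary>> ->
            scan(M + 1, Bin, Acc);
        %% Past the end of the binary: return offsets in ascending order.
        _ ->
            lists:reverse(Acc)
    end.
```

As with count_bin, the binary itself is never split, so no Rest binary
is built per byte.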

Best,
Thomas



       