[erlang-questions] Not an Erlang fan

Mon Sep 24 18:20:08 CEST 2007

--- Caoyuan <dcaoyuan@REDACTED> wrote:

> On 9/24/07, Thomas Lindgren
> <thomasl_erlang@REDACTED> wrote:
> >
> > --- Caoyuan <dcaoyuan@REDACTED> wrote:
> >
> > > For a 2M data file, test2/1 also cost about
> 300ms in
> > > my machine, but
> > > this is not a good result for 200M file too.
> > >
> > > What I'm wondering is, for Tim's example or a
> lot of
> > > other cases, when
> > > you should process a very big text/binary, you
> have
> > > to travel it in
> > > some way, so, what's the best efficient way?
> > > Currently, all test2/1.
> > > test3/1 or similar code are not efficient
> enough.
> >
> > Hi Caoyuan,
> >
> > It's of course difficult to give a single solution
> > that is best in every instance. And sometimes, the
> > best available solution might just not be fast
> enough.
> > The previous mail is most of the toolbox I
> currently
> > use on the micro level.
> >
> > 1. Try to process larger parts of binaries at once
> to
> > amortize 'shared' work.
> >
> > 2. Use tricks like the "load byte" thing of the
> last
> > example.
> >
> > 3. Sometimes, converting the binary to a list and
> > traversing the list can be the best solution.
> >
> 
> I try to split big binary to smaller pieces than,
> convert them to
> list, here's the code:
> 
> scan(FileName) ->
>     statistics(wall_clock),
>     {ok, Bin} = file:read_file(FileName),
>     {_, Duration1} = statistics(wall_clock),
>     io:format("Duration reading: ~pms~n",
> [Duration1]),
> 
>     {Matched, Total} = split_scan(Bin, 1024 * 1024,
> []),
>     {_, Duration2} = statistics(wall_clock),
>     io:format("Duration ~pms~n Matched:~B,
> Total:~B~n", [Duration2,
> Matched, Total]).
> 
> split_scan(<<>>, _SplitSize, _PreLeft) -> {0, 0};
> split_scan(Bin, SplitSize, PreLeft) ->
>     Size = size(Bin),
>     io:format("splitting: ~B~n", [Size]),
>     {Bin1, Rest} = if Size >= SplitSize ->
>                            <<BinX:SplitSize/binary,
> RestX/binary>> = Bin,
>                            {BinX, RestX};
>                        true ->
>                            {Bin, <<>>}
>                    end,
>     {Matched, Total, Left} =
> scan_line(binary_to_list(Bin1), PreLeft),
>     split_scan(Rest, SplitSize, Left).
> 
> scan_line(Bin, PreLeft) -> scan_line(Bin,
> lists:reverse(PreLeft), 0, 0).
> scan_line([], Line, Matched, Total) -> {Matched,
> Total, Line};
> scan_line([$\n|Rest], Line, Matched, Total) ->
>     Line1 = lists:reverse(Line),
>     %Matched1 = Matched + process_match(Line1),
>     scan_line(Rest, [], Matched, Total + 1);
> scan_line([C|Rest], Line , Matched, Total) ->
>     scan_line(Rest, [C|Line], Matched, Total).
> 
> But the costed time seems not decreased, or, even
> worse.
> 
> > 4. Use the erlang builtins, if applicable. (See
> the
> > erlang module.)
> >
> > For instance, Ulf Wiger pointed to a trick that
> > delivers data in lines at a high rate. Another
> > approach (very hacky) could be to send the data
> via a
> > loopback gen_tcp socket set in 'line' mode, then
> > matching the lines as they arrive.
> >
> > 5. (Write erlang drivers.) This is fairly popular
> for
> > some problems but then you're not doing Erlang
> > programming anymore.
> >
> > On the mesolevel (sorry :-) a better algorithm may
> be
> > what you need. But in this specific case, it might
> not
> > be applicable.
> >
> > On the macro level, splitting the 200M file into
> > smaller parts and processing each of them
> > independently (and combining the results
> afterwards)
> > probably maps nicely to multiple cores. That's the
> > MapReduce or "make -j" approach, if you will. I
> think
> > there was a fine example of this available via the
> > comments on Tim Bray's blog (the WF II post).
> >
> > Likewise, you can probably overlap reading data
> from
> > disk with processing it. Pipelining can be useful,
> but
> > again, in this case the effects are probably
> small.
> >
> > Hope this helps.
> >
> > Best,
> > Thomas
> >
> >
> >
> >
> >
> 
> I read 'A Stream Library using Erlang Binaries'
> before:
>
http://www.duomark.com/erlang/publications/acm2005.pdf
> Where a lot of implement information on
> Binary/List/Tuple in Erlang.
> 
> Thanks for these suggestions.
> 
> I'm with a lot of interesting to find the way to
> handle large
> text/binary efficiently enough in Erlang, and like
> to see more discuss
> on this topic.

A small bit of further data:

I built a 200M file from the original 2M data. On my
machine, the final step of pasting together 10 20M
files with cat needed 5.54 seconds real, user 0.04;
sys 1.35. Presumably, my laptop won't do much better
than that in the I/O department.

Reading the file back into Erlang (using
file:read_file) needed 8.147 seconds. Not too shabby,
but it could do better.

Finally, using the R11 "load byte" idiom of traversing
the whole binary to locate all newlines needed 6.28
seconds (ie, running at 31.8 MB/s if you will :-).
(Hipe didn't improve things.)

It would be interesting to see how and whether reading
and processing the file could be overlapped and
split/parallelized. (It doesn't seem impossible, does
it?)

Note that this does not include the actual regexp
search of each line ... The code to do that could be
an FSM + dictionary insert instead of the simple
check-and-accumulate for newline. Extra cost? Unclear.

One could merge that code into the scanning the binary
but this should ideally be done by a regexp compiler,
shouldn't it? Not done at this time AFAIK.

Best,
Thomas

      ____________________________________________________________________________________
Shape Yahoo! in your own image.  Join our Network Research Panel today!   http://surveylink.yahoo.com/gmrs/yahoo_panel_invite.asp?a=7