[erlang-questions] Not an Erlang fan

Caoyuan dcaoyuan@REDACTED
Mon Sep 24 14:30:38 CEST 2007


On 9/24/07, Thomas Lindgren <thomasl_erlang@REDACTED> wrote:
>
> --- Caoyuan <dcaoyuan@REDACTED> wrote:
>
> > For a 2M data file, test2/1 also takes about 300ms on
> > my machine, but
> > this is not a good result for a 200M file either.
> >
> > What I'm wondering is: in Tim's example and in a lot of
> > other cases, when
> > you have to process a very big text/binary, you have
> > to traverse it in
> > some way, so what is the most efficient way?
> > Currently, test2/1,
> > test3/1 and similar code are all not efficient enough.
>
> Hi Caoyuan,
>
> It's of course difficult to give a single solution
> that is best in every instance. And sometimes, the
> best available solution might just not be fast enough.
> The previous mail is most of the toolbox I currently
> use on the micro level.
>
> 1. Try to process larger parts of binaries at once to
> amortize 'shared' work.
>
> 2. Use tricks like the "load byte" thing of the last
> example.
>
> 3. Sometimes, converting the binary to a list and
> traversing the list can be the best solution.
>

I tried splitting the big binary into smaller pieces and then converting
each piece to a list. Here's the code:

scan(FileName) ->
    statistics(wall_clock),
    {ok, Bin} = file:read_file(FileName),
    {_, Duration1} = statistics(wall_clock),
    io:format("Duration reading: ~pms~n", [Duration1]),

    {Matched, Total} = split_scan(Bin, 1024 * 1024, [], 0, 0),
    {_, Duration2} = statistics(wall_clock),
    io:format("Duration ~pms~n Matched:~B, Total:~B~n",
              [Duration2, Matched, Total]).

%% Left is the unfinished last line of the previous chunk, kept reversed
%% (as an accumulator). Matched and Total are running counts over all chunks.
split_scan(<<>>, _SplitSize, _Left, Matched, Total) -> {Matched, Total};
split_scan(Bin, SplitSize, Left, Matched, Total) ->
    Size = size(Bin),
    io:format("splitting: ~B~n", [Size]),
    {Bin1, Rest} = if Size >= SplitSize ->
                           <<BinX:SplitSize/binary, RestX/binary>> = Bin,
                           {BinX, RestX};
                       true ->
                           {Bin, <<>>}
                   end,
    {Matched1, Total1, Left1} =
        scan_line(binary_to_list(Bin1), Left, Matched, Total),
    split_scan(Rest, SplitSize, Left1, Matched1, Total1).

%% Line accumulates the characters of the current line in reverse order.
scan_line([], Line, Matched, Total) -> {Matched, Total, Line};
scan_line([$\n|Rest], Line, Matched, Total) ->
    _Line1 = lists:reverse(Line),
    %Matched1 = Matched + process_match(_Line1),
    scan_line(Rest, [], Matched, Total + 1);
scan_line([C|Rest], Line, Matched, Total) ->
    scan_line(Rest, [C|Line], Matched, Total).

But the elapsed time does not seem to decrease; if anything, it gets worse.
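
For comparison, one alternative to converting each piece to a list is to
match on the binary directly. The following is only a minimal sketch, not
something measured in this thread: the module name line_count and the
function count/1 are made up for illustration, and how fast it runs depends
on the Erlang/OTP release and its binary-matching optimizations.

```erlang
-module(line_count).
-export([count/1]).

%% Count newline characters by matching on the binary itself,
%% without building an intermediate list.
count(Bin) when is_binary(Bin) ->
    count(Bin, 0).

%% A newline byte: one more line seen.
count(<<$\n, Rest/binary>>, N) -> count(Rest, N + 1);
%% Any other byte: skip it.
count(<<_, Rest/binary>>, N)   -> count(Rest, N);
%% End of the binary: return the count.
count(<<>>, N)                 -> N.
```

Usage would be something like line_count:count(Bin) on the result of
file:read_file/1. Like the scan_line code above, it only counts lines that
end in a newline.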

> 4. Use the erlang builtins, if applicable. (See the
> erlang module.)
>
> For instance, Ulf Wiger pointed to a trick that
> delivers data in lines at a high rate. Another
> approach (very hacky) could be to send the data via a
> loopback gen_tcp socket set in 'line' mode, then
> matching the lines as they arrive.
>
> 5. (Write erlang drivers.) This is fairly popular for
> some problems but then you're not doing Erlang
> programming anymore.
>
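
As an illustration of point 4, here is a rough sketch that lets a builtin
deliver the data line by line. It is not necessarily the trick Ulf Wiger
pointed to: it assumes an Erlang/OTP release that provides
file:read_line/1, and the module name line_reader, the function
count_lines/1 and the 64 KB read_ahead buffer size are arbitrary choices
made for this example.

```erlang
-module(line_reader).
-export([count_lines/1]).

%% Open the file in raw binary mode with a read-ahead buffer and let
%% file:read_line/1 hand back one line at a time.
count_lines(FileName) ->
    {ok, Fd} = file:open(FileName, [read, raw, binary, {read_ahead, 65536}]),
    try
        loop(Fd, 0)
    after
        file:close(Fd)
    end.

%% Each {ok, Line} is one line of the file; eof ends the loop.
loop(Fd, N) ->
    case file:read_line(Fd) of
        {ok, _Line} -> loop(Fd, N + 1);
        eof         -> N
    end.
```

The place where _Line is ignored is where the per-line matching would go.
Note that a final line without a trailing newline is still returned as a
line here.
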
> On the mesolevel (sorry :-) a better algorithm may be
> what you need. But in this specific case, it might not
> be applicable.
>
> On the macro level, splitting the 200M file into
> smaller parts and processing each of them
> independently (and combining the results afterwards)
> probably maps nicely to multiple cores. That's the
> MapReduce or "make -j" approach, if you will. I think
> there was a fine example of this available via the
> comments on Tim Bray's blog (the WF II post).
>
> Likewise, you can probably overlap reading data from
> disk with processing it. Pipelining can be useful, but
> again, in this case the effects are probably small.
>
> Hope this helps.
>
> Best,
> Thomas
>
>
>
>
>
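
The macro-level suggestion (split the data, process the parts
independently, combine the results) could be sketched roughly as below.
This is only an illustration under some assumptions: the module name
par_count and its functions are made up, and it only counts newline
characters, which is safe to do on chunks cut at arbitrary offsets. Real
per-line matching would need the chunks to be cut at line boundaries.

```erlang
-module(par_count).
-export([count/2]).

%% Split Bin into chunks of at most ChunkSize bytes, count the newlines
%% of each chunk in its own process, and sum the partial counts.
count(Bin, ChunkSize)
  when is_binary(Bin), is_integer(ChunkSize), ChunkSize > 0 ->
    Parent = self(),
    Refs = [spawn_counter(Parent, Chunk) || Chunk <- chunks(Bin, ChunkSize)],
    %% Collect one reply per worker, identified by its unique reference.
    lists:sum([receive {Ref, N} -> N end || Ref <- Refs]).

%% Start one worker and return the reference that tags its reply.
spawn_counter(Parent, Chunk) ->
    Ref = make_ref(),
    spawn_link(fun() -> Parent ! {Ref, newlines(Chunk, 0)} end),
    Ref.

%% Cut the binary into a list of sub-binaries of at most Size bytes.
chunks(<<>>, _Size) -> [];
chunks(Bin, Size) when byte_size(Bin) =< Size -> [Bin];
chunks(Bin, Size) ->
    <<Chunk:Size/binary, Rest/binary>> = Bin,
    [Chunk | chunks(Rest, Size)].

%% Count newline bytes in one chunk.
newlines(<<$\n, Rest/binary>>, N) -> newlines(Rest, N + 1);
newlines(<<_, Rest/binary>>, N)   -> newlines(Rest, N);
newlines(<<>>, N)                 -> N.
```

For example, par_count:count(Bin, 1024 * 1024) would use the same 1M
piece size as split_scan above. Whether this actually helps depends on the
number of cores and on how expensive the per-chunk work is.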

I have read 'A Stream Library using Erlang Binaries' before:
http://www.duomark.com/erlang/publications/acm2005.pdf
It contains a lot of implementation information about binaries, lists
and tuples in Erlang.

Thanks for these suggestions.

I'm very interested in finding a way to handle large text/binary data
efficiently in Erlang, and would like to see more discussion on this
topic.

Best,
Caoyuan


-- 
- Caoyuan


