[erlang-questions] Erlang read file benchmark

Evans, Matthew mevans@REDACTED
Sun Jul 10 04:43:49 CEST 2011


Using prim_file and the binary module produces:

[mevans@REDACTED ~]$  time perl reader.pl
Found 6032291 lines.

real    0m1.368s
user    0m1.295s
sys     0m0.068s
[mevans@REDACTED ~]$ time erl  -noshell -s reader main
Found 6048490 lines.

real    0m2.090s
user    0m1.965s
sys     0m0.111s
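
The figures above are whole-process wall-clock times, so they include starting and stopping the VM. A rough sketch of timing only the work itself from inside the VM with timer:tc/1 follows; the module name reader_timing and the helper timed/1 are made up for illustration and are not part of the benchmark.

```erlang
-module(reader_timing).
-export([timed/1]).

%% Run a zero-argument fun, print the elapsed wall-clock time in
%% milliseconds, and return the fun's result unchanged.
%% (Hypothetical helper, not from the original benchmark.)
timed(Fun) when is_function(Fun, 0) ->
    {Micros, Result} = timer:tc(Fun),
    io:format("elapsed: ~p ms~n", [Micros / 1000]),
    Result.
```

The fun passed in has to return normally, so it would wrap a counting function that returns the count rather than a main/0 that ends in halt().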


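The Erlang code reports more lines because binary:split/3 returns one more piece than there are newlines in a chunk, so every 8192-byte read adds one extra to the count (in effect, a line split between two reads is counted twice). This would be easy to handle; I just wanted a proof of concept: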

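-module(reader).
-export([main/0]).
-define(NL, 10).  % ASCII newline

main() ->
    {ok, File} = prim_file:open("data.log", [read, binary]),
    Lines = count_lines(File, 0),
    io:format("Found ~w lines.~n", [Lines]),
    halt().

%% Read 8192-byte chunks and add up the pieces produced by splitting on
%% newlines. Each chunk yields one piece more than it has newlines, which
%% is the source of the overcount noted above.
count_lines(File, Count) ->
    case prim_file:read(File, 8192) of
        {ok, Chunk} ->
            TC = length(binary:split(Chunk, [<<?NL>>], [global])),
            count_lines(File, Count + TC);
        _ ->
            Count
    end.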
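
One way to avoid the overcount is to count newline bytes directly instead of counting split pieces. The following is a minimal sketch, not the benchmarked code: it assumes the documented file module in raw mode in place of prim_file, and the module name reader_exact is made up.

```erlang
-module(reader_exact).
-export([main/0, count_lines/1, count_newlines/1]).

%% Entry point in the same style as the original:
%%   erl -noshell -s reader_exact main
main() ->
    io:format("Found ~w lines.~n", [count_lines("data.log")]),
    halt().

%% Count newline bytes in the named file, reading 8192 bytes at a time.
count_lines(Path) ->
    {ok, File} = file:open(Path, [read, raw, binary]),
    try
        loop(File, 0)
    after
        file:close(File)
    end.

loop(File, Count) ->
    case file:read(File, 8192) of
        {ok, Chunk} -> loop(File, Count + count_newlines(Chunk));
        eof -> Count
    end.

%% Number of newline bytes in one binary chunk.
count_newlines(Chunk) ->
    length(binary:matches(Chunk, <<"\n">>)).
```

Because only newline bytes are counted, chunk boundaries no longer matter. A final line with no trailing newline is not counted by this sketch.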


________________________________________
From: erlang-questions-bounces@REDACTED [erlang-questions-bounces@REDACTED] On Behalf Of Bob Ippolito [bob@REDACTED]
Sent: Saturday, July 09, 2011 7:33 PM
To: Kenny Stone
Cc: erlang-questions Questions
Subject: Re: [erlang-questions] Erlang read file benchmark

I think what most people want (especially for benchmarks) is something
that doesn't care about encodings and doesn't have a lot of
indirection. The current solution is VERY flexible, which comes at a
severe cost to performance in this case.

If you read the source code you'll see that file:read_line/1 calls
into io:request/2, which eventually (in another process) ends up in
file_io_server:io_request/2, reads either 128 bytes or 8 KB at a
time, does some unicode junk, and then calls io_lib:collect_line/4
to collect each line a chunk at a time.

If I was trying to win a benchmark I'd probably go directly to
prim_file, do my own buffering, and use erlang:decode_packet/3 or the
binary module to split on the newlines. If I wanted to make a nicer
API I'd put that in a process to manage the buffering.
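
The approach described above (raw reads, manual buffering, erlang:decode_packet/3 to split lines) might look roughly like the sketch below. It is an illustration under assumptions, not code from this thread: it uses file:open/2 in raw mode rather than prim_file, the 64 KB read size is arbitrary, and the module and function names are made up.

```erlang
-module(line_reader).
-export([fold_lines/3, split_lines/1]).

%% Fold Fun(Line, Acc) over every line of the file at Path.
%% Lines are binaries and keep their trailing newline.
fold_lines(Fun, Acc0, Path) ->
    {ok, File} = file:open(Path, [read, raw, binary]),
    try
        read_loop(File, <<>>, Fun, Acc0)
    after
        file:close(File)
    end.

%% Buffer holds the incomplete line left over from the previous read.
read_loop(File, Buffer, Fun, Acc) ->
    case file:read(File, 65536) of
        {ok, Chunk} ->
            {Lines, Rest} = split_lines(<<Buffer/binary, Chunk/binary>>),
            read_loop(File, Rest, Fun, lists:foldl(Fun, Acc, Lines));
        eof when Buffer =:= <<>> ->
            Acc;
        eof ->
            %% Last line had no trailing newline; hand it over as is.
            Fun(Buffer, Acc)
    end.

%% Split a binary into its complete lines and the unterminated remainder.
split_lines(Bin) ->
    split_lines(Bin, []).

split_lines(Bin, Lines) ->
    case erlang:decode_packet(line, Bin, []) of
        {ok, Line, Rest} -> split_lines(Rest, [Line | Lines]);
        {more, _} -> {lists:reverse(Lines), Bin}
    end.
```

Counting lines is then line_reader:fold_lines(fun(_Line, N) -> N + 1 end, 0, "data.log"). Carrying the remainder into the next read is what keeps a line that straddles two chunks from being counted twice.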

On Sat, Jul 9, 2011 at 4:08 PM, Kenny Stone <kennethstone@REDACTED> wrote:
> Why is it awful?
>
> On Sat, Jul 9, 2011 at 6:07 PM, Bob Ippolito <bob@REDACTED> wrote:
>>
>> file:read_line does some pretty awful things, I'd expect it to be very
>> slow. That said, there should be a much faster yet still easy way to
>> do this quickly but there isn't one baked into OTP that I know of.
>>
>> On Saturday, July 9, 2011, Michael Truog <mjtruog@REDACTED> wrote:
>> > He only showed the results on the command-line.  It would be nice to see
>> > results that show runtime without the startup/teardown overhead that the
>> > Erlang VM has, since it has a lot more going on than the perl interpreter.
>> >  I know he briefly mentioned that the difference seemed minimal, but he
>> > posted no results to show that.
>> >
>> > On 07/09/2011 12:15 PM, Evans, Matthew wrote:
>> >> Sorry if this is a duplicate email.
>> >>
>> >> I can understand Erlang being a bit slower than Perl for this. Can't
>> >> see an excuse for such a difference though.
>> >>
>> >> http://agentzh.org/#ErlangFileReadLineBenchmark
>> >>
>> >> Matt
>> >>
>> >> _______________________________________________
>> >> erlang-questions mailing list
>> >> erlang-questions@REDACTED
>> >> http://erlang.org/mailman/listinfo/erlang-questions