[erlang-questions] Slow line oriented IO

Thu Feb 14 22:40:41 CET 2013

I wrote a little program to test this:

-module(test1).

-compile(export_all).

test1(N) ->
    timer:tc(?MODULE, make, [N]).

test2() ->
    timer:tc(?MODULE, read, []).

make(N) ->
    %% Each line is 30 arbitrary bytes (values 30..59) plus a newline.
    Line = lists:seq(30,59) ++ [$\n],
    file:write_file("big.tmp", lists:duplicate(N,Line)).

read() ->
    {ok,B} = file:read_file("big.tmp"),
    L = split(B, [], []),
    length(L).

%% split(Bin, CurrentLineReversed, LinesReversed)
%% A newline (10) closes the current line and starts a new one.
split(<<10,B/binary>>, L1, L2) ->
    split(B, [], [list_to_binary(lists:reverse(L1))|L2]);
split(<<H,T/binary>>, L1, L2) ->
    split(T, [H|L1], L2);
split(<<>>, L1, L2) ->
    lists:reverse([list_to_binary(lists:reverse(L1))|L2]).

The timings I got showed that Erlang was 9 times slower than perl (or wc),
which is more or less what I expected. If I wanted to speed this up
I'd write a NIF to split the binary at the first newline character.
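
Short of writing a NIF, the standard library's binary:split/3 can do the
same splitting without building an intermediate list per line. The sketch
below is an illustration only, not something measured in this thread; the
module name split_sketch and its function names are made up for the example.

```erlang
-module(split_sketch).
-export([lines/1, count_lines/1]).

%% Split a binary at every newline using binary:split/3.
%% A trailing newline yields a final empty binary, just as
%% test1:split/3 does.
lines(Bin) when is_binary(Bin) ->
    binary:split(Bin, <<"\n">>, [global]).

%% Read the whole file in one go and count the pieces.
count_lines(File) ->
    {ok, Bin} = file:read_file(File),
    length(lines(Bin)).
```

It can be timed the same way as read/0, for example with
timer:tc(split_sketch, count_lines, ["big.tmp"]).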

I actually always use file:read_file(F) for everything, since getting the
entire file in one go always seems a good idea and my files are small
(compared to my RAM). I'd use file:pread for files that are too big for
memory and need random-access reads. Reading the entire file seems
a good idea for files smaller than 100MB, since I have 4GB of memory.
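
For the too-big-for-memory case, a random-access read with file:pread/3
might look like the sketch below. The module name pread_sketch and the
function read_chunk/3 are hypothetical, chosen only for this example.

```erlang
-module(pread_sketch).
-export([read_chunk/3]).

%% Read Len bytes starting at byte Offset without loading the whole file.
%% Returns {ok, Bin}, eof (Offset is past the end) or {error, Reason}.
read_chunk(File, Offset, Len) ->
    {ok, Fd} = file:open(File, [read, raw, binary]),
    try
        file:pread(Fd, Offset, Len)
    after
        %% Always close the descriptor, even if pread fails.
        file:close(Fd)
    end.
```

Opening and closing the file on every call keeps the sketch short; a real
reader would probably keep the descriptor open across calls.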

The OS seems to do a better job of caching entire files than I could ever
do so I don't worry about re-reading them ...

I have no idea why you see a factor of 250 - is this a memory problem?
How much memory have you got? Does your program scale linearly with
the file size - or does something go suddenly wrong as you increase the
size of the file?
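
One way to check the scaling question is to time the same job at several
sizes and compare the cost per item. This is only a rough sketch: the
module name scale_sketch and the run/3 interface are assumptions made for
the example, not an existing tool.

```erlang
-module(scale_sketch).
-export([run/3]).

%% For each size N: call Setup(N) untimed, then time Work() with
%% timer:tc/1. Returns [{N, Microseconds, MicrosecondsPerItem}].
%% A roughly constant third element suggests linear scaling; a sudden
%% jump suggests something else (memory, for instance) kicks in.
run(Setup, Work, Sizes) ->
    [begin
         Setup(N),
         {Micros, _Result} = timer:tc(Work),
         {N, Micros, Micros / N}
     end || N <- Sizes].
```

With the test1 module above this could be called as
scale_sketch:run(fun test1:make/1, fun test1:read/0, [100000, 200000, 400000]).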

Cheers

/Joe

On Thu, Feb 14, 2013 at 3:46 PM, Hynek Vychodil <vychodil.hynek@REDACTED> wrote:

> Hello,
> I know this has already been discussed on this list and has been a
> recurring topic for at least five years. But I have been bitten by it
> again and have found a pretty pathological case. I have a 30MB text file
> that contains a few lines close to 1MB long. (I can provide a file with
> the same line lengths if somebody is interested.) What I have observed is
> that reading this file using raw file:read_line/1 takes 51s! For
> comparison I tried some different approaches, and here is what I got
> (line_read_test:read_std/1 uses file:read_line/1):
>
> 1> timer:tc(line_read_test,read_std,["test.txt"]).
> {51028105,2408}
> 2> timer:tc(line_read_test,read,["test.txt"]).
> {226220,2408}
> 3> timer:tc(line_read_test,read_port,["test.txt"]).
> {139388,2408}
>
> $ time perl -nE'$i++}{say $i' test.txt
> 2408
>
> real    0m0.053s
> user    0m0.044s
> sys     0m0.008s
>
> $ time wc -l test.txt
> 2408 test.txt
>
> real    0m0.013s
> user    0m0.004s
> sys     0m0.008s
>
> $ time ./a.out test.txt
> 2408
>
> real    0m0.020s
> user    0m0.012s
> sys     0m0.008s
>
> This means Erlang should be at least 225 times faster
> (line_read_test:read/1, which has flow control). Erlang can be 350 times
> faster (line_read_test:read_port/1, without flow control). Another
> high-level language (perl) is almost a thousand times faster. A special C
> program is almost four thousand times faster, and good old glibc is two
> and a half thousand times faster. Come on guys, it is not even funny when
> a simple (and wrong) Erlang wrapper around the standard module is more
> than two orders of magnitude faster. In my experience, when something is
> two orders of magnitude slower, it tells me there is something damn
> wrong. I have looked into efile_drv.c, and it is unfortunately far beyond
> my C skills, but if simple buffering and binary:match/2 can outperform it
> 225 times, there has to be something rotten in this code.
>
> I have also experimented with the read_ahead option in file:open, and
> changing it to a smaller value makes things worse.
>
> Just to give a sense of how bad it is: in the same time I am able to sort
> 150 million 64-bit values (1.2GB of data) three times (one CPU core, same
> HW). The cause is not flow control; my simple wrapper does flow control
> too. It cannot be about making the current code less intrusive: if it
> consumes 100% CPU for 51s instead of 226ms, it will definitely affect the
> whole server. The cause is not concurrent access either; my code allows
> concurrent access too. Admitting that something is broken is the first
> step to fixing it. I hope I have helped at least in this way.
>
> With best regards
>   Hynek Vychodil