[erlang-questions] Slow line oriented IO

Hynek Vychodil vychodil.hynek@REDACTED
Fri Feb 15 01:28:24 CET 2013


I think it is a misunderstanding. There is no problem with Erlang IO
for normal files such as the ones you tested. The problem is with slightly
unusual files that have a few very long lines, like the one I attached.
Such files trigger pathological behavior in prim_file:read_line/1. It works
perfectly on, for example, another 3GB file with 16 million lines, each
about 198 bytes long: it processes that in a nice two minutes (29MB/s and
148klines/s). But on a file a hundred times smaller, with seven thousand
times fewer lines, it takes half that time, which is terribly wrong
(0.5MB/s and 46 lines/s). So I am not saying that Erlang is bad at
processing files in general, but there is some nasty bug in the
prim_file:read_line/1 function.
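
To make the pathological case easy to reproduce, here is a minimal sketch
(the module name repro and the file name long_lines.tmp are made up for
illustration). It writes a file with a handful of near-1MB lines and times
raw file:read_line/1 over it:

-module(repro).
-export([make_long_lines/2, time_read_line/1]).

%% Write Count lines of Len bytes each; make_long_lines(30, 1000000)
%% produces a roughly 30MB file whose lines are each close to 1MB long.
make_long_lines(Count, Len) ->
    Line = [lists:duplicate(Len, $x), $\n],
    file:write_file("long_lines.tmp", lists:duplicate(Count, Line)).

%% Time counting lines with raw file:read_line/1, the slow case above.
time_read_line(File) ->
    {ok, Fd} = file:open(File, [read, raw, binary, read_ahead]),
    Result = timer:tc(fun() -> count_lines(Fd, 0) end),
    ok = file:close(Fd),
    Result.

count_lines(Fd, N) ->
    case file:read_line(Fd) of
        {ok, _Line} -> count_lines(Fd, N + 1);
        eof -> N
    end.

Comparing time_read_line/1 on long_lines.tmp against a file of the same
size with short lines should show the difference described above.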

On Thu, Feb 14, 2013 at 10:40 PM, Joe Armstrong <erlang@REDACTED> wrote:
> I wrote a little program to test this
>
> -module(test1).
>
> -compile(export_all).
>
> test1(N) ->
>     timer:tc(?MODULE, make, [N]).
>
> test2() ->
>     timer:tc(?MODULE, read, []).
>
>
> make(N) ->
>     Line = lists:seq(30,59) ++ [$\n],
>     file:write_file("big.tmp", lists:duplicate(N, Line)).
>
> read() ->
>     {ok,B} = file:read_file("big.tmp"),
>     L = split(B, [], []),
>     length(L).
>
> split(<<10,B/binary>>, L1, L2) ->
>     split(B, [], [list_to_binary(lists:reverse(L1))|L2]);
> split(<<H,T/binary>>, L1, L2) ->
>     split(T, [H|L1], L2);
> split(<<>>, L1, L2) ->
>     lists:reverse([list_to_binary(lists:reverse(L1))|L2]).
>
> The timings I got were that Erlang was 9 times slower than perl (or wc),
> which is more or less what I expected. If I wanted to speed this up
> I'd write a NIF to split the binary at the first newline character.
>
> I actually always use file:read_file(F) for everything - since getting the
> entire file in at one go always seems a good idea, and I have small files
> (compared to my RAM) - I'd use file:pread for files that are too big for
> memory and do random-access reads. Reading the entire file seems
> a good idea for files less than 100MB since I have 4GB of memory.
>
> The OS seems to do a better job of caching entire files than I could ever
> do, so I don't worry about re-reading them ...
>
> I have no idea why you see a factor of 250 - is this a memory problem?
> How much memory have you got? Does your program scale linearly with
> the file size - or does something go suddenly wrong as you increase the
> size of the file?
>
> Cheers
>
> /Joe
>
>
> On Thu, Feb 14, 2013 at 3:46 PM, Hynek Vychodil <vychodil.hynek@REDACTED>
> wrote:
>>
>> Hello,
>> I know it has already been discussed here on the list, and it has been a
>> recurring topic for at least five years. But anyway, I have been bitten by
>> it again and have also found a pretty pathological case. I have a 30MB
>> text file containing a few lines that are close to 1MB long. (I can
>> provide a file with the same line lengths if somebody is interested.)
>> What I have observed is that reading this file using raw file:read_line/1
>> takes 51s! For comparison I have tried some different approaches, and
>> here is what I got (line_read_test:read_std/1 uses file:read_line/1):
>>
>> 1> timer:tc(line_read_test,read_std,["test.txt"]).
>> {51028105,2408}
>> 2> timer:tc(line_read_test,read,["test.txt"]).
>> {226220,2408}
>> 3> timer:tc(line_read_test,read_port,["test.txt"]).
>> {139388,2408}
>>
>> $time perl -nE'$i++}{say $i' test.txt
>> 2408
>>
>> real    0m0.053s
>> user    0m0.044s
>> sys     0m0.008s
>>
>> $ time wc -l test.txt
>> 2408 test.txt
>>
>> real    0m0.013s
>> user    0m0.004s
>> sys     0m0.008s
>>
>> $ time ./a.out test.txt
>> 2408
>>
>> real    0m0.020s
>> user    0m0.012s
>> sys     0m0.008s
>>
>> That means Erlang should be at least 225 times faster
>> (line_read_test:read/1, which has flow control). Erlang can be 350 times
>> faster (line_read_test:read_port/1, without flow control). Another
>> high-level language (perl) is almost a thousand times faster. A dedicated
>> C program is almost four thousand times faster, and good old glibc is two
>> and a half thousand times faster. Come on guys, it is not even fun when a
>> simple (and wrong) Erlang wrapper around a standard module is more than
>> two orders of magnitude faster. From my experience, when something is two
>> orders of magnitude slower, it tells me there is something damn wrong. I
>> have looked into efile_drv.c, and it is unfortunately far beyond my C
>> skill, but if simple buffering and binary:match/2 can outperform it 225
>> times, there has to be something rotten in this code.
>>
>> I have also experimented with the read_ahead option in file:open, and
>> changing it to a smaller value makes things worse.
>>
>> Just to give a sense of how bad it is: in the same time I am able to sort
>> 150 million 64-bit values (1.2GB of data) three times (one CPU core, same
>> HW). It is not a matter of flow control; my simple wrapper does flow
>> control too. Nor is the current code somehow less intrusive: if it
>> consumes 100% CPU for 51s instead of 226ms, it will definitely affect the
>> whole server. It is not a matter of concurrent access; my code allows
>> concurrent access too. Admitting there is something broken is the first
>> step to fixing it. I hope I helped at least in this way.
>>
>> With best regards
>>   Hynek Vychodil
>>
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions
>>
>
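
For reference, here is a minimal sketch of the kind of "simple buffering
and binary:match" wrapper described in the quoted message (the module name
buffered_lines and the 64KB chunk size are made up; this illustrates the
idea rather than reproducing the actual line_read_test module):

-module(buffered_lines).
-export([count_lines/1]).

-define(CHUNK, 65536).

%% Count lines by reading fixed-size chunks and locating newlines with
%% binary:match/3, carrying any trailing partial line over to the next
%% chunk. The scope option avoids re-scanning the carried-over part.
count_lines(File) ->
    {ok, Fd} = file:open(File, [read, raw, binary]),
    N = loop(Fd, <<>>, 0),
    ok = file:close(Fd),
    N.

loop(Fd, Rest, N) ->
    case file:read(Fd, ?CHUNK) of
        {ok, Chunk} ->
            Bin = <<Rest/binary, Chunk/binary>>,
            {NewRest, N1} = scan(Bin, byte_size(Rest), N),
            loop(Fd, NewRest, N1);
        eof when Rest =:= <<>> ->
            N;
        eof ->
            N + 1  %% final line without a trailing newline
    end.

scan(Bin, From, N) ->
    case binary:match(Bin, <<"\n">>, [{scope, {From, byte_size(Bin) - From}}]) of
        {Pos, 1} ->
            Tail = binary:part(Bin, Pos + 1, byte_size(Bin) - Pos - 1),
            scan(Tail, 0, N + 1);
        nomatch ->
            {Bin, N}
    end.

Calling timer:tc(buffered_lines, count_lines, ["test.txt"]) should return
the same 2408 line count as the measurements above; this is the kind of
wrapper behind the much faster read/1 timing.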
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.txt.bz2
Type: application/x-bzip2
Size: 1043 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20130215/5a696835/attachment.bin>

