[erlang-questions] Speed of CSV parsing: how to read 1M of lines in 1 second

Fri Mar 23 18:28:49 CET 2012

The problem doesn't appear to have anything to do with the speed of read_line; it looks like a binary splitting/processing problem instead. Consider this example, which simply uses file:read_line/1 to get a list of lines from the 300k-line CSV file:

t4@REDACTED:csv_reader $ ./benchmark.erl example.csv 
./benchmark.erl:23: Warning: type transform_fun() is unused
./benchmark.erl:149: Warning: variable 'Pattern' is unused
read_line time: 1365
read size: 300000
t4@REDACTED:csv_reader $

But just using binary:split/3 on the individual lines, without even processing the column cells, slows things down considerably (12654 ms versus 1365 ms):

t4@REDACTED:csv_reader $ ./benchmark.erl example.csv 
./benchmark.erl:23: Warning: type transform_fun() is unused
read_line time: 12654
read size: 300000
t4@REDACTED:csv_reader $ 

This is unfortunately rather slow, since binary processing is one of the things that Erlang is supposed to be exceptionally good at. The parsing code (which is part of a larger example I was playing with) boils down to these functions (with a few setup functions and record definitions omitted for brevity):

main([Path]) ->
    {ok, Fd} = file:open(Path, [raw, binary, {read_ahead, 1024 * 1024}]),
    try
        T1 = erlang:now(),
        Res = parse(make_parser(Fd), start),
        T2 = erlang:now(),
        io:format("read_line time: ~p~n", [timer:now_diff(T2, T1) div 1000]),
        io:format("read size: ~p~n", [length(Res)])
    catch 
        Ex:R ->
            io:format("~p~n", [erlang:get_stacktrace()]),
            throw({Ex, R})
    after
        file:close(Fd)
    end.

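parse(P=#csv_parser{ iodevice=Fd, header=true }, start) ->
    % Skip the header line, then start accumulating records.
    file:read_line(Fd),
    parse(P, []);
parse(P, start) ->
    % No header: start from an empty list, not the atom 'start'.
    parse(P, []);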
parse(P=#csv_parser{ iodevice=Fd,
                     delimiter=Pattern }, Acc) ->
    case file:read_line(Fd) of
        {ok, Data} ->
            % Record = process(P, binary:split(Data, Pattern, [global])),
            Record = binary:split(Data, Pattern, [global]),
            parse(P, [Record|Acc]);
        eof ->
            Acc;
        Other ->
            throw(Other)
    end.

***************************

So it doesn't look like file:read_line is really the cause of the slowness. I've also tried several other processing examples (to actually generate the records, convert to floats, etc.) and of course these just add processing time. I do wonder if there is a more efficient way of splitting the line-oriented binaries and processing the data, but I am currently assuming that binary:split/3 is a pretty efficient mechanism. I might investigate this a little further at some point, if I get time.
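One variation that might be worth trying is to compile the delimiter once with binary:compile_pattern/1 and reuse the compiled pattern for every line. This is only a sketch and has not been measured against the numbers above: the module name split_sketch and the function split_lines/1 are hypothetical, the delimiter is assumed to be a comma, and since make_parser is omitted above it is not known whether Pattern is already precompiled there.

```erlang
-module(split_sketch).
-export([split_lines/1]).

%% Hypothetical helper: splits each line (a binary) into cells, compiling
%% the delimiter pattern once rather than passing a raw binary per call.
%% Assumes a comma delimiter; quoted cells are not handled.
split_lines(Lines) ->
    Pattern = binary:compile_pattern(<<",">>),
    [binary:split(Line, Pattern, [global]) || Line <- Lines].
```

For example, split_sketch:split_lines([<<"a,b,c\n">>]) returns [[<<"a">>, <<"b">>, <<"c\n">>]]; the trailing newline stays on the last cell, just as it does when splitting the output of file:read_line/1 directly.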

Cheers,

Tim  

On 23 Mar 2012, at 13:10, Max Lapshin wrote:

> On Fri, Mar 23, 2012 at 4:54 PM, Gordon Guthrie <gordon@REDACTED> wrote:
>> Max
>> 
>> There is a csv parser for RFC 4180 compliant csv files which is now
>> being maintained by Eric Merritt:
>> https://github.com/afiniate/erfc_parsers/tree/master/src
>> 
> 
> It is great, but:
> 
> 1) it doesn't support headers. Usually the first line is a header
> 2) it is very strict about the structure of the CSV. The structure may
> change, so columns cannot be tied to an index, only to a name
> 3) it is incredibly slow: NIF: 4812, RFC: 38371
> That is about 8 times slower than my code.