[erlang-questions] Speed of CSV parsing: how to read 1M of lines in 1 second

Sun Mar 25 23:02:21 CEST 2012

I think all these ideas have merit.

For (1) I agree with Joe that to take proper advantage of parallelism, you've got to redesign the problem space a little, possibly by removing the need for splitting, recombining and/or re-sequencing. The problem to my mind is that parsing a csv input stream (be it from a file or a socket or whatever) is more naturally a sequential task. If we add parallelism for performance gains, we pay for it later when we have to 'work around' the issues that parallelism introduces (such as out-of-order data). Having said that, maybe your problem space isn't specifically about parsing a wide range of csv input files.
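
To make the border detection in (1) concrete, here is a minimal sketch of how the offsets and limits might be computed. It is not taken from csv_reader: the module name csv_split and the function chunks/2 are made up for illustration, and it assumes the whole input is already in memory as one binary and that no quoted field contains a newline.

```erlang
-module(csv_split).
-export([chunks/2]).

%% Split Bin into {Offset, Length} ranges of roughly equal size.
%% Every range ends just after a "\n" or at the end of the binary,
%% so no line is shared between two ranges.
%% Assumption: newlines never occur inside quoted csv fields.
chunks(Bin, N) when is_binary(Bin), is_integer(N), N > 0 ->
    Size = byte_size(Bin),
    Step = max(1, Size div N),
    chunks(Bin, Size, Step, 0, []).

chunks(_Bin, Size, _Step, Start, Acc) when Start >= Size ->
    lists:reverse(Acc);
chunks(Bin, Size, Step, Start, Acc) ->
    Guess = Start + Step,
    End = case Guess >= Size of
              true  -> Size;
              false -> next_eol(Bin, Guess, Size)
          end,
    chunks(Bin, Size, Step, End, [{Start, End - Start} | Acc]).

%% Position just past the first "\n" at or after From, or Size if
%% there is no further newline.
next_eol(Bin, From, Size) ->
    case binary:match(Bin, <<"\n">>, [{scope, {From, Size - From}}]) of
        {Pos, 1} -> Pos + 1;
        nomatch  -> Size
    end.
```

Each range can then be handed to its own worker, e.g. via binary:part(Bin, Offset, Length). The workers still finish in arbitrary order, so the re-sequencing cost mentioned above does not go away.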

BTW my macbook pro appears to have appalling performance characteristics when testing this, taking roughly 350ms longer than yours to do the work:

t4@REDACTED:csv_reader $ make bench
./rebar compile
==> csv_reader (compile)
./csv_bench.erl example.csv
./csv_bench.erl:66: Warning: variable 'Evt' is unused
Load csv_reader: ok
NIF: 1202

Note that I also had to remove the io:format/2 debugging from the code to get that figure, as it made the run take much longer. With the debugging left in:

t4@REDACTED:csv_reader $ make bench
./rebar compile
==> csv_reader (compile)
./csv_bench.erl example.csv
./csv_bench.erl:66: Warning: variable 'Evt' is unused
Load csv_reader: ok
csv_reader:133 {chunk,210,19650023}
....
csv_reader:142 {loader,<0.35.0>,finish}
NIF: 1828

I would *love* to know why all this io:format takes so much longer on my machine than on yours, as I was under the impression that Erlang/OTP behaved nicely on darwin (apart from not doing proper kernel poll, as kqueue seems pretty broken on OS X), and this is making me wonder.

On 25 Mar 2012, at 18:47, Max Lapshin wrote:

> So, currently my time is 850 ms for a 300K line file, which is 2.8 us per line.
> 
> https://github.com/maxlapshin/csv_reader
> 
> $ ./csv_bench.erl example.csv
> ./csv_bench.erl:66: Warning: variable 'Evt' is unused
> Load csv_reader: ok
> csv_reader:133 {chunk,210,19650181}
> ....
> csv_reader:142 {loader,<0.36.0>,finish}
> NIF: 851
> 
> 
> The ideas are as follows:
> 
> 1) Parse the file in 4 threads. Detect where the real line borders of
> each part are and spawn 4 workers with offsets and limits
> 2) Don't prepend the data left over from the previous block. If some
> line is read only partially, then start reading the next block
> size(Rest) bytes earlier
> 3) Use a special C NIF that splits binaries into lines and parses them
> in the same call. Parsing is done via a precompiled pattern.
> 
> 
> 
> It was a good idea to throw away non-erlang IO, because there are no
> lags in the erlang file:read layer. But there are lags even in
> binary:split. A NIF that returns a subbinary up to the first EOL is
> several times faster than binary:split(Bin, <<"\n">>).