[erlang-questions] Speed of CSV parsing: how to read 1M of lines in 1 second

Sun Mar 25 19:47:19 CEST 2012

So, currently my time is 850 ms for a 300K-line file, which is about
2.8 us per line.

https://github.com/maxlapshin/csv_reader

$ ./csv_bench.erl example.csv
./csv_bench.erl:66: Warning: variable 'Evt' is unused
Load csv_reader: ok
csv_reader:133 {chunk,210,19650181}
....
csv_reader:142 {loader,<0.36.0>,finish}
NIF: 851

The ideas are as follows:

1) Parse the file in 4 threads. Detect where the real line boundaries
of each part are and spawn 4 workers with offsets and limits.
2) Don't prepend the data left over from the previous block. If some
line is read only partially, then start reading the next block
size(Rest) bytes earlier.
3) Use a special C NIF that splits binaries into lines and parses them
in the same call. Parsing is done via a precompiled pattern.
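
Idea 1 can be sketched in plain Erlang roughly like this. This is only
an illustration, not the code from the csv_reader repository: the
module name csv_chunks and the function ranges/2 are made up for this
example, and it works on a binary that is already in memory instead of
a file. It cuts the data into at most N parts and moves every tentative
border forward to just after the next "\n", so that no line is split
between two workers.

```erlang
-module(csv_chunks).
-export([ranges/2]).

%% Hypothetical helper: split Bin into at most N {Offset, Length}
%% ranges, each ending right after a newline or at the end of Bin.
ranges(Bin, N) when is_binary(Bin), is_integer(N), N > 0 ->
    Size = byte_size(Bin),
    Step = max(1, (Size + N - 1) div N),
    ranges(Bin, Size, Step, 0, []).

ranges(_Bin, Size, _Step, Start, Acc) when Start >= Size ->
    lists:reverse(Acc);
ranges(Bin, Size, Step, Start, Acc) ->
    Guess = min(Start + Step, Size),
    End = line_end(Bin, Size, Guess),
    ranges(Bin, Size, Step, End, [{Start, End - Start} | Acc]).

%% Move a tentative border forward to just past the next newline.
line_end(_Bin, Size, Guess) when Guess >= Size ->
    Size;
line_end(Bin, Size, Guess) ->
    %% Search from one byte before Guess, so that a border which
    %% already sits right after a "\n" is kept as it is.
    From = Guess - 1,
    case binary:match(Bin, <<"\n">>, [{scope, {From, Size - From}}]) of
        {Pos, 1} -> Pos + 1;
        nomatch -> Size
    end.
```

Each {Offset, Length} pair could then be handed to its own worker,
which would take its part with binary:part(Bin, Offset, Length) or
read it from disk with file:pread(Fd, Offset, Length).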

It was a good idea to throw away the non-Erlang IO, because there are
no lags in the Erlang file:read layer. But there are lags even in
binary:split. A NIF that returns a subbinary up to the first EOL is
several times faster than binary:split(Bin, <<"\n">>).
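
For reference, the operation "return a subbinary up to the first EOL"
looks roughly like this in pure Erlang. This is not the NIF, only a
sketch of the same interface as I would assume it: the module name
csv_lines, the functions first_line/1 and lines/1, and the atom 'more'
are invented for this example, and I make no claim about how its speed
compares with the NIF or with binary:split.

```erlang
-module(csv_lines).
-export([first_line/1, lines/1]).

%% Return {Line, Rest} for the first "\n"-terminated line in Bin,
%% or 'more' when Bin holds no complete line yet.
first_line(Bin) when is_binary(Bin) ->
    case binary:match(Bin, <<"\n">>) of
        {Pos, 1} ->
            %% Line and Rest are sub-binaries of Bin, nothing is copied.
            <<Line:Pos/binary, $\n, Rest/binary>> = Bin,
            {Line, Rest};
        nomatch ->
            more
    end.

%% Collect all complete lines of a block. The second element of the
%% result is the unfinished tail of the block (may be empty).
lines(Bin) ->
    lines(Bin, []).

lines(Bin, Acc) ->
    case first_line(Bin) of
        {Line, Rest} -> lines(Rest, [Line | Acc]);
        more -> {lists:reverse(Acc), Bin}
    end.
```

The tail returned by lines/1 is the Rest from idea 2: instead of
prepending it to the next block, the next read can simply start
byte_size(Rest) bytes earlier.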