[erlang-questions] Speed of CSV parsing: how to read 1M of lines in 1 second

Max Lapshin max.lapshin@REDACTED
Sun Mar 25 19:47:19 CEST 2012


So, currently my time is 850 ms for a 300K-line file, which is about 2.8 us per line.

https://github.com/maxlapshin/csv_reader

$ ./csv_bench.erl example.csv
./csv_bench.erl:66: Warning: variable 'Evt' is unused
Load csv_reader: ok
csv_reader:133 {chunk,210,19650181}
....
csv_reader:142 {loader,<0.36.0>,finish}
NIF: 851


The ideas are the following:

1) Parse the file in 4 threads. Detect where the real line borders of
each part are and spawn 4 workers with offsets and limits.
2) Don't prepend the data left over from the previous block. If some
line is read only partially, then start reading the next block
size(Rest) bytes earlier.
3) Use a special C NIF that splits binaries into lines and parses them
in the same call. Parsing is done via a precompiled pattern.
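
Ideas 1 and 2 can be sketched in plain Erlang. This is only an illustration, not the code from csv_reader: the module name csv_chunks and the functions ranges/2 and complete_lines/1 are hypothetical. It works on an in-memory binary and assumes every "\n" ends a record, so it does not handle newlines inside quoted fields.

```erlang
-module(csv_chunks).
-export([ranges/2, complete_lines/1]).

%% Idea 1: split Bin into at most N {Offset, Length} ranges. Every range
%% ends right after a "\n"; the last one ends at the end of the data.
ranges(Bin, N) when is_binary(Bin), is_integer(N), N > 0 ->
    Size = byte_size(Bin),
    Step = max(1, (Size + N - 1) div N),
    ranges(Bin, Size, Step, 0, []).

ranges(_Bin, Size, _Step, Start, Acc) when Start >= Size ->
    lists:reverse(Acc);
ranges(Bin, Size, Step, Start, Acc) ->
    Guess = Start + Step,
    End = case Guess >= Size of
        true ->
            Size;
        false ->
            %% Move the tentative border to the real end of the line.
            %% Searching from Guess - 1 accepts a border that already
            %% sits right after a "\n".
            Scope = {Guess - 1, Size - Guess + 1},
            case binary:match(Bin, <<"\n">>, [{scope, Scope}]) of
                {Pos, 1} -> Pos + 1;
                nomatch -> Size
            end
    end,
    ranges(Bin, Size, Step, End, [{Start, End - Start} | Acc]).

%% Idea 2: return the part of Block that holds only whole lines, plus the
%% number of trailing bytes that belong to an unfinished line. The caller
%% starts the next read that many bytes earlier instead of prepending them.
complete_lines(Block) when is_binary(Block) ->
    Size = byte_size(Block),
    Keep = last_eol(Block, Size),
    {binary:part(Block, 0, Keep), Size - Keep}.

%% Length of the prefix that ends at the last "\n", or 0 if there is none.
last_eol(_Block, 0) ->
    0;
last_eol(Block, Pos) ->
    case binary:at(Block, Pos - 1) of
        $\n -> Pos;
        _ -> last_eol(Block, Pos - 1)
    end.
```

Each {Offset, Length} pair could be handed to its own worker. When reading from disk with file:pread/3, a reader that got {Complete, RestSize} from complete_lines/1 would issue its next read at the previous offset plus byte_size(Complete), which is the same as reading RestSize bytes earlier.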



It was a good idea to throw away non-Erlang IO, because there are no
lags in the Erlang file:read layer. But there are lags even in
binary:split. A NIF that returns a sub-binary up to the first EOL is
several times faster than binary:split(Bin, <<"\n">>).
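
For reference, here is a rough sketch of the pure-Erlang side of that comparison. It is not the NIF and not taken from csv_reader: the module first_line and its functions are hypothetical. It shows the binary:split/2 call from above, an equivalent built on binary:match/2, and a small timer:tc/1 loop that could be used to time either of them on a loaded file.

```erlang
-module(first_line).
-export([split/1, match/1, time/2]).

%% Baseline from the post: without the global option, binary:split/2
%% splits at the first "\n" only.
split(Bin) ->
    case binary:split(Bin, <<"\n">>) of
        [Line, Rest] -> {Line, Rest};
        [Rest] -> {incomplete, Rest}
    end.

%% Same result via binary:match/2 and sub-binary pattern matching.
match(Bin) ->
    case binary:match(Bin, <<"\n">>) of
        {Pos, 1} ->
            <<Line:Pos/binary, $\n, Rest/binary>> = Bin,
            {Line, Rest};
        nomatch ->
            {incomplete, Bin}
    end.

%% Walk Bin line by line with Fun and return {Microseconds, LineCount}.
time(Fun, Bin) ->
    timer:tc(fun() -> count(Fun, Bin, 0) end).

count(Fun, Bin, N) ->
    case Fun(Bin) of
        {incomplete, _} -> N;
        {_Line, Rest} -> count(Fun, Rest, N + 1)
    end.
```

Something like first_line:time(fun first_line:split/1, Bin) on the contents of example.csv would give a baseline to compare a NIF against. Only complete lines are counted; a trailing fragment without "\n" is left out.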


