[erlang-questions] Speed of CSV parsing: how to read 1M of lines in 1 second

Ulf Wiger <>
Fri Mar 23 12:29:15 CET 2012


open_port({spawn, "/bin/cat " ++ File}, [{line, MaxLen}, binary]) 

will pour the file, one line at a time, into your message queue. :)

Wicked fast, but no flow control.
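
A minimal sketch of the receiving side, in case it helps. The module name port_lines, the function count_lines/1 and the 4096-byte MaxLen are made up for illustration. I have added the exit_status option so the loop knows when cat is done, and I am assuming all data messages arrive before the exit_status message. It only counts lines; real code would parse each Line where the counter is bumped.

```erlang
-module(port_lines).
-export([count_lines/1]).

%% Hypothetical helper: stream File through /bin/cat and count the
%% newline-terminated lines that land in this process's message queue.
%% Note: {spawn, Command} splits the command on whitespace, so File must not
%% contain spaces in this sketch.
count_lines(File) ->
    Port = open_port({spawn, "/bin/cat " ++ File},
                     [{line, 4096}, binary, exit_status]),
    loop(Port, 0).

loop(Port, N) ->
    receive
        %% A complete line (or the final piece of an over-long line).
        {Port, {data, {eol, _Line}}} ->
            loop(Port, N + 1);
        %% A piece of a line longer than MaxLen, or a last line that has no
        %% trailing newline. Not counted here.
        {Port, {data, {noeol, _Part}}} ->
            loop(Port, N);
        %% cat has exited; assumed to arrive after all data messages.
        {Port, {exit_status, 0}} ->
            {ok, N};
        {Port, {exit_status, Status}} ->
            {error, {exit_status, Status}}
    end.
```

Since the port pushes lines as fast as cat can produce them, the message queue grows whenever the receiver falls behind, which is the flow-control problem mentioned above.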

Fredrik Svahn made a flow-control hack for stdin back in 2008, but I don't know (a) whether that's applicable to you or (b) whether it ever made it into the OTP source.

http://erlang.org/pipermail/erlang-bugs/2008-December/001136.html

Otherwise, opening the file in [raw, binary] mode and using re:split() ought to work reasonably well.
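
A rough sketch of that approach, assuming the whole file fits in memory and follows the strict format described in the quoted mail (comma-separated fields, "\n" after every line, no quoting). The module name csv_split and its functions are hypothetical, and the fields are left as binaries rather than converted to integers or dates.

```erlang
-module(csv_split).
-export([read/1, parse/1]).

%% Read the whole file through a raw, binary file descriptor and split it.
read(File) ->
    {ok, Fd} = file:open(File, [read, raw, binary]),
    try file:read(Fd, filelib:file_size(File)) of
        {ok, Bin} -> {ok, parse(Bin)};
        eof -> {ok, []};
        {error, _} = Err -> Err
    after
        file:close(Fd)
    end.

%% Split a binary into rows, and each row into a list of field binaries.
parse(<<>>) ->
    [];
parse(Bin) ->
    %% trim drops the empty element produced by the final "\n".
    Lines = re:split(Bin, <<"\n">>, [{return, binary}, trim]),
    [re:split(Line, <<",">>, [{return, binary}]) || Line <- Lines].
```

For example, csv_split:parse(<<"1,2,3\n4,5,6\n">>) gives [[<<"1">>,<<"2">>,<<"3">>],[<<"4">>,<<"5">>,<<"6">>]]. For a fixed separator like this, binary:split/3 with the global option would be a candidate replacement for re:split/3; I have not measured either.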

file:read_line() is indeed dreadfully slow, but is OTOH extremely nice in distributed embedded systems, as it supports IO redirection across nodes.

BR,
Ulf W

On 23 Mar 2012, at 11:30, Max Lapshin wrote:

> I need to load a large CSV file into memory very fast.
> 
> I've tried to use an Erlang parser, but my results were very bad (in fact
> file:read_line is very slow), so I've tried to make a NIF for it.
> The target speed is 1 microsecond per line.
> 
> 
> My CSV has a very strict format: only numbers, no quoting, \n at the
> end of each line. I also moved the parsing of integers and dates into the NIF.
> 
> My results are here: http://github.com/maxlapshin/csv_reader and I get
> only 15 microseconds per line: 4.5 seconds for a 300K-line CSV.
> 
> Currently I use fgets to read the file line by line. Maybe that is a
> bad idea and I should use mmap or implement a 1MB read buffer?
