[erlang-questions] Speed of CSV parsing: how to read 1M of lines in 1 second

Sun Mar 25 19:27:04 CEST 2012

I'm joining this thread rather late, ...

Would it be possible to change the problem a bit?

The fastest approach would seem to be
(assume a quad core):

    - split the file into four
    - parse each bit in parallel
    - combine the results

Well, splitting (the first part) is essentially sequential
(or at least difficult to do in parallel unless you really
understand your hardware, and it may be impossible to do in parallel).
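
To make the three steps concrete, here is a minimal sketch in Erlang.
The module name pcsv and all function names are made up for
illustration, and it assumes the whole file fits in memory, that every
field is a plain integer, and that lines end in \n (date parsing is
left out). Note that the split step still runs sequentially in the
calling process; only the parsing runs in parallel.

```erlang
-module(pcsv).
-export([parse/2, split/2, parse_chunk/1]).

%% Split Bin into at most N chunks, parse each chunk in its own
%% process, then combine the rows in their original order.
parse(Bin, N) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() -> Parent ! {Ref, parse_chunk(Chunk)} end),
                Ref
            end || Chunk <- split(Bin, N)],
    %% Receiving by reference keeps the results in chunk order.
    lists:append([receive {Ref, Rows} -> Rows end || Ref <- Refs]).

%% Sequential step: cut roughly equal chunks, always just after a
%% newline so that no line is shared between two chunks.
split(Bin, N) when N > 1, byte_size(Bin) > 0 ->
    Size = byte_size(Bin),
    Target = Size div N,
    case binary:match(Bin, <<"\n">>, [{scope, {Target, Size - Target}}]) of
        {Pos, 1} ->
            {Chunk, Rest} = split_binary(Bin, Pos + 1),
            [Chunk | split(Rest, N - 1)];
        nomatch ->
            [Bin]
    end;
split(Bin, _N) ->
    [Bin].

%% Assumed format: comma-separated integers, no quoting.
parse_chunk(Bin) ->
    Lines = [L || L <- binary:split(Bin, <<"\n">>, [global]), L =/= <<>>],
    [[binary_to_integer(F) || F <- binary:split(L, <<",">>, [global])]
     || L <- Lines].
```

For example, pcsv:parse(<<"1,2\n3,4\n">>, 2) parses one line in each
of two processes. If the producer wrote four files in the first place,
split/2 would disappear and each process would read its own file.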

I would assume that the input file has been created
by a sequential program - if this were the case
could it not produce four files instead of one?

If this were the case then the split into four bits would go away.

Do the bits have to be recombined later? - if not the
last bit can be removed as well.

The key to performance might be to redesign the entire
processing pipeline, making it as parallel as possible
as soon as possible, and keeping it as parallel as possible
for as long as possible.

Cheers

/Joe

On Fri, Mar 23, 2012 at 11:30 AM, Max Lapshin <max.lapshin@REDACTED> wrote:
> I need to load a large CSV file into memory very fast.
>
> I've tried to use an Erlang parser, but my results were very bad (in fact
> file:read_line is very slow), so I've tried to make a NIF for it.
> Target speed is 1 microsecond per line.
>
>
> My CSV has very strict format: only numbers, no quoting, \n in the
> end. Also I moved parsing of integers and date into NIF.
>
> My results are here: http://github.com/maxlapshin/csv_reader and I get
> only 15 microseconds:  4,5 seconds for 300K lines CSV.
>
> Currently I use fgets to read line by line from file. Maybe it is a
> bad idea and I should use mmap or implement 1MB buffer for read?