[erlang-questions] Speed of CSV parsing: how to read 1M of lines in 1 second

Mon Mar 26 09:44:58 CEST 2012

On Sunday, 25 March 2012 19:27:04 Joe Armstrong wrote:
> I'm joining this thread rather late, ...
> 
> Would it be possible to change the problem a bit?
> 
> The fastest approach would seem to be
> (assume a quad core):
> 
>     - split file into four
>     - parse each bit in parallel
>     - combine the results
> 
> Well splitting (the first part) is essentially sequential
> (or at least difficult to do in parallel unless you really
> understand your hardware - and may be impossible to do in parallel)

Maybe by opening the file 4 times (with the appropriate fseeks on the second,
third, etc. descriptors)?

Each descriptor can then read its own part of the file, so you get the split
for free (as long as you take care that the ranges you read do not overlap).
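
A rough sketch of that idea in C, in case it helps. Everything named here
(chunk_range, split_ranges, count_lines_in_range) is made up for illustration
and is not taken from csv_reader. It assumes every line ends in \n, as Max
describes, and it only counts lines where a real reader would parse fields.
The cut points are moved forward to the next line start so that no line is
shared between two ranges.

```c
#include <assert.h>
#include <stdio.h>

/* Half-open byte range [start, end) handed to one reader (hypothetical type). */
typedef struct {
    long start;
    long end;
} chunk_range;

/* Return the first line start at or after pos, or size if there is none. */
static long next_line_start(FILE *f, long pos, long size)
{
    if (pos <= 0) return 0;
    if (pos >= size) return size;
    /* Look at the byte before pos: if it is '\n', pos is already a line start. */
    if (fseek(f, pos - 1, SEEK_SET) != 0) return size;
    int c;
    while ((c = fgetc(f)) != EOF) {
        if (c == '\n') break;
    }
    long p = ftell(f);
    return p < 0 ? size : p;
}

/* Fill out[0..n-1] with contiguous, non-overlapping, line-aligned ranges.
 * Returns 0 on success, -1 on error. */
int split_ranges(const char *path, int n, chunk_range *out)
{
    assert(out != NULL);
    if (n <= 0) return -1;
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return -1; }
    long size = ftell(f);
    if (size < 0) { fclose(f); return -1; }

    long start = 0;
    for (int i = 0; i < n; i++) {
        /* Nominal cut at (i+1)/n of the file, pushed to the next line start. */
        long end = (i == n - 1) ? size
                                : next_line_start(f, size / n * (i + 1), size);
        if (end < start) end = start;
        out[i].start = start;
        out[i].end = end;
        start = end;
    }
    fclose(f);
    return 0;
}

/* Open a private descriptor, seek to the range and read only that range.
 * Counting '\n' stands in for the actual per-line parsing. */
long count_lines_in_range(const char *path, chunk_range r)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    if (fseek(f, r.start, SEEK_SET) != 0) { fclose(f); return -1; }
    long lines = 0;
    int c;
    for (long pos = r.start; pos < r.end && (c = fgetc(f)) != EOF; pos++) {
        if (c == '\n') lines++;
    }
    fclose(f);
    return lines;
}
```

Since each call to count_lines_in_range opens its own descriptor, the four
ranges could be given to four threads (or four NIF workers) without sharing
a file position. Whether the disk really serves the four reads in parallel is
a separate question, as Joe points out.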

> 
> I would assume that the input file has been created
> by a sequential program - if this were the case
> could it not produce four files instead of one?

> 
> If this were the case then the split into four bits would go away.
> 
> Do the bits have to be recombined later? - if not the
> last bit can be removed as well.
> 
> The key to performance might be to redesign the entire
> processing pipeline making it as parallel as possible
> as soon as possible and keeping it as parallel as possible as long as
> possible.
> 
> Cheers
> 
> /Joe
> 
> On Fri, Mar 23, 2012 at 11:30 AM, Max Lapshin <max.lapshin@REDACTED> wrote:
> > I need to load a large CSV file into memory very fast.
> > 
> > I've tried to use an Erlang parser, but my results were very bad (in fact
> > file:read_line is very slow), so I've tried to make a NIF for it.
> > Target speed is 1 microsecond per line.
> > 
> > 
> > My CSV has a very strict format: only numbers, no quoting, \n at the
> > end of each line. Also I moved parsing of integers and dates into the NIF.
> > 
> > My results are here: http://github.com/maxlapshin/csv_reader and I get
> > only 15 microseconds per line: 4.5 seconds for a 300K-line CSV.
> > 
> > Currently I use fgets to read the file line by line. Maybe that is a
> > bad idea and I should use mmap or implement a 1MB read buffer?
> > _______________________________________________
> > erlang-questions mailing list
> > erlang-questions@REDACTED
> > http://erlang.org/mailman/listinfo/erlang-questions

-- 
-
>>----------------------------------------------------------------------------

 Angel J. Alvarez Miguel, Servicios Informáticos
 Edificio Torre de Control, Campus Externo UAH
 Alcalá de Henares 28806, Madrid    ** ESPAÑA **

-------------[taH pagh taHbe', DaH mu'tlheghvam vİqelnİS]--<<-