[erlang-questions] Speed of CSV parsing: how to read 1M of lines in 1 second
Angel J. Alvarez Miguel
Mon Mar 26 09:44:58 CEST 2012
On Sunday, 25 March 2012 19:27:04 Joe Armstrong wrote:
> I'm joining this thread rather late, ...
> Would it be possible to change the problem a bit?
> The fastest approach would seem to be
> (assume a quad core):
> - split file into four
> - parse each bit in parallel
> - combine the results
> Well splitting (the first part) is essentially sequential
> (or at least difficult to do in parallel unless you really
> understand your hardware - and may be impossible to do in parallel)
Maybe by opening the file 4 times (with appropriate fseeks on the second, third
and fourth descriptors)? You can then read data from every descriptor, so you
get your file split (as long as you take care not to overlap ranges during the
reads).
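
A minimal sketch of that idea in C, under some assumptions of mine: the
names split_offsets and count_lines_in_range are made up for illustration,
every record ends with \n, and the file is small enough for long offsets.
The boundaries are pushed forward to the next newline so no range cuts a
line in half, and each range is then read through its own descriptor.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: fill offs[0..nparts] with byte offsets so that
 * part i covers [offs[i], offs[i+1]) and every boundary sits just after
 * a '\n'. Returns 0 on success, -1 on I/O error. */
int split_offsets(const char *path, int nparts, long *offs)
{
    assert(nparts > 0);
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return -1; }
    long size = ftell(f);
    if (size < 0) { fclose(f); return -1; }

    offs[0] = 0;
    for (int i = 1; i < nparts; i++) {
        long guess = size / nparts * i;          /* naive equal split */
        if (guess < offs[i - 1]) guess = offs[i - 1];
        if (fseek(f, guess, SEEK_SET) != 0) { fclose(f); return -1; }
        int c;
        while ((c = fgetc(f)) != EOF && c != '\n')
            ;                                    /* skip to end of line */
        offs[i] = (c == EOF) ? size : ftell(f);
    }
    offs[nparts] = size;
    fclose(f);
    return 0;
}

/* Hypothetical worker: open a private descriptor, seek to start and
 * count the lines in [start, end). Real code would parse here instead. */
long count_lines_in_range(const char *path, long start, long end)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    if (fseek(f, start, SEEK_SET) != 0) { fclose(f); return -1; }
    long n = 0, pos = start;
    int c;
    while (pos < end && (c = fgetc(f)) != EOF) {
        pos++;
        if (c == '\n') n++;
    }
    fclose(f);
    return n;
}
```

Each call to count_lines_in_range could run in its own thread or process,
since the ranges do not overlap and no descriptor is shared. Finding the
boundaries is still sequential, but it only touches a few bytes around
each split point.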
> I would assume that the input file has been created
> by a sequential program - if this were the case
> could it not produce four files instead of one?
> If this were the case then the split into four bit would go away.
> Do the bits have to be recombined later? - if not the
> last bit can be removed as well.
> The key to performance might be to redesign the entire
> processing pipeline making it as parallel as possible
> as soon as possible and keeping it as parallel as possible as long as possible.
> On Fri, Mar 23, 2012 at 11:30 AM, Max Lapshin <max.lapshin@REDACTED> wrote:
> > I need to load large CSV file into memory very fast.
> > I've tried to use an Erlang parser, but my results were very bad (in fact
> > file:read_line is very slow), so I've tried to make a NIF for it.
> > Target speed is 1 microsecond per line.
> > My CSV has a very strict format: only numbers, no quoting, \n at the
> > end. Also I moved the parsing of integers and dates into the NIF.
> > My results are here: http://github.com/maxlapshin/csv_reader and I get
> > only 15 microseconds per line: 4.5 seconds for a 300K-line CSV.
> > Currently I use fgets to read line by line from the file. Maybe it is a
> > bad idea and I should use mmap or implement a 1MB buffer for reads?
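
For the 1MB buffer variant, something like the following might be a
starting point. It is only a sketch under my own assumptions: the names
scan_csv and csv_stats are invented, fields are non-negative decimal
integers (no dates, no signs), and it just counts and sums instead of
building Erlang terms. The point is that one fread per megabyte replaces
one fgets per line, and the per-character state survives buffer
boundaries.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_SIZE (1 << 20)   /* 1 MB read buffer */

/* Hypothetical result type for the sketch. */
struct csv_stats {
    long lines;      /* number of '\n' seen */
    long fields;     /* number of fields terminated by ',' or '\n' */
    long long sum;   /* sum of all field values */
};

/* Scan a strict numeric CSV using large fread calls.
 * Returns 0 on success, -1 on I/O error or unexpected character. */
int scan_csv(const char *path, struct csv_stats *st)
{
    assert(st != NULL);
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    char *buf = malloc(BUF_SIZE);
    if (!buf) { fclose(f); return -1; }

    st->lines = 0;
    st->fields = 0;
    st->sum = 0;
    long long cur = 0;   /* value of the field being read; kept across buffers */
    int rc = 0;
    size_t n;

    while (rc == 0 && (n = fread(buf, 1, BUF_SIZE, f)) > 0) {
        for (size_t i = 0; i < n; i++) {
            char c = buf[i];
            if (c >= '0' && c <= '9') {
                cur = cur * 10 + (c - '0');
            } else if (c == ',' || c == '\n') {
                st->fields++;
                st->sum += cur;
                cur = 0;
                if (c == '\n') st->lines++;
            } else {
                rc = -1;   /* outside the strict format assumed here */
                break;
            }
        }
    }
    if (ferror(f)) rc = -1;
    free(buf);
    fclose(f);
    return rc;
}
```

Whether this actually beats fgets (or mmap) on a given machine is
something to measure rather than assume.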
> > _______________________________________________
> > erlang-questions mailing list
> > erlang-questions@REDACTED
> > http://erlang.org/mailman/listinfo/erlang-questions
Angel J. Alvarez Miguel, Servicios Informáticos
Edificio Torre de Control, Campus Externo UAH
Alcalá de Henares 28806, Madrid ** ESPAÑA **
-------------[taH pagh taHbe', DaH mu'tlheghvam vİqelnİS]--<<-