[erlang-questions] Speed of CSV parsing: how to read 1M of lines in 1 second

Sat Mar 24 17:37:35 CET 2012

On 24 Mar 2012, at 11:36, Max Lapshin wrote:

> On Sat, Mar 24, 2012 at 2:11 AM, Tim Watson <watson.timothy@REDACTED> wrote:
>> So if you are willing to lose some space efficiency (which you would with mmap anyway) then reading the entire binary into memory is a lot faster:
>> 
> 
> Tim, looks like your solution is twice as fast due to using threads. Am I right?

Yes, although I still haven't quite hit the sweet spot with this approach. You can chunk your way through the file sequentially (using file:read/2 on a file opened with [raw, binary]) in 64k chunks and break these up into lines in around 533ms. Breaking up the 300k individual lines on the comma and collecting the results requires a bit more thought about how best to split the work up, so currently there's not really that much of an improvement. I am playing around with this to see what's possible though, as it's an interesting problem.
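
For anyone following along, here is a rough sketch of the sequential chunking described above. It is not the code that was timed: the module name chunk_lines and the functions fold_lines/3, count_lines/1 and split_fields/1 are made up for illustration, and it assumes lines end in "\n" and that fields contain no quoted commas. It only uses file:open/2, file:read/2 and binary:split/3 from the standard library.

```erlang
-module(chunk_lines).
-export([fold_lines/3, count_lines/1, split_fields/1]).

%% 64k read size, as mentioned in the mail above.
-define(CHUNK, 65536).

%% Fold Fun(Line, Acc) over every line of the file at Path,
%% reading it sequentially in ?CHUNK-sized pieces.
fold_lines(Path, Fun, Acc0) ->
    {ok, Fd} = file:open(Path, [read, raw, binary]),
    try
        loop(Fd, <<>>, Fun, Acc0)
    after
        file:close(Fd)
    end.

loop(Fd, Rest, Fun, Acc) ->
    case file:read(Fd, ?CHUNK) of
        {ok, Data} ->
            %% Prepend the unfinished line left over from the previous chunk,
            %% then split the whole buffer on newlines.
            Parts = binary:split(<<Rest/binary, Data/binary>>, <<"\n">>, [global]),
            %% The last part may be an incomplete line; carry it forward.
            [Tail | RevLines] = lists:reverse(Parts),
            Lines = lists:reverse(RevLines),
            loop(Fd, Tail, Fun, lists:foldl(Fun, Acc, Lines));
        eof when Rest =:= <<>> ->
            Acc;
        eof ->
            %% File did not end with a newline: emit the final partial line.
            Fun(Rest, Acc);
        {error, Reason} ->
            error(Reason)
    end.

%% Example use: count the lines in a file.
count_lines(Path) ->
    fold_lines(Path, fun(_Line, N) -> N + 1 end, 0).

%% Naive field split on commas (assumption: no quoting or escaping).
split_fields(Line) ->
    binary:split(Line, <<",">>, [global]).
```

Calling chunk_lines:count_lines("data.csv") walks the file once without holding more than one chunk plus a partial line in memory. The per-line split_fields/1 step is the part that still needs thought if it is to be spread across processes.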