[erlang-questions] Speed of CSV parsing: how to read 1M of lines in 1 second

Fri Mar 23 18:31:45 CET 2012

On Fri, Mar 23, 2012 at 9:28 PM, Tim Watson <watson.timothy@REDACTED> wrote:
> The problem doesn't appear to be anything to do with the speed of read_line, but rather one of binary splitting/processing instead. Consider this example, which simply uses file:read_line to get a list of lines from the 300k csv file:
>
> read_line time: 1365
> read size: 300000

But to be honest, 1365 milliseconds is 10 times slower than the plain C variant.

> t4@REDACTED:csv_reader $
>
> But just using binary:split/3 on the individual lines, without even processing the column cells, slows down considerably:
>
> t4@REDACTED:csv_reader $ ./benchmark.erl example.csv
> ./benchmark.erl:23: Warning: type transform_fun() is unused
> read_line time: 12654

And naive splitting in C is 3 times faster. Can you run my code
on your machine so that we can normalize the numbers?
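
For reference, a minimal sketch of what "naive splitting" in C could look like. This is only an illustration under stated assumptions, not the benchmark code referred to above: the names split_line and count_fields are hypothetical, the separator is assumed to be a comma, quoted fields are not handled, and lines are assumed to fit in a 64 KB buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: split `line` in place on `sep`, storing a pointer to
 * the start of each field in `fields`. Returns the number of fields stored.
 * Naive on purpose: no quoting or escaping is handled. If the line has more
 * than `max_fields` fields, the last stored field keeps the unsplit rest. */
size_t split_line(char *line, char sep, char **fields, size_t max_fields)
{
    assert(line != NULL && fields != NULL);
    if (max_fields == 0)
        return 0;

    /* Strip a trailing "\n" or "\r\n" left by fgets(). */
    size_t len = strlen(line);
    while (len > 0 && (line[len - 1] == '\n' || line[len - 1] == '\r'))
        line[--len] = '\0';

    size_t n = 0;
    char *p = line;
    fields[n++] = p;
    while (n < max_fields && (p = strchr(p, sep)) != NULL) {
        *p++ = '\0';      /* terminate the previous field */
        fields[n++] = p;  /* next field starts right after the separator */
    }
    return n;
}

/* Hypothetical driver: read a stream line by line and return the total
 * number of fields seen. Assumes every line fits into the buffer. */
long count_fields(FILE *fp)
{
    char buf[65536];
    char *fields[256];
    long total = 0;

    while (fgets(buf, sizeof buf, fp) != NULL)
        total += (long)split_line(buf, ',', fields, 256);
    return total;
}
```

Wrapping a call to count_fields() on the opened CSV file between two clock() readings would give a number roughly comparable to the read_line plus binary:split/3 timing quoted above, keeping in mind that this version allocates nothing per field.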