[erlang-questions] Speed of CSV parsing: how to read 1M of lines in 1 second

Fri Mar 23 23:15:03 CET 2012

Forgot to cc the list.

On 23 Mar 2012, at 19:27, Tim Watson wrote:

> On 23 Mar 2012, at 19:19, Tim Watson wrote:
> 
>> On 23 Mar 2012, at 17:31, Max Lapshin wrote:
>> 
>>> On Fri, Mar 23, 2012 at 9:28 PM, Tim Watson <watson.timothy@REDACTED> wrote:
>>>> The problem doesn't appear to be the speed of read_line, but rather the binary splitting/processing. Consider this example, which simply uses file:read_line to read the lines of the 300k-line csv file into a list:
>>>> 
>>>> read_line time: 1365
>>>> read size: 300000
>>> 
>>> But to be honest, 1365 milliseconds is 10 times slower than the plain C variant.
>> 
>> Sure, I appreciate that.
>> 
>>> 
>>>> t4@REDACTED:csv_reader $
>>>> 
>>>> But just using binary:split/3 on the individual lines, without even processing the column cells, slows things down considerably:
>>>> 
>>>> t4@REDACTED:csv_reader $ ./benchmark.erl example.csv
>>>> ./benchmark.erl:23: Warning: type transform_fun() is unused
>>>> read_line time: 12654
>>> 
>>> And naive splitting in C is 3 times faster. Can you run my code
>>> on your machine to normalize the numbers?
>> 
>> t4@REDACTED:csv_reader $ ./csv_bench.erl example.csv 
>> Load csv_reader: ok
>> Load time: 5300
>> t4@REDACTED:csv_reader $ ./csv_bench.erl example.csv 
>> Load csv_reader: ok
>> Load time: 5213
>> t4@REDACTED:csv_reader $ evm info
>> R15B compiled for i386-apple-darwin10.8.0, 64bit
>> 
> 
> And again with a lower erts version:
> 
> t4@REDACTED:csv_reader $ ./csv_bench.erl example.csv 
> Load csv_reader: ok
> Load time: 5518
> t4@REDACTED:csv_reader $ evm info
> R14B01 compiled for i386-apple-darwin10.5.0, 64bit
> t4@REDACTED:csv_reader $ 
> 
> Hardware/OS profile: Apple MacBook Pro running OS X 10.6.8 (Snow Leopard), 2.8GHz Intel Core 2 Duo, 8GB RAM.
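
For anyone who wants to reproduce the two measurements discussed above, here is a minimal sketch of the kind of test involved. It is not the actual benchmark.erl or csv_bench.erl from this thread; the module name csv_sketch and its function names are made up for illustration. It assumes plain comma-separated lines ending in "\n", with no quoted fields and no CRLF handling. Unlike the numbers quoted above, it times the read_line pass and the binary:split/3 pass separately, so the cost of each step is visible on its own.

```erlang
-module(csv_sketch).
-export([read_lines/1, split_line/1, bench/1]).

%% Read every line of Path into a list of binaries with file:read_line/1.
%% The 64KB read_ahead buffer size is an arbitrary choice.
read_lines(Path) ->
    {ok, Fd} = file:open(Path, [read, raw, binary, {read_ahead, 65536}]),
    try
        collect(Fd, [])
    after
        file:close(Fd)
    end.

%% Accumulate lines until eof; an {error, Reason} result crashes here
%% with a case_clause error, which is acceptable for a benchmark sketch.
collect(Fd, Acc) ->
    case file:read_line(Fd) of
        {ok, Line} -> collect(Fd, [Line | Acc]);
        eof -> lists:reverse(Acc)
    end.

%% Naive column split: drop one trailing newline, then split on every comma.
%% Quoted fields containing commas are NOT handled.
split_line(Line) ->
    binary:split(strip_newline(Line), <<",">>, [global]).

strip_newline(<<>>) ->
    <<>>;
strip_newline(Line) ->
    case binary:last(Line) of
        $\n -> binary:part(Line, 0, byte_size(Line) - 1);
        _ -> Line
    end.

%% Time the two passes separately; timer:tc/3 reports microseconds.
bench(Path) ->
    {ReadUs, Lines} = timer:tc(?MODULE, read_lines, [Path]),
    {SplitUs, Rows} = timer:tc(lists, map, [fun split_line/1, Lines]),
    io:format("read_line time: ~p ms~n", [ReadUs div 1000]),
    io:format("split time: ~p ms~n", [SplitUs div 1000]),
    io:format("read size: ~p~n", [length(Rows)]),
    ok.
```

After compiling with erlc csv_sketch.erl, calling csv_sketch:bench("example.csv") from the shell prints the two timings and the line count. Absolute numbers will of course depend on the machine and erts version.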