[erlang-questions] Speed of CSV parsing: how to read 1M of lines in 1 second
Tim Watson
watson.timothy@REDACTED
Fri Mar 23 23:15:03 CET 2012
Forgot to cc the list....
On 23 Mar 2012, at 19:27, Tim Watson wrote:
> On 23 Mar 2012, at 19:19, Tim Watson wrote:
>
>> On 23 Mar 2012, at 17:31, Max Lapshin wrote:
>>
>>> On Fri, Mar 23, 2012 at 9:28 PM, Tim Watson <watson.timothy@REDACTED> wrote:
>>>> The problem doesn't appear to have anything to do with the speed of read_line, but rather with binary splitting/processing. Consider this example, which simply uses file:read_line to get a list of lines from the 300k-line CSV file:
>>>>
>>>> read_line time: 1365
>>>> read size: 300000
>>>
>>> But to be honest, 1365 milliseconds is 10 times slower than the plain C variant.
>>
>> Sure, I appreciate that.
>>
>>>
>>>> t4@REDACTED:csv_reader $
>>>>
>>>> But just using binary:split/3 on the individual lines, without even processing the column cells, slows things down considerably:
>>>>
>>>> t4@REDACTED:csv_reader $ ./benchmark.erl example.csv
>>>> ./benchmark.erl:23: Warning: type transform_fun() is unused
>>>> read_line time: 12654
>>>
>>> And naive splitting in C is 3 times faster. Can you run my code
>>> on your machine to normalize the numbers?
>>
>> t4@REDACTED:csv_reader $ ./csv_bench.erl example.csv
>> Load csv_reader: ok
>> Load time: 5300
>> t4@REDACTED:csv_reader $ ./csv_bench.erl example.csv
>> Load csv_reader: ok
>> Load time: 5213
>> t4@REDACTED:csv_reader $ evm info
>> R15B compiled for i386-apple-darwin10.8.0, 64bit
>>
>
> And again with a lower erts version:
>
> t4@REDACTED:csv_reader $ ./csv_bench.erl example.csv
> Load csv_reader: ok
> Load time: 5518
> t4@REDACTED:csv_reader $ evm info
> R14B01 compiled for i386-apple-darwin10.5.0, 64bit
> t4@REDACTED:csv_reader $
>
> Hardware/OS profile: Apple MacBook Pro running OS X 10.6.8 (Snow Leopard), 2.8GHz Intel Core 2 Duo, 8GB RAM.
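
For anyone wanting to reproduce the comparison, here is a minimal sketch of the two measurements discussed above: reading all lines with file:read_line/1, then running binary:split/3 over each line. It is not the benchmark.erl or csv_bench.erl used in this thread (neither is shown here). The comma delimiter, the 1MB read_ahead buffer, the "split time" label and the helper read_lines/2 are assumptions made for illustration only.

```erlang
#!/usr/bin/env escript
%% Hypothetical sketch, not the scripts from this thread.
%% Usage (script name is an assumption): ./split_bench.erl example.csv
-mode(compile).

main([Path]) ->
    %% raw + binary + read_ahead keeps file:read_line/1 buffered and cheap.
    {ok, Fd} = file:open(Path, [read, raw, binary, {read_ahead, 1024 * 1024}]),
    %% timer:tc/1 returns {Microseconds, Result}.
    {ReadUs, Lines} = timer:tc(fun() -> read_lines(Fd, []) end),
    ok = file:close(Fd),
    io:format("read_line time: ~p~n", [ReadUs div 1000]),
    io:format("read size: ~p~n", [length(Lines)]),
    %% Split every line on commas without touching the cells themselves.
    %% Each line still carries its trailing newline, so the last cell does too.
    {SplitUs, _Rows} = timer:tc(fun() ->
        [binary:split(Line, <<",">>, [global]) || Line <- Lines]
    end),
    io:format("split time: ~p~n", [SplitUs div 1000]);
main(_) ->
    io:format("usage: split_bench.erl <file.csv>~n"),
    halt(1).

%% Accumulate lines in reverse and flip once at eof.
read_lines(Fd, Acc) ->
    case file:read_line(Fd) of
        {ok, Line} -> read_lines(Fd, [Line | Acc]);
        eof -> lists:reverse(Acc)
    end.
```

Timing the two phases separately makes it easier to see how much of the total is spent in splitting rather than in reading. The numbers will not be directly comparable to the ones quoted above unless the same file and machine are used.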