[erlang-questions] Speed of CSV parsing: how to read 1M of lines in 1 second

Toby Thain toby@REDACTED
Tue Mar 27 01:36:10 CEST 2012


On 26/03/12 6:36 AM, Michael Turner wrote:
>> I really can't understand why should parsing be slower than reading from HDD =)
>
> Are you converting the ASCII-coded floating point numbers to actual
> floating point? That's actually quite a lot more overhead per
> character than ... well, anything else I can think of in processing a
> CSV file.
>
> And what do these numbers look like? Do they repeat? Are they short?
> Or are they high-precision and varying wildly in order of magnitude,
> and widely distributed statistically?

I see where you're heading ... by the look of those 4-digit numbers,
the FP conversion could be done by a lookup in a 10,000-element array -
assuming this is cheaper than the straightforward conversion.
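
As a rough sketch of the lookup idea (Python here for brevity, not a
tuned Erlang implementation): the field layout is an assumption, since
the actual data is not shown in this thread. It assumes every field is
exactly "DD.DD", i.e. four digits with the decimal point in a fixed
position. The names TABLE, SCALE, field_to_float and parse_line are
hypothetical.

```python
# Assumption: each CSV field is exactly five bytes, "DD.DD"
# (four digits, fixed decimal point). Real data may differ.
SCALE = 100  # assumed: two digits after the decimal point

# Precompute every possible value once: index 1234 -> 12.34.
TABLE = [i / SCALE for i in range(10_000)]

ZERO = ord("0")


def field_to_float(field: bytes) -> float:
    """Map b"12.34" to a float via integer arithmetic and one table lookup."""
    index = ((field[0] - ZERO) * 1000
             + (field[1] - ZERO) * 100
             + (field[3] - ZERO) * 10   # field[2] is the '.' and is skipped
             + (field[4] - ZERO))
    return TABLE[index]


def parse_line(line: bytes) -> list:
    """Split one CSV line on commas and convert each field by lookup."""
    return [field_to_float(f) for f in line.rstrip(b"\r\n").split(b",")]
```

Whether this beats the straightforward conversion would have to be
measured; the table only removes the floating-point work, not the
per-byte scanning.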

--Toby

>
> -michael turner
>
>
>
> On Mon, Mar 26, 2012 at 5:37 PM, Max Lapshin <max.lapshin@REDACTED> wrote:
>> On Mon, Mar 26, 2012 at 12:33 PM, Robert Melton <rmelton@REDACTED> wrote:
>>>
>>> Agreed.  Do we have any baseline implementation in pure C or (insert
>>> fastest language/implementation you are aware of)?  I am working on
>>> speeding this up (and having a lot of fun!), but I have no idea the
>>> theory-craft maximum process speed (with proper escaping, etc) on my
>>> hardware.
>>>
>>
>> I really can't understand why should parsing be slower than reading from HDD =)
>>
>> However, it is slower. Currently I have 950 ms for 300K line CSV with
>> 40 float columns when read on cold system and 820 ms when read from
>> disk cache.
>>
>> Copying from kernel cache and byte-by-byte reading all data while
>> searching '\n' takes 100 ms (it is time of wc -l), so it takes about
>> 700 ms for erlang to parse + create all proper objects.
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions



