[erlang-questions] Speed of CSV parsing: how to read 1M of lines in 1 second

Michael Turner michael.eugene.turner@REDACTED
Mon Mar 26 12:36:28 CEST 2012


> I really can't understand why parsing should be slower than reading from HDD =)

Are you converting the ASCII-coded floating point numbers to actual
floating point? That's actually quite a lot more overhead per
character than ... well, anything else I can think of in processing a
CSV file.

And what do these numbers look like? Do they repeat? Are they short?
Or are they high-precision and varying wildly in order of magnitude,
and widely distributed statistically?
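
One rough way to check whether float conversion dominates is to time a pass that only splits fields against a pass that also converts each field. The sketch below is in Python rather than Erlang, and the row shape (40 float columns), the helper names, and the generated data are all assumptions for illustration. Absolute timings will not transfer to Erlang; only the method of comparison does.

```python
import random
import time


def make_csv(rows, cols=40, seed=0):
    """Build synthetic CSV text: `rows` lines of `cols` random floats (assumed shape)."""
    rng = random.Random(seed)
    lines = (
        ",".join(repr(rng.uniform(-1e6, 1e6)) for _ in range(cols))
        for _ in range(rows)
    )
    return "\n".join(lines) + "\n"


def split_only(text):
    """Split into lines and fields but leave every field as a string."""
    return [line.split(",") for line in text.splitlines()]


def split_and_convert(text):
    """Split into lines and fields, then convert every field to a float."""
    return [[float(f) for f in line.split(",")] for line in text.splitlines()]


def time_ms(fn, text):
    """Wall-clock time of one call to fn(text), in milliseconds."""
    start = time.perf_counter()
    fn(text)
    return (time.perf_counter() - start) * 1000.0


if __name__ == "__main__":
    data = make_csv(rows=20_000)
    print(f"split only:        {time_ms(split_only, data):.1f} ms")
    print(f"split and convert: {time_ms(split_and_convert, data):.1f} ms")
```

Swapping make_csv for data that resembles the real file (short, repeating values versus high-precision ones spanning many orders of magnitude) would show how much the shape of the numbers matters.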

-michael turner



On Mon, Mar 26, 2012 at 5:37 PM, Max Lapshin <max.lapshin@REDACTED> wrote:
> On Mon, Mar 26, 2012 at 12:33 PM, Robert Melton <rmelton@REDACTED> wrote:
>>
>> Agreed.  Do we have any baseline implementation in pure C or (insert
>> fastest language/implementation you are aware of)?  I am working on
>> speeding this up (and having a lot of fun!), but I have no idea what the
>> theoretical maximum processing speed (with proper escaping, etc.) is on
>> my hardware.
>>
>
> I really can't understand why parsing should be slower than reading from HDD =)
>
> However, it is slower. Currently it takes 950 ms to read a 300K-line CSV
> with 40 float columns on a cold system, and 820 ms when reading from
> the disk cache.
>
> Copying from the kernel cache and reading all the data byte by byte while
> searching for '\n' takes 100 ms (that is the time for wc -l), so Erlang
> spends about 700 ms parsing and creating all the proper objects.
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions


