[erlang-questions] Speed of CSV parsing: how to read 1M of lines in 1 second

Mon Mar 26 08:59:52 CEST 2012

On 26 March 2012 04:00, Michael Turner <michael.eugene.turner@REDACTED> wrote:
>> The kind of lines in the generated example.csv BTW look like this:
>>
>> KEY1,20120201,12:15:44.543,34.28,54.74,16.74,88.51,32.48,15.7,54.19,71.52,69.5,55.35,3.9,20.08,33.83,63.43,12.4,9.66,0.29,59.61,52.94,82.49,78.96,70.52,55.73,79.37,61.25,54.19,49.31,14.14,40.18,21.39,82.26,40.79,36.57,86.14,39.58,28.3,20.1,24.07,51.35,8.38,zz
>
> Which made me wonder: is Max actually trying to parse a CSV file of
> *floating point* numbers, or just numbers that *happen* to be
> expressed in floating point notation but that can be far more narrowly
> characterized? If you have 1 million lines of positive 4-digit fixed
> point numbers with only two significant digits past the decimal point
> (as the above would suggest), the numbers will repeat (on average)
> about 100 times in the file. So you could at least save yourself a
> factor of 100 in the text-to-float conversion part of the problem.
>
> This *general* problem of parsing CSV is important too, of course. But
> the *general* solution could also (FTW) admit of such approaches in
> its API.
>

Hi Michael. Yes, I agree with this, and in my own experiments (and in
Dmitry's library) there is no data type conversion going on. Being
able to parse a 300k-line CSV in around a second is a pretty good
achievement (once it's stable), and if turning off any automatic
to_int/to_float conversion is necessary to save a few seconds of
processing time, then I think most users will be happy with that.
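
To make the two ideas concrete, here is a rough sketch, written in
Python only for brevity rather than in Erlang. The names parse_line,
float_cols and make_cached_to_float are made up for illustration and
are not taken from my experiments or from Dmitry's library, and the
naive comma split assumes there are no quoted fields. Conversion is off
unless the caller asks for it, and when it is on, each distinct piece
of text is converted only once, along the lines Michael suggests.

```python
def make_cached_to_float():
    """Return a text-to-float function that converts each distinct string once."""
    cache = {}

    def to_float(text):
        value = cache.get(text)
        if value is None:
            # First time this exact text is seen: convert and remember it.
            value = cache[text] = float(text)
        return value

    return to_float


def parse_line(line, float_cols=(), to_float=float):
    """Split one CSV line on commas (assumes no quoted fields).

    Fields stay as strings unless their index is listed in float_cols.
    """
    fields = line.rstrip("\r\n").split(",")
    for i in float_cols:
        fields[i] = to_float(fields[i])
    return fields


line = "KEY1,20120201,12:15:44.543,34.28,54.74,34.28,zz"

# Default: no type conversion at all, every field is returned as text.
raw = parse_line(line)

# Opt-in: convert only columns 3..5, sharing one cache across lines.
to_float = make_cached_to_float()
row = parse_line(line, float_cols=range(3, 6), to_float=to_float)
```

Whether the cache actually beats plain conversion depends on the
runtime and on how often values repeat, so it would need measuring on
the real data.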

As you say, Max seems to have more specialised requirements, given his
goal of parsing 1 million rows in 1 second (on commodity hardware?).

> -michael turner