[erlang-questions] Speed of CSV parsing: how to read 1M of lines in 1 second

Mon Mar 26 14:16:28 CEST 2012

Hi Dmitry,

On 25 March 2012 11:09, Dmitry Kolesnikov <dmkolesnikov@REDACTED> wrote:
> Hello,
>
> 1. Yes, I admit that I missed the example file from the beginning of the thread. I've put my set generators into the repository. You can use them like this:
> sh ./priv/dset line-ex.txt 1000 > ./priv/set-1M-ex.txt
> The first parameter is the line, the second is the number of kilolines to generate. line-16/48 are my original examples; line-ex.txt is from the original example.
>

What is line-ex.txt supposed to contain - is it a single line from the
original example? - I can't find it in the repo.

> 2. I am a bit curious why we are getting results of different magnitude, especially with Robert's run. My HW config is pretty close. So, I've built Erlang R15B from sources with the following config:
> ./configure --prefix=/usr/local/otp-R15B --enable-threads --enable-smp-support --enable-kernel-poll --enable-sctp --enable-hipe --disable-dynamic-ssl-lib --enable-darwin-64bit --enable-m64-build --without-javac

I'll rebuild that way, but essentially my build config looked
something like this (from config.log):

configure:4674: running /bin/sh
'/Users/t4/.kerl/builds/r15b-64-hipe-smp/otp_src_R15B/lib/configure'
--prefix=/usr/local  '--cache-file=/dev/null' '--enable-hipe'
'--enable-darwin-64bit' '--enable-threads' '--enable-smp'
'--enable-kernel-poll'
'ERL_TOP=/Users/t4/.kerl/builds/r15b-64-hipe-smp/otp_src_R15B'
--cache-file=/dev/null
--srcdir=/Users/t4/.kerl/builds/r15b-64-hipe-smp/otp_src_R15B/lib

>
> 3. I've re-run the cases with and without the +native flag. The results are very interesting: we can parse 300K lines in less than a second, but less than a second for 1M rows is challenging:
>
> Parse time for the 300K data set:
> set-300K-16     653 ms / 2.18 us per line
> set-300K-48   1832 ms / 6.11 us per line
> set-300K-ex   2561 ms / 8.54 us per line
>
> Parse time for the 300K data set, +native:
> set-300K-16     277 ms / 0.92 us per line
> set-300K-48     672 ms / 2.24 us per line
> set-300K-ex     925 ms / 3.09 us per line
>
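
As a sanity check on the per-line figures quoted above, the conversion
is just total parse time divided by line count. A minimal sketch (the
module and function names here are made up for illustration and are
not part of the csv code under discussion):

```erlang
-module(per_line).
-export([us_per_line/2]).

%% Convert a total parse time in milliseconds into microseconds per line.
%% TotalMs may be an integer or a float; Lines must be a positive integer.
us_per_line(TotalMs, Lines) when is_integer(Lines), Lines > 0 ->
    TotalMs * 1000 / Lines.
```

For example, per_line:us_per_line(653, 300000) comes out at roughly
2.18, which matches the set-300K-16 row.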

I can't replicate this even with +native turned on (for both the csv
modules and csv_example), and I'm bemused, as you don't have 'that
much more' horsepower in your 4 cores than my MacBook Pro's 2. Here's
what I get for parsing the original file. As you would expect, there
is very little difference when setting the number of segments to
anything higher than the number of CPUs:

5> csv_example:import("example.csv", 40).
size (MB): 74.958283
read (ms): 109.798000
parse (ms): 3239.264000
266257
6> csv_example:import("example.csv", 80).
size (MB): 74.958283
read (ms): 105.949000
parse (ms): 3231.541000
270352
7> csv_example:import("example.csv", 4).
size (MB): 74.958283
read (ms): 102.250000
parse (ms): 3275.317000
274450
8> csv_example:parse("example.csv", 4).
lines: 300001
size (MB): 74.958283
read (ms): 102.768000
parse (ms): 2737.098000
per line (us): 9.123630
ok
9> csv_example:parse("example.csv", 44).
lines: 300001
size (MB): 74.958283
read (ms): 106.153000
parse (ms): 2775.781000
per line (us): 9.252572
ok
10> csv_example:parse("example.csv", 2).
lines: 300001
size (MB): 74.958283
read (ms): 104.410000
parse (ms): 2758.367000
per line (us): 9.194526
ok
11> csv_example:import("example.csv", 2).
size (MB): 74.958283
read (ms): 108.705000
parse (ms): 3390.453000
278547

How did you build Max's example file? I'm really struggling to
understand how I've got 2 seconds more processing time for such a
similar setup.
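
To compare environments like for like, it may help to time the read
and the parse separately with timer:tc, in the same shape as the
output above. This is only a rough sketch: csv_timing is a
hypothetical module, the parser is a naive binary:split/3 one with no
quote or escape handling (not the csv module from this thread), it
runs in a single process, and the MB figure assumes 1024 * 1024 bytes.

```erlang
-module(csv_timing).
-export([run/1, parse/1]).

%% Read the file, parse it, and print timings. Hypothetical helper,
%% not the csv_example module used in the shell session above.
run(Path) ->
    {ReadUs, {ok, Bin}} = timer:tc(file, read_file, [Path]),
    {ParseUs, Rows} = timer:tc(fun() -> parse(Bin) end),
    Lines = length(Rows),
    io:format("lines: ~b~n", [Lines]),
    io:format("size (MB): ~f~n", [byte_size(Bin) / (1024 * 1024)]),
    io:format("read (ms): ~f~n", [ReadUs / 1000]),
    io:format("parse (ms): ~f~n", [ParseUs / 1000]),
    %% timer:tc reports microseconds, so this is already us per line.
    io:format("per line (us): ~f~n", [ParseUs / max(Lines, 1)]),
    ok.

%% Naive parser: split on newlines, drop empty lines, split on commas.
parse(Bin) ->
    [binary:split(Line, <<",">>, [global])
     || Line <- binary:split(Bin, <<"\n">>, [global]), Line =/= <<>>].
```

Running csv_timing:run("example.csv") with and without +native on
both machines would at least show whether the gap is in the parser or
in the environment.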

> Parse time for the 1M data set:
> set-300K-16   4406 ms / 2.20 us per line
> set-300K-48   6076 ms / 6.08 us per line
> set-300K-ex   8670 ms / 8.67 us per line
>
> Parse time for the 1M data set, +native:
> set-300K-16    1908 ms / 0.95 us per line
> set-300K-48    2293 ms / 2.29 us per line
> set-300K-ex    3119 ms / 3.12 us per line
>
> 4. It might be unfair to comment, but I have a good feeling that Max is evaluating/challenging a type of trade system ;-) Otherwise, why would you have such hard latency requirements, and why does the dataset look like a price snapshot ;-) IMHO, 2 - 4 us parse time per row is acceptable for a "normal" web app.

I agree your numbers seem perfectly reasonable, but I'm not able to
reproduce them. I'm going to try on a 2 core linux machine later on
and see how I get on.

>
> Regards, Dmitry
>
> On Mar 25, 2012, at 5:57 AM, Tim Watson wrote:
>
>> On 25 Mar 2012, at 03:54, Tim Watson wrote:
>>> Please also do run this code on some other (faster) machines to see how much of the slowness is specific to my machine.
>>
>> Oh and can you please pass on the set and set2 text files so I can test them on my machine to baseline the comparative differences between our environments?
>>
>> Cheers!
>>
>> Tim
>