[erlang-questions] Speed of CSV parsing: how to read 1M of lines in 1 second

Mon Mar 26 22:48:02 CEST 2012

Thanks for posting back Dmitry. I'll go through this exercise tomorrow
morning and let you know how I get on. For now, I need a little sleep.
:)

On 26 March 2012 21:13, Dmitry Kolesnikov <dmkolesnikov@REDACTED> wrote:
> Hello Tim, Max, et al,
>
> Let's start from the beginning... it looks like we got a bit mixed up.
>
> 1. I've cleaned up the repository. The folder ./priv contains a Perl script, gen_set.pl, that produces data sets in Max's original format:
>
> As an example:
> key299991,20120326,21:06:31.543,24.16,92.39,22.68,1.71,43.50,53.29,90.53,10.05,91.01,80.66,23.09,18.55,41.38,98.90,61.31,40.44,14.26,42.23,78.22,54.78,18.86,11.72,97.45,47.39,61.50,39.98,73.25,51.65,9.84,42.33,18.84,23.52,60.11,94.82,55.87,86.04,25.12,20.40,32.69,4.51,zz
>
> 2. For historical reasons I run the tests against three data sets that differ from each other in the number of fields (float numbers). They are:
>  * 300K lines,  each line contains   8 fields, ~23MB
>  * 300K lines,  each line contains 24 fields, ~50MB
>  * 300K lines,  each line contains 40 fields, ~77MB <- Max's original file
>
> You can generate those data sets via
> perl priv/gen_set.pl 300   8 > priv/set-300K-8.txt
> perl priv/gen_set.pl 300 24 > priv/set-300K-24.txt
> perl priv/gen_set.pl 300 40 > priv/set-300K-40.txt
> or via make example if you are a fan of GNU build like I am
>
> 2. I have made a number of tests / examples to validate the performance. Essentially we are talking about ETL processes here:
>  * Extract data from csv file
>  * Transform csv lines into some other format
>  * Load data to some storage for processing
>
> The following test cases were implemented:
>  * extract data from CSV: this operation just parses a file and does nothing with the data. The objective here is to validate the speed of the parser as such
>  * extract/transform: it parses a CSV file and calculates a rolling hash against the data
>  * extract/transform: it parses a CSV file and converts each list into a tuple. The tuple is not stored anywhere. The objective here is to validate the speed of the parser + transform operation
>  * extract/transform/load: it parses a CSV file, converts it to tuples and stores them in some in-memory storage.
>
> You can find those cases at priv/csv_example.erl and run them by csv_example:run(80)
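
The extract and transform steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in priv/csv_example.erl: the module name csv_sketch and its functions are hypothetical, the line is assumed to contain no quoted fields, and erlang:phash2/1 stands in for whatever rolling hash the real test uses.

```erlang
-module(csv_sketch).
-export([parse_line/1, to_tuple/1, line_hash/1]).

%% Extract: split one CSV line (a binary without the trailing newline)
%% into a list of field binaries. Assumes no quoting or embedded commas.
parse_line(Line) when is_binary(Line) ->
    binary:split(Line, <<",">>, [global]).

%% Transform: turn the list of fields into a tuple.
to_tuple(Line) ->
    list_to_tuple(parse_line(Line)).

%% Transform: fold a hash over the fields of a line.
%% erlang:phash2/1 is only a stand-in for the real rolling hash.
line_hash(Line) ->
    lists:foldl(fun(Field, Acc) -> erlang:phash2({Acc, Field}) end,
                0, parse_line(Line)).
```

For example, csv_sketch:to_tuple(<<"key1,20120326,24.16,zz">>) yields a 4-tuple of binaries; converting the numeric fields to floats would be a further transform step.
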
>
> 3. The results I got are as follows (they are also available in the README)
>
>   E/Parse         Size (MB)   Read (ms)   Handle (ms)    Per Line (us)
>   -------------------------------------------------------------------
>   300K,  8 flds     23.41       91.722     350.000         1.16
>   300K, 24 flds     50.42      489.303     697.739         2.33
>   300K, 40 flds     77.43      780.296     946.003         3.15
>
>
>   ET/hash         Size (MB)   Read (ms)   Handle (ms)    Per Line (us)
>   -------------------------------------------------------------------
>   300K,  8 flds     23.41       91.722     384.598         1.28
>   300K, 24 flds     50.42      489.303     761.414         2.54
>   300K, 40 flds     77.43      780.296    1047.329         3.49
>
>
>   ET/tuple        Size (MB)   Read (ms)   Handle (ms)    Per Line (us)
>   -------------------------------------------------------------------
>   300K,  8 flds     23.41       91.722     228.306         0.76
>   300K, 24 flds     50.42      489.303     601.025         2.00
>   300K, 40 flds     77.43      780.296     984.676         3.28
>
>   ETL/ets         Size (MB)   Read (ms)   Handle (ms)    Per Line (us)
>   -------------------------------------------------------------------
>   300K,  8 flds     23.41       91.722    1489.543         4.50
>   300K, 24 flds     50.42      489.303    2249.689         7.50
>   300K, 40 flds     77.43      780.296    2519.401         8.39
>
>   ETL/pts         Size (MB)   Read (ms)   Handle (ms)    Per Line (us)
>   -------------------------------------------------------------------
>   300K,  8 flds     23.41       91.722     592.886         1.98
>   300K, 24 flds     50.42      489.303    1190.745         3.97
>   300K, 40 flds     77.43      780.296    1734.898         5.78
>
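
The Per Line column in the tables above appears to be the Handle time divided by the number of lines (300K). A small sketch of that arithmetic, plus one way to take such a measurement with timer:tc/1; the module and function names are hypothetical and are not part of the csv library.

```erlang
-module(per_line).
-export([us_per_line/2, timed/2]).

%% Convert a total time in milliseconds into microseconds per line.
us_per_line(TotalMs, Lines) when Lines > 0 ->
    TotalMs * 1000 / Lines.

%% Run Fun() under timer:tc/1 (which reports microseconds) and
%% return {Result, MicrosecondsPerLine}.
timed(Fun, Lines) when is_function(Fun, 0), Lines > 0 ->
    {Micros, Result} = timer:tc(Fun),
    {Result, Micros / Lines}.
```

For the 40-field E/Parse row, per_line:us_per_line(946.003, 300000) gives roughly 3.15, matching the table.
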
>
> The biggest frustration came from the ets table. It was too slow to load data into ets. Now, I believe that your results are slow because you are doing an ets load... In my final test, I swapped ets for a lightweight pts (process term storage), a process that holds data addressable by key. You can see that Max's original file is parsed / transformed and loaded into in-memory process-based storage pretty fast, at just 5.78 us per line. My original file is even faster, at 1.98 us per line.
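
To make the "process that holds data addressable by key" idea concrete, here is a minimal sketch. It is not the pts implementation from the repository: the module name pts_sketch and its API are hypothetical, and it uses maps, which need OTP 17 or later (the R15B builds discussed in this thread would need dict or similar instead).

```erlang
-module(pts_sketch).
-export([start/0, store/3, lookup/2, stop/1]).

%% Start a storage process holding an empty map.
start() ->
    spawn(fun() -> loop(#{}) end).

%% Asynchronous write: the caller does not wait for a reply.
store(Pid, Key, Value) ->
    Pid ! {store, Key, Value},
    ok.

%% Synchronous read: returns {ok, Value} or error, as maps:find/2 does.
lookup(Pid, Key) ->
    Ref = make_ref(),
    Pid ! {lookup, self(), Ref, Key},
    receive
        {Ref, Reply} -> Reply
    after 5000 ->
        {error, timeout}
    end.

stop(Pid) ->
    Pid ! stop,
    ok.

%% The process state is the map itself, kept on the process heap.
loop(Store) ->
    receive
        {store, Key, Value} ->
            loop(maps:put(Key, Value, Store));
        {lookup, From, Ref, Key} ->
            From ! {Ref, maps:find(Key, Store)},
            loop(Store);
        stop ->
            ok
    end.
```

Because store/3 is a fire-and-forget message, the loading side is not blocked by the storage process, which may be part of why a design like this loads quickly; that is a guess, not something measured here.
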
>
> 4. If you wish to replicate the results on your HW, I propose that you compile the csv library, generate the data sets, and then compile csv_example. Please let me know if you have trouble with any of those phases, and keep in mind that the +native flag was used...
>
>
> Regards, Dmitry
>
>
>
> On Mar 26, 2012, at 3:16 PM, Tim Watson wrote:
>
>> Hi Dmitry,
>>
>> On 25 March 2012 11:09, Dmitry Kolesnikov <dmkolesnikov@REDACTED> wrote:
>>> Hello,
>>>
>>> 1. Yes, I admit that I missed the example file from the beginning of the thread. I've put my set generators into the repository. You can use them like this:
>>> sh ./priv/dset line-ex.txt 1000 > ./priv/set-1M-ex.txt
>>> The first parameter is the line, the second is the number of kilolines to generate. line-16/48 are my original examples; line-ex.txt is from the original example.
>>>
>>
>> What is line-ex.txt supposed to contain - is it a single line from the
>> original example? - I can't find it in the repo.
>>
>>> 2. I am a bit curious why we are getting results of different magnitude, especially with Robert's run. My HW config is pretty close. So, I've built Erlang R15B from source with the following config:
>>> ./configure --prefix=/usr/local/otp-R15B --enable-threads --enable-smp-support --enable-kernel-poll --enable-sctp --enable-hipe --disable-dynamic-ssl-lib --enable-darwin-64bit --enable-m64-build --without-javac
>>
>> I'll rebuild that way, but essentially my build config looked
>> something like this (from config.log):
>>
>> configure:4674: running /bin/sh
>> '/Users/t4/.kerl/builds/r15b-64-hipe-smp/otp_src_R15B/lib/configure'
>> --prefix=/usr/local  '--cache-file=/dev/null' '--enable-hipe'
>> '--enable-darwin-64bit' '--enable-threads' '--enable-smp'
>> '--enable-kernel-poll'
>> 'ERL_TOP=/Users/t4/.kerl/builds/r15b-64-hipe-smp/otp_src_R15B'
>> --cache-file=/dev/null
>> --srcdir=/Users/t4/.kerl/builds/r15b-64-hipe-smp/otp_src_R15B/lib
>>
>>>
>>> 3. I've re-run the cases with/without the +native flag. The results are very interesting: we can parse 300K lines in less than a second, but less than a second for 1M rows is challenging:
>>>
>>> Parse time for the 300K data set:
>>> set-300K-16     653 ms / 2.18 us per line
>>> set-300K-48   1832 ms / 6.11 us per line
>>> set-300K-ex   2561 ms / 8.54 us per line
>>>
>>> Parse time for the 300K data set, +native:
>>> set-300K-16     277 ms / 0.92 us per line
>>> set-300K-48     672 ms / 2.24 us per line
>>> set-300K-ex     925 ms / 3.09 us per line
>>>
>>
>> I can't replicate this even with +native turned on (for both the csv
>> modules and the csv_example) and I'm bemused as you don't have 'that
>> much more' horsepower in your 4 cores than my mbpro's 2. Here's what I
>> get for parsing the original file - as you would expect, there is very
>> little difference in setting the segments to any number higher than
>> the # cpus:
>>
>> 5> csv_example:import("example.csv", 40).
>> size (MB): 74.958283
>> read (ms): 109.798000
>> parse (ms): 3239.264000
>> 266257
>> 6> csv_example:import("example.csv", 80).
>> size (MB): 74.958283
>> read (ms): 105.949000
>> parse (ms): 3231.541000
>> 270352
>> 7> csv_example:import("example.csv", 4).
>> size (MB): 74.958283
>> read (ms): 102.250000
>> parse (ms): 3275.317000
>> 274450
>> 8> csv_example:parse("example.csv", 4).
>> lines: 300001
>> size (MB): 74.958283
>> read (ms): 102.768000
>> parse (ms): 2737.098000
>> per line (us): 9.123630
>> ok
>> 9> csv_example:parse("example.csv", 44).
>> lines: 300001
>> size (MB): 74.958283
>> read (ms): 106.153000
>> parse (ms): 2775.781000
>> per line (us): 9.252572
>> ok
>> 10> csv_example:parse("example.csv", 2).
>> lines: 300001
>> size (MB): 74.958283
>> read (ms): 104.410000
>> parse (ms): 2758.367000
>> per line (us): 9.194526
>> ok
>> 11> csv_example:import("example.csv", 2).
>> size (MB): 74.958283
>> read (ms): 108.705000
>> parse (ms): 3390.453000
>> 278547
>>
>> How did you build Max's example file? I'm really struggling to
>> understand how I've got 2 seconds more processing time for such a
>> similar setup.
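
The observation above, that going beyond the number of CPUs in segments gains little, fits the usual pattern of splitting the input into N chunks and parsing each chunk in its own process. A rough sketch of that pattern follows. It is an assumption about the general approach, not the csv library's actual segmenting code: the module seg_sketch is hypothetical, and binary:split/3 with trim_all needs OTP 18 or later.

```erlang
-module(seg_sketch).
-export([count_fields/2]).

%% Count all fields in Bin using up to N worker processes.
count_fields(Bin, N) when is_binary(Bin), N > 0 ->
    Lines = binary:split(Bin, <<"\n">>, [global, trim_all]),
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() -> Parent ! {Ref, count(Chunk)} end),
                Ref
            end || Chunk <- chunk(Lines, N)],
    %% Collect one reply per worker, matching on each worker's reference.
    lists:sum([receive {Ref, Count} -> Count end || Ref <- Refs]).

%% Number of comma-separated fields in a list of lines.
count(Lines) ->
    lists:sum([length(binary:split(L, <<",">>, [global])) || L <- Lines]).

%% Split a list into at most N chunks of roughly equal size.
chunk(List, N) ->
    Size = max(1, (length(List) + N - 1) div N),
    chunk_by(List, Size).

chunk_by([], _Size) -> [];
chunk_by(List, Size) when length(List) =< Size -> [List];
chunk_by(List, Size) ->
    {Head, Tail} = lists:split(Size, List),
    [Head | chunk_by(Tail, Size)].
```

A natural choice for N is erlang:system_info(schedulers_online); with more workers than schedulers the extra processes just queue for the same cores.
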
>>
>>> Parse time for the 1M data set:
>>> set-300K-16   4406 ms / 2.20 us per line
>>> set-300K-48   6076 ms / 6.08 us per line
>>> set-300K-ex   8670 ms / 8.67 us per line
>>>
>>> Parse time for the 1M data set, +native:
>>> set-300K-16    1908 ms / 0.95 us per line
>>> set-300K-48    2293 ms / 2.29 us per line
>>> set-300K-ex    3119 ms / 3.12 us per line
>>>
>>> 4. It might be unfair to comment, but I have a good feeling that Max is evaluating/challenging some type of trading system ;-) Otherwise, why would you have such hard latency requirements and a dataset that looks like a price snapshot ;-) IMHO, 2 - 4 us parse time per row is acceptable for a "normal" web app.
>>
>> I agree your numbers seem perfectly reasonable, but I'm not able to
>> reproduce them. I'm going to try on a 2 core linux machine later on
>> and see how I get on.
>>
>>>
>>> Regards, Dmitry
>>>
>>> On Mar 25, 2012, at 5:57 AM, Tim Watson wrote:
>>>
>>>> On 25 Mar 2012, at 03:54, Tim Watson wrote:
>>>>> Please also do run this code on some other (faster) machines to see how much of the slowness is specific to my machine.
>>>>
>>>> Oh and can you please pass on the set and set2 text files so I can test them on my machine to baseline the comparative differences between our environments?
>>>>
>>>> Cheers!
>>>>
>>>> Tim
>>>
>