[erlang-questions] Speed of CSV parsing: how to read 1M of lines in 1 second

Mon Mar 26 08:50:57 CEST 2012

Interesting. Like Max said, would you please update or post the script
you're using to test Max's inputs, as I can't reproduce your results on
a comparable system? I've tried a MacBook Pro (Intel Core 2 Duo, 3.6GHz,
8GB RAM, twice the L2 cache) and also an HP with an Intel Core i7, 8GB
of fast RAM and a similar disk profile. I think the difference between
our results has to do with how the output is captured, and that's what
I'd like to see (and be able to reproduce). Is this where you're
utilising an ets table? That sounds like a sensible approach.
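
To make the question concrete, here is a minimal sketch of the two
capture styles I have in mind: just counting the parsed rows versus
accumulating them in an ets table. The module name capture_bench and
the naive parse_lines/1 helper are hypothetical stand-ins, not the csv
library's API, and the splitter ignores quoted or escaped delimiters.
It only illustrates why the capture strategy can change the timings.

```erlang
-module(capture_bench).
-export([run/1, count_rows/1, load_ets/1]).

%% Naive stand-in parser (an assumption, NOT the csv library's API):
%% splits on newlines and commas, drops empty lines, and has no
%% support for quoted or escaped delimiters.
parse_lines(Bin) ->
    Lines = binary:split(Bin, <<"\n">>, [global]),
    [binary:split(L, <<",">>, [global]) || L <- Lines, L =/= <<>>].

%% Style 1: only count the rows; nothing is retained afterwards.
count_rows(Bin) ->
    length(parse_lines(Bin)).

%% Style 2: accumulate every row in an ets table keyed by line number.
load_ets(Bin) ->
    Tab = ets:new(rows, [set, public]),
    lists:foldl(fun(Fields, N) ->
                        true = ets:insert(Tab, {N, Fields}),
                        N + 1
                end, 1, parse_lines(Bin)),
    Tab.

%% Time both styles on the same input. timer:tc/1 reports microseconds,
%% so dividing by the row count gives microseconds per line.
run(Bin) ->
    {CountUs, Rows} = timer:tc(fun() -> count_rows(Bin) end),
    {EtsUs, Tab} = timer:tc(fun() -> load_ets(Bin) end),
    Size = ets:info(Tab, size),
    true = ets:delete(Tab),
    [{rows, Rows}, {ets_size, Size},
     {count_us_per_line, CountUs / max(Rows, 1)},
     {ets_us_per_line, EtsUs / max(Rows, 1)}].
```

Calling capture_bench:run(Bin) on a binary read with file:read_file/1
would show how far apart the two styles are on a given machine.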

Cheers,
Tim

On 25 March 2012 22:19, Dmitry Kolesnikov <dmkolesnikov@REDACTED> wrote:
> Hello Tim,
>
> 1. Yes, you are right about nested/escaped delimiters. They have to be fixed.
>
> 2. Reference platform:
>     * MacMini, Lion Server,
>     * 1x Intel Core i7 (2 GHz), 4x cores
>     * L2 Cache 256KB per core
>     * L3 Cache 6MB
>     * Memory 4GB 1333 MHz DDR3
>     * Disk 750GB 7200rpm WDC WD7500BTKT-40MD3T0
>     * erlang R15B + native build of the library
>
> 3. I've tried to improve the documentation in the README and the csv.erl file. You can find them in the project: https://github.com/fogfish/csv
>
> I have also put an extra example in csv_example.erl that shows how to load the parsed data into an ets table.
>
> - Dmitry
>
>
> On Mar 25, 2012, at 5:52 PM, Tim Watson wrote:
>
>> Thanks for posting this detail Dmitry, it's very helpful. I'm going to
>> recompile with +native and give it a go. I also note that you're not
>> dealing with nested delimiters (e.g., where a field is enclosed in
>> single or double "quotation marks" and can therefore contain the
>> delimiters). If I can get the speed up to what you're seeing, then I
>> might just send you a couple of pull requests. :)
>>
>> Also I'm interested to know how you called the csv:parse/3 function
>> (or pparse instead) for these examples? Are you, for example,
>> accumulating the results, or are you just counting the occurrences?
>> Please let me know so I can replicate the same setup and test +native
>> for myself.
>>
>> Also would you be kind enough to let us all know what kind of hardware
>> your mac mini has, especially in terms of CPU and available memory?
>>
>> All in all, if we can clear up the differences I think parsing 300k in
>> just under a second without having to resort to a NIF is a very good
>> result and makes this the fastest csv parsing facility we've seen so
>> far!!!
>>
>> On 25 March 2012 11:09, Dmitry Kolesnikov <dmkolesnikov@REDACTED> wrote:
>>> Hello,
>>>
>>> 1. Yes, I have to admit that I missed the example file from the beginning of the thread. I've put my data set generators into the repository. You can use them like this:
>>> sh ./priv/dset line-ex.txt 1000 > ./priv/set-1M-ex.txt
>>> The first parameter is the line file, the second is the number of kilolines to generate. line-16/48 are my original examples; line-ex.txt is taken from the original example.
>>>
>>> 2. I am a bit curious why we are getting results of a different magnitude, especially with Robert's run. My HW config is pretty close. So, I've built Erlang R15B from sources with the following config:
>>> ./configure --prefix=/usr/local/otp-R15B --enable-threads --enable-smp-support --enable-kernel-poll --enable-sctp --enable-hipe --disable-dynamic-ssl-lib --enable-darwin-64bit --enable-m64-build --without-javac
>>>
>>> 3. I've re-run the cases with and without the +native flag. The results are very interesting: we can parse 300K lines in less than a second, but less than a second for 1M rows is challenging.
>>>
>>> Parse time for the 300K data set:
>>> set-300K-16     653 ms / 2.18 us per line
>>> set-300K-48   1832 ms / 6.11 us per line
>>> set-300K-ex   2561 ms / 8.54 us per line
>>>
>>> Parse time for the 300K data set, +native:
>>> set-300K-16     277 ms / 0.92 us per line
>>> set-300K-48     672 ms / 2.24 us per line
>>> set-300K-ex     925 ms / 3.09 us per line
>>>
>>> Parse time for the 1M data set:
>>> set-1M-16   4406 ms / 2.20 us per line
>>> set-1M-48   6076 ms / 6.08 us per line
>>> set-1M-ex   8670 ms / 8.67 us per line
>>>
>>> Parse time for the 1M data set, +native:
>>> set-1M-16    1908 ms / 0.95 us per line
>>> set-1M-48    2293 ms / 2.29 us per line
>>> set-1M-ex    3119 ms / 3.12 us per line
>>>
>>> 4. It might be unfair to comment, but I have a good feeling that Max is evaluating/challenging some type of trading system ;-) Otherwise, why would you have such hard latency requirements and a data set that looks like a price snapshot? ;-) IMHO, a 2 - 4 us parse time per row is acceptable for a "normal" web app.
>>>
>>> Regards, Dmitry
>>>
>>> On Mar 25, 2012, at 5:57 AM, Tim Watson wrote:
>>>
>>>> On 25 Mar 2012, at 03:54, Tim Watson wrote:
>>>>> Please also do run this code on some other (faster) machines to see how much of the slowness is specific to my machine.
>>>>
>>>> Oh and can you please pass on the set and set2 text files so I can test them on my machine to baseline the comparative differences between our environments?
>>>>
>>>> Cheers!
>>>>
>>>> Tim
>>>
>