<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Hello everybody,<div><br></div><div>While we are talking about CSV parsers, I've been working on one whose priority is not raw speed but the ability to parse huge files as a stream (a 60 GB CSV is no problem), or CSV sent over any wire (HTTP, whatever), using a callback that both processes each line and carries a general state for the whole parse.</div><div><br></div><div>You can take a look at <a href="https://github.com/refuge/ecsv">https://github.com/refuge/ecsv</a>. It's based on a 3-state machine.</div><div>The main goal was to be able to parse a stream and dispatch the processing across many processes.</div><div>So far it can parse double-quoted fields and different delimiters, and it is written in 100% Erlang.</div><div><br></div><div>The last benchmark I ran, though, showed something like 1700 rows/s (example here: <a href="http://friendpaste.com/58s9VAVswaczu4Zav49GEq">http://friendpaste.com/58s9VAVswaczu4Zav49GEq</a>).</div><div>But again, my goal was to withstand big files through a stream.</div><div><br></div><div>Feel free to take a look at it.</div><div><br></div><div>Thank you,</div><div><br></div><div>Nicolas Dufour</div><div><br><div><div>On Mar 25, 2012, at 10:52 AM, Tim Watson wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div>Thanks for posting this detail Dmitry, it's very helpful. I'm going to<br>recompile with +native and give it a go. I also note that you're not<br>dealing with nested delimiters (e.g., where a field is enclosed in<br>single or double "quotation marks" and can therefore contain the<br>delimiters). If I can get the speed up to what you're seeing, then I<br>might just send you a couple of pull requests. :)<br><br>Also I'm interested to know how you called the csv:parse/3 function<br>(or pparse instead) for these examples? 
Are you, for example,<br>accumulating the results, or are you just counting the occurrences?<br>Please let me know so I can replicate the same setup and test +native<br>for myself.<br><br>Also would you be kind enough to let us all know what kind of hardware<br>your mac mini has, especially in terms of CPU and available memory?<br><br>All in all, if we can clear up the differences I think parsing 300k in<br>just under a second without having to resort to a NIF is a very good<br>result and makes this the fastest csv parsing facility we've seen so<br>far!!!<br><br>On 25 March 2012 11:09, Dmitry Kolesnikov &lt;<a href="mailto:dmkolesnikov@gmail.com">dmkolesnikov@gmail.com</a>&gt; wrote:<br><blockquote type="cite">Hello,<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">1. Yes, I admit that I missed the example file from the beginning of the thread. I've put my set generators into the repository. You can use them like this:<br></blockquote><blockquote type="cite">sh ./priv/dset line-ex.txt 1000 > ./priv/set-1M-ex.txt<br></blockquote><blockquote type="cite">The first parameter is the line, the second is the number of kilolines to generate. line-16/48 are my original examples, line-ex.txt is from the original example.<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">2. I am a bit curious why we are getting results of different magnitudes, especially with Robert's run. My HW config is pretty close. So, I've built Erlang R15B from source with the following config:<br></blockquote><blockquote type="cite">./configure --prefix=/usr/local/otp-R15B --enable-threads --enable-smp-support --enable-kernel-poll --enable-sctp --enable-hipe --disable-dynamic-ssl-lib --enable-darwin-64bit --enable-m64-build --without-javac<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">3. 
I've re-run the cases with and without the +native flag. The results are very interesting: we can parse 300K lines in less than a second, but less than a second for 1M rows is challenging:<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">Parse time of data set is 300K:<br></blockquote><blockquote type="cite">set-300K-16 653 ms / 2.18 us per line<br></blockquote><blockquote type="cite">set-300K-48 1832 ms / 6.11 us per line<br></blockquote><blockquote type="cite">set-300K-ex 2561 ms / 8.54 us per line<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">Parse time of data set is 300K +native:<br></blockquote><blockquote type="cite">set-300K-16 277 ms / 0.92 us per line<br></blockquote><blockquote type="cite">set-300K-48 672 ms / 2.24 us per line<br></blockquote><blockquote type="cite">set-300K-ex 925 ms / 3.09 us per line<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">Parse time of data set is 1M:<br></blockquote><blockquote type="cite">set-300K-16 4406 ms / 2.20 us per line<br></blockquote><blockquote type="cite">set-300K-48 6076 ms / 6.08 us per line<br></blockquote><blockquote type="cite">set-300K-ex 8670 ms / 8.67 us per line<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">Parse time of data set is 1M +native:<br></blockquote><blockquote type="cite">set-300K-16 1908 ms / 0.95 us per line<br></blockquote><blockquote type="cite">set-300K-48 2293 ms / 2.29 us per line<br></blockquote><blockquote type="cite">set-300K-ex 3119 ms / 3.12 us per line<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">4. 
It might be unfair to comment, but I have a good feeling that Max is evaluating/challenging a type of trade system ;-) Otherwise, why would you have such hard latency requirements and a dataset that looks like a price snapshot ;-) IMHO, 2 - 4 us parse time per row is acceptable for a "normal" web app.<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">Regards, Dmitry<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">On Mar 25, 2012, at 5:57 AM, Tim Watson wrote:<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><blockquote type="cite">On 25 Mar 2012, at 03:54, Tim Watson wrote:<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Please also do run this code on some other (faster) machines to see how much of the slowness is specific to my machine.<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Oh and can you please pass on the set and set2 text files so I can test them on my machine to baseline the comparative differences between our environments?<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Cheers!<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Tim<br></blockquote></blockquote><blockquote type="cite"><br></blockquote>_______________________________________________<br>erlang-questions mailing list<br><a href="mailto:erlang-questions@erlang.org">erlang-questions@erlang.org</a><br>http://erlang.org/mailman/listinfo/erlang-questions<br></div></blockquote></div><br></div></body></html>