<div><span class="gmail_quote">On 9/24/07, <b class="gmail_sendername">Steve Vinoski</b> <<a href="mailto:vinoski@ieee.org">vinoski@ieee.org</a>> wrote:</span><blockquote class="gmail_quote" style="margin:0;margin-left:0.8ex;border-left:1px #ccc solid;padding-left:1ex">
<span class="q"><div><span class="gmail_quote">On 9/23/07, <b class="gmail_sendername">Bob Ippolito</b> <<a href="mailto:bob@redivi.com" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">bob@redivi.com
</a>> wrote:</span><blockquote class="gmail_quote" style="margin:0;margin-left:0.8ex;border-left:1px #ccc solid;padding-left:1ex">
On 9/24/07, Patrick Logan <<a href="mailto:patrickdlogan@gmail.com" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">patrickdlogan@gmail.com</a>> wrote:<br>> > > > <a href="http://www.tbray.org/ongoing" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">
http://www.tbray.org/ongoing</a>/When/200x/2007/09/22/Erlang
<br>> > > ><br>> > > > Tim Bray might raise some valid points here, even if he's slightly<br>> > > > biased by his background.<br>><br>> The good news is speeding up the i/o in erlang should be easier than
<br>> introducing better concurrency to another language.<br>><br><br>I've never had a problem with Erlang's general I/O performance, it's<br>probably just some implementation detail of direct file I/O that is
<br>the loser here. The obvious Erlang fast path to read lines is to spawn<br>cat and let the port machinery do all of the work for you. Here's an<br>example (including a copy of Tim's dataset):<br><br><a href="http://undefined.org/erlang" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">
http://undefined.org/erlang</a>/o10k.zip<br></blockquote></div><br></span><div>I posted a link in a comment to Tim's blog to an example that uses multiple processes to break down the expensive parts of processing Tim's dataset in parallel, and was able to achieve a pure Erlang approach that on my MacBook Pro equals your "cat" approach, and is much faster than "cat" on an 8-core machine. It's shown on my blog:
>
> <http://steve.vinoski.net/blog/2007/09/23/tim-bray-and-erlang/>
>
> It definitely speeds up as the number of cores goes up.
>
> I don't consider myself an Erlang expert and so welcome any suggestions
> for improving this. I'm guessing someone will see the two instances of
> "++" list handling and jump on that, but I tried it with the typical
> reverse approach and with flattening and neither was faster. However I
> am quite open to being enlightened. :-)

Just a follow-up: a couple of people have mentioned that I must be
missing the fact that Tim's sample dataset is 100 times smaller than
the real dataset. No, I'm not missing that, as I explained in a comment
on my blog.
</div><div><br class="webkit-block-placeholder"></div><div>I think it's obvious that any solution that counts on reading in the whole file at once, like mine does, will have trouble with the full dataset. For that, I think a combination of Bob's "cat port" and my multiprocess line analysis would yield the best of both worlds.

--steve