[erlang-questions] widefinder update

Steve Vinoski vinoski@REDACTED
Wed Oct 24 00:55:07 CEST 2007


On 10/23/07, Anders Nygren <anders.nygren@REDACTED> wrote:
>
> On 10/23/07, Steve Vinoski <vinoski@REDACTED> wrote:
> > On 10/23/07, Anders Nygren <anders.nygren@REDACTED> wrote:
> > > To summarize my progress on the widefinder problem
> > > A few days ago I started with Steve Vinoski's tbray16.erl
> > > As a baseline on my 1.66 GHz dual core Centrino
> > > laptop, Linux,
> > > tbray16
> > > real    0m7.067s
> > > user     0m12.377s
> > > sys     0m0.584s
> >
> > Anders, thanks for collecting and posting these. I've just performed a set
> > of new timings for all of them, as listed below. For each, I just ran this
> > command:
> >
> > time erl -smp -noshell -run <test_case> main o1000k.ap >/dev/null
> >
> > where "<test_case>" is the name of the tbray test case file. All were looped
> > ten times, and I took the best timing for each. All tests were done on my
> > 8-core 2.33 GHz dual Intel Xeon with 2 GB RAM Linux box, in a local
> > (non-NFS) directory.
> >
>
> I don't keep track of the finer details of different CPUs, but I have
> a vague memory that the 8-core Xeon is really 2 4-core CPUs
> on one chip. Is that correct?


Yes, I believe so.

> The reason I am asking is that I cannot figure out why your
> measurements have shorter real times than mine, but more
> than twice the user time.


It's because the user time includes CPU time on all the cores. More cores,
and more things happening on those cores, means more CPU time and thus more
user time. Tim saw the same phenomenon on his T5120 and blogged about it
here:

<http://www.tbray.org/ongoing/When/200x/2007/10/09/Niagara-2-T2-T5120>
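
As a rough sanity check, user time divided by real time approximates how many
cores are kept busy on average. For your tbray16 baseline above that's about
12.38 / 7.07, or roughly 1.75, so both cores of the Centrino were working most
of the time; on the 8-core box the same ratio runs much higher, which is why
the user time more than doubles even as the real time drops.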

> Also it does not seem to scale so well up to 8 cores.
> Steve's best time is 0m1.546s and mine was 0m1.992s.


The default settings in the code are probably not ideal for the 8-core box.

> Steve, can you also do some tests on tbray_blockread using
> different numbers of worker processes? A smaller block
> size means that we start using all the cores earlier.


I ran a series of tests with different block sizes, and I found that for the
8-core box, dividing the file into 1024 chunks (for this file, that means a
block size of 230606 bytes) produced the best time:

real    0m1.103s
user    0m6.651s
sys     0m0.492s

That's pretty darn fast. :-) Smaller chunks are probably slower because
there's more result collecting and merging to do, while larger chunks are
slower because parallelism is reduced.
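
For anyone who wants to experiment with the chunking, here's a rough sketch of
the block-splitting idea. This is not the actual tbray_blockread code:
process_block/1 and the merge step are placeholders, and it ignores the need to
re-align block boundaries to newlines, which the real code has to deal with.

-module(block_sketch).
-export([run/2]).
-include_lib("kernel/include/file.hrl").

%% Split File into NumBlocks roughly equal blocks, scan each block in
%% its own worker process, and merge the per-block counts.
run(File, NumBlocks) ->
    {ok, #file_info{size = Size}} = file:read_file_info(File),
    BlockSize = Size div NumBlocks,
    Parent = self(),
    Pids = [spawn(fun() ->
                          %% each worker opens its own raw handle and
                          %% reads only its slice of the file
                          {ok, Fd} = file:open(File, [read, raw, binary]),
                          Len = block_len(N, NumBlocks, BlockSize, Size),
                          {ok, Bin} = file:pread(Fd, N * BlockSize, Len),
                          file:close(Fd),
                          Parent ! {self(), process_block(Bin)}
                  end) || N <- lists:seq(0, NumBlocks - 1)],
    merge([receive {Pid, R} -> R end || Pid <- Pids]).

%% the last block also picks up the leftover bytes
block_len(N, NumBlocks, BlockSize, Size) when N =:= NumBlocks - 1 ->
    Size - N * BlockSize;
block_len(_N, _NumBlocks, BlockSize, _Size) ->
    BlockSize.

%% placeholder for the per-block matching work
process_block(_Bin) ->
    dict:new().

%% fold the per-block count dicts into one
merge(Dicts) ->
    lists:foldl(fun(D, Acc) ->
                        dict:merge(fun(_K, A, B) -> A + B end, D, Acc)
                end, dict:new(), Dicts).

Calling block_sketch:run("o1000k.ap", 1024) splits the file into the same 1024
blocks I used for the timing above.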

I can't wait to see this thing run on Tim's T5120.

BTW, I got a comment on my blog today from someone who essentially said I
was making Erlang look bad by applying it to a problem for which it's not a
good fit. My response was that I didn't agree; Tim's original goal was to
maximize the use of a multicore system for solving the Wide Finder, and
Erlang now does that better than anything else I've seen so far. Does anyone
in the Erlang community agree with that commenter that the Wide Finder project
has made Erlang look bad?

--steve