[erlang-questions] widefinder update

Wed Oct 24 01:42:28 CEST 2007

On 10/23/07, Steve Vinoski <vinoski@REDACTED> wrote:
> On 10/23/07, Anders Nygren <anders.nygren@REDACTED> wrote:
> > On 10/23/07, Steve Vinoski <vinoski@REDACTED> wrote:
> > > On 10/23/07, Anders Nygren <anders.nygren@REDACTED> wrote:
> > > > To summarize my progress on the widefinder problem
> > > > A few days ago I started with Steve Vinoski's tbray16.erl
> > > > As a baseline on my 1.66 GHz dual core Centrino
> > > > laptop, Linux,
> > > > tbray16
> > > > real    0m7.067s
> > > > user     0m12.377s
> > > > sys     0m0.584s
> > >
> > > Anders, thanks for collecting and posting these. I've just performed a
> > > set of new timings for all of them, as listed below. For each, I just
> > > ran this command:
> > >
> > > time erl -smp -noshell -run <test_case> main o1000k.ap >/dev/null
> > >
> > > where "<test_case>" is the name of the tbray test case file. All were
> > > looped ten times, and I took the best timing for each. All tests were
> > > done on my 8-core 2.33 GHz dual Intel Xeon with 2 GB RAM Linux box, in
> > > a local (non-NFS) directory.
> > >
> >
> > I don't keep track of the finer details of different CPUs, but I have
> > a vague memory that the 8-core Xeon is really two 4-core CPUs
> > on one chip, is that correct?
>
> Yes, I believe so.
> > The reason I am asking is that I can not figure out why Your
> > measurements have shorter real times than mine, but more
> > than twice the user time.
>
> It's because the user time includes CPU time on all the cores. More cores,
> and more things happening on those cores, means more CPU time and thus more
> user time. Tim saw the same phenomenon on his T5120 and blogged about it
> here:
>

But user time is supposed to be the time used executing instructions
for the process and its children, i.e. the CPU time used to solve
the task. So the user time should ideally remain constant when
more cores are added, and the real time should ideally be divided
by the number of cores.
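
As a rough illustration of that ideal, here is a small Python sketch (not
the Erlang code from this thread) applied to the tbray16 baseline figures
quoted above. The function names are made up for the example, and the
"ideal" assumes the work splits perfectly evenly over the cores.

```python
def ideal_real_time(user_s, cores):
    """Ideal wall-clock time if the CPU work splits evenly over all cores."""
    return user_s / cores


def avg_busy_cores(real_s, user_s, sys_s):
    """Average number of cores kept busy: total CPU time over wall-clock time."""
    return (user_s + sys_s) / real_s


# tbray16 baseline on the dual-core laptop: real 7.067 s, user 12.377 s, sys 0.584 s
real_s, user_s, sys_s = 7.067, 12.377, 0.584
print(f"ideal real time on 2 cores: {ideal_real_time(user_s, 2):.2f} s")
print(f"average busy cores: {avg_busy_cores(real_s, user_s, sys_s):.2f}")
```

The measured real time of 7.067 s is somewhat above the ideal, which is
what one would expect when part of the run is not parallel.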

But Tim also said:
"Further poking dug up the answer: it seems that the hardware doesn't
tell the OS how it's sharing out the cycles among the threads that
it has runnable at any point in time. So Solaris just credits them
with user CPU time whenever they're in Run state. The results will be
correct when you have up to sixteen threads staying runnable; above
that they get funky."

So basically You cannot trust the user time on the T1 or T2. But I
don't think that applies to other processors.

> <http://www.tbray.org/ongoing/When/200x/2007/10/09/Niagara-2-T2-T5120>
> > Also it does not seem to scale so well up to 8 cores.
> > Steve's best time is 0m1.546s and mine was 0m1.992s.
>
> The default settings in the code are probably not ideal for the 8-core box.
> > Steve, can You also do some tests on tbray_blockread using
> > different numbers of worker processes. Since smaller block
> > size means that we start using all the cores earlier.
>
>
> I ran a series of tests of different block sizes, and I found that for the 8
> core, dividing the file into 1024 chunks (for this file, this means a block
> size of 230606 bytes) produced the best time:

Yes, I also got the impression that there was an optimum around
a 200k block size. But the sample-to-sample variation is large enough
that I was not sure.
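
For reference, the relation between the number of chunks and the block size
can be sketched in a few lines of Python (not the Erlang code used for the
measurements). The file size below is an assumption, taken as 1024 blocks
of 230606 bytes since the exact size of o1000k.ap is not given here, and
block_ranges is a made-up helper. It ignores the fact that a real block
reader also has to deal with lines that straddle a block boundary.

```python
def block_ranges(file_size, num_blocks):
    """Return (offset, length) pairs covering file_size bytes in at most num_blocks blocks."""
    block_size = -(-file_size // num_blocks)  # ceiling division
    ranges = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        ranges.append((offset, length))
        offset += length
    return ranges


# Assumed file size: 1024 blocks of 230606 bytes (the real size is not stated)
FILE_SIZE = 1024 * 230606
ranges = block_ranges(FILE_SIZE, 1024)
print(len(ranges), ranges[0], ranges[-1])
```

Halving the block size doubles the number of (offset, length) pairs, and
with it the number of units of work that have to be handed out.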

>
>
> real    0m1.103s
> user    0m6.651s
> sys     0m0.492s
>
> Which is pretty darn fast. :-) Smaller chunk sizes are slower probably
> because there's more result collecting and merging to do,

There is no merging or collecting of results in the ets based versions.
I think the slowdown for even smaller blocks is because of the
scheduling of, and switching between, all the worker processes.
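
To show what "no merging" means, here is a loose Python analogue (an
illustration only, not the tbray code): every worker bumps counters in one
shared table, so nothing is left to collect at the end. The Counter stands
in for the ets table, the lock is needed only because Python gives no
per-operation atomicity the way a single ets call such as
ets:update_counter/3 does, and the keys are made up.

```python
import threading
from collections import Counter

shared = Counter()        # stands in for the shared ets table
lock = threading.Lock()   # Python needs an explicit lock around the update


def worker(keys):
    """Count every key of one block directly in the shared table."""
    for key in keys:
        with lock:
            shared[key] += 1


# Made-up keys, one list per block of the input
blocks = [["a", "b", "a"], ["b", "c"], ["a"]]
threads = [threading.Thread(target=worker, args=(block,)) for block in blocks]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The totals are already complete; there are no per-worker results to merge
print(dict(shared))
```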

> while larger chunk
> sizes are slower because parallelism is reduced.
>
> I can't wait to see this thing run on Tim's T5120.
>
> BTW, I got a comment on my blog today from someone who essentially said I
> was making Erlang look bad by applying it to a problem for which it's not a
> good fit. My response was that I didn't agree; Tim's original goal was to
> maximize the use of a multicore system for solving the Wide Finder, and
> Erlang now does that better than anything else I've seen so far. Does anyone
> in the Erlang community agree with the person who made that comment that
> this Wide Finder project has made Erlang look bad?

I think Tim was unnecessarily strong in his initial comments.
He did a naive beginner solution, and when it performed badly he
said that Erlang sucks (more or less), instead of asking how to
improve it.

It has been an interesting exercise, and quite useful for me since
I am currently looking at a system that needs to process ~1 terabyte
of log files per day. Fortunately they are BER coded :)

/Anders