[erlang-questions] widefinder analysis

Fri Nov 2 23:58:50 CET 2007

Hi

After spending to much time on the widefinder problem
I thought I should make a summary of the situation and
try to get some help from help from VM and CPU gurus
to figure out why it does not scale up properly with
increasing number of CPU cores.

The discussion here will be about my wfinder8.erl, that
is attached, for anyone to test it. I would love to get
some results from machines with more than 2 cores, and
preferably some that are not XEON 4 or 8 cores.

The basic algorithm
-------------------------------
The main process just checks the number of scheduler
threads, and the size of the file. It then starts
one worker process for each scheduler thread assigning
it a part of the file to process. Then it just waits
for the results from the workers.

Each worker process
1- opens the file
2- reads a block of 200 k bytes
3- start a new sub worker process that processes the block
4- reads the next block
5- wait for the sub worker to terminate
6- goto 3

Intuitively this feel like a reasonable way to solve
the problem. But the results do not fit my intuition :)

Results
-------
I have tested this on 3 different machines
1- my 1.66 GHz dual core centrino laptop
2- Caoyuan's 4-core XEON
3- Steve Vinoski's 8-core XEON

An earlier version has been tried by Tim Bray on a
Sun T2150. I have asked him to test the latest one
too but I do not have any results yet.

My laptop, 1.66GHz dual core centrino
-------------------------------------

1million lines
schedulers real   user   sys    total    total/real
1          1.731  1.588  0.132  1.72000   0.994 (1)
2          0.977  1.612  0.180  1.79200   1.83  (2)

(1) total/real ~1, ;I take that to mean that I get
    100% CPU usage, on one core.
(2) total/real 1.83, ideally this should be 2, so we
    are not utilizing the CPU to the max. But I suppose
    that it is OK since the OS and other stuff on the
    machine must get some CPU time also.
    I run OpenSuSE 10.3/KDE and it seems to idle at
    ~5% cpu load.

Speedup from 1 to 2 schedulers 1.77, not too bad.

Different file sizes

                  Real Time (s)      Speedup    Throughput MB/s
Lines  Bytes    1 sched   2 sched            1 sched   2 sched
10k    	 2M      0.169     0.167     1.01      11.83     11.97
100k    20M      0.305     0.240     1.27      65.57     83.33
1M     192M      1.643     0.999     1.64     116.86    192.19
2M     384M      3.235     1.888     1.71     118.70    203.39
4M     767M      6.118     3.269     1.87     125.37    234.63

So it seems like the constant time for running this is ~ 0.165s

Caoyuan's 4-CPU Intel Xeon 2.80GHz
----------------------------------
1million lines
schedulers real   user   sys    total    total/real
1          1.351  1.128  0.224  1.352    1.00
2          0.836  1.336  0.208  1.544    1.85 (1)
4          0.796  2.156  0.468  2.624    3.30 (2)

(1) The CPU utilization should be closer to two, since
    there are free cpus/cores for other tasks on the
    machine.
(2) Two thing to take note of here. 1, The user and system
    times have increased significantly. 2, The CPU
    utilization is way to low.

Speedup from
1 to 2 schedulers  1.62,  Similar to my laptop
2 to 4 schedulers  1.05,  Pathetic!!
1 to 4 schedulers  1.70   Not to impressive

So why does it not perform better?
We will see more of this below in the section about
the 8 core XEON.

Different file sizes

File Size         Real Time (s)        Speedup       Throughput MB/s
Lines Bytes  1 sched 2 sched 4 sched   1-2   2-4   1 sched 2 sched 4 sched
10k     2M    0.150   0.149   0.148    1.00  1.00   13.33   13.42   13.51
1M    192M    1.351   0.836   0.796    1.62  1.05  142.12  229.66  241.20
5M    926M    5.742   3.559   3.512    1.61  1.01  161.27  260.18  263.67

Steve's 8-core XEON
-------------------
Dual 2.33 GHz Intel Xeon (8 cores total) with 8
GB of RAM, a load average of 0.04, running RedHat Enterprise Linux 4.
Erlang is R11B-5.

1,167,948 lines
schedulers real   user   sys    total  total/real
1          1.277  1.049  0.243  1.292  1.01
2          1.124  1.795  0.334  2.129  1.89 (1)
4          0.810  1.936  0.759  2.695  3.33
8          0.724  2.429  1.402  3.831  5.29

(1) Again a large increases in user and system times, this
    time for each increase in schedulers.

Does the fact that it is a dual CPU machine explain why
the user and sys times increase so much when increasing
the number of schedulers?

Speedup from
1 to 2 schedulers  1.14
2 to 4 schedulers  1.38
4 to 8 schedulers  1.12

1 to 4 schedulers  1.58
1 to 8 schedulers  1.76

Different file sizes

File Size     Real Time (s)   Throughput MB/s
Lines Bytes   8 sched         8 sched
10L     2M    0.135           14.81
1M    225M    0.64            351.56
2M    450M    1.017           442.48
4M    900M    1.732           519.63
8M   1800M    3.183           565.50
16M  3600M    5.998           600.20

This machine really gets going when the files grows.

Tim's Sun T2150
---------------
This is based on Sun's new Niagara T2 processor
8 cores
2 integer units/core
8 threads/core
1 4 MB L2 cache

Solaris and erlang sees this as an 64 processor machine.

This data is for a different version of my program.

The test was using a file with
4,625,236 lines
971,538,252 bytes

real   user   system  total   total/real
6.46   34.07  8.02    42.09     6.52

File size  Throughput MB/s
926M          143

So again it seems like it is not able to use all available
CPU cycles.

Speculation
-----------
The speedup when going from 1 to 2 schedulers on a dual core
machine was pretty good, 1.77.

But on the 4 or 8 core machines the speedup from 1 to 4(8) cores
was only about 1.7-1.78. Which is not very impressive.

I really do not know enough about modern CPUs, but I will
not let that stop me from making a wild guess.

The Boyer-Moore search algorithm used in this case only
inspects, on average, every twentieth byte.

Is it possible that when there are 4, 8 or more cores
sharing one cache, and all cores runs code that only
looks at every twentieth byte that we get very poor
cache utilization?

So to test that I modified wfinder8.erl to count
space and newline characters. And on Steve's 8 core
machine we got the following speedups with different
number of schedulers.

cores speedup
1-2    1.08
2-4    1.72
4-8    1.52

1-4    1.86
1-8    2.83

Are there any other things that can explain why it seems
to scale so poorly?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: wfinder8.erl
Type: text/x-erlang
Size: 6207 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20071102/a885727a/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: wfbm4_ets5.erl
Type: text/x-erlang
Size: 2748 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20071102/a885727a/attachment-0001.bin>