[erlang-questions] widefinder analysis

Tue Nov 6 00:30:46 CET 2007

Observations:

1) All tested input file sizes should be cacheable by the kernel.
- the file has to be read from the hard drive only the first time it is read.

Compare with the following [647 MB file on my 2 GB Athlon]:
# time md5sum openSUSE-10.3-GM-Addon-Lang-x86_64.iso

real    0m18.749s
user    0m3.828s
sys     0m1.044s

# time md5sum openSUSE-10.3-GM-Addon-Lang-x86_64.iso
real    0m3.234s
user    0m2.440s
sys     0m0.508s

# time md5sum openSUSE-10.3-GM-Addon-Lang-x86_64.iso
real    0m3.307s
user    0m2.768s
sys     0m0.500s

But the same read of the DVD.iso cannot be cached efficiently, as it
is bigger than my RAM, and will always take about 1m20s.
# time md5sum openSUSE-10.3-GM-DVD-x86_64.iso

real    1m19.695s
user    0m25.338s
sys     0m8.249s
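
The same cold-versus-warm effect can be checked from inside Erlang. This is
only a minimal sketch: the module name wf_cache_sketch and the function
time_reads/2 are made up for illustration, and it assumes the file fits in
RAM, since file:read_file/1 loads the whole file into one binary.

```erlang
-module(wf_cache_sketch).
-export([time_reads/2]).

%% Read File N times in a row and return one {Bytes, Microseconds}
%% tuple per pass. Hypothetical helper, not part of any widefinder code.
time_reads(File, N) when is_integer(N), N > 0 ->
    [time_read(File) || _ <- lists:seq(1, N)].

%% timer:tc/3 returns {ElapsedMicroseconds, ReturnValue}.
time_read(File) ->
    {Micros, {ok, Bin}} = timer:tc(file, read_file, [File]),
    {byte_size(Bin), Micros}.
```

If the file was not already in the file cache, the first pass should be
clearly slower than the following ones, just like with md5sum above.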

2) Reading tbray16.erl, it looks like it might cause several copies of the
data inside Erlang. At least one full copy is made with
 bfile:fread(F, filelib:file_size(File))
but isn't that later split into Blksize * schedulers = file_size in
split_and_find?

=> recommended RAM size > 3 * input file size [for input data only!]

If it sent the Bin, Offset and Size to wfbm4, then wfbm4 could also
search for the beginning of the string by itself (decreasing Offset and
increasing Size), as wbr4:find could be made to skip all non-terminated lines.

=> recommended RAM size > 2 * input file size
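
A rough sketch of that idea follows. It is not code from tbray16.erl or
wfbm4: the names wf_align_sketch, align/3 and view/3 are hypothetical, and
it assumes a modern OTP with the binary module (binary:at/2, binary:part/3).
Each worker gets the shared Bin plus a raw Offset and Size, moves its start
back to the beginning of its first line, and drops a trailing line that is
not terminated inside its range. The worker whose range contains that
line's newline picks it up.

```erlang
-module(wf_align_sketch).
-export([align/3, view/3]).

%% Turn a raw {Offset, Size} range into a line-aligned {Start, Length}.
%% A line belongs to the range that contains its terminating newline, so
%% a final line without a trailing newline is skipped (an assumption).
align(Bin, Offset, Size) when is_binary(Bin), Offset >= 0, Size >= 0,
                              Offset + Size =< byte_size(Bin) ->
    Start = line_start(Bin, Offset),
    End = last_line_end(Bin, Start, Offset + Size),
    {Start, End - Start}.

%% Sub-binary covering only the complete lines of the range.
view(Bin, Offset, Size) ->
    {Start, Len} = align(Bin, Offset, Size),
    binary:part(Bin, Start, Len).

%% Move Pos backwards until it is 0 or just after a newline.
line_start(_Bin, 0) -> 0;
line_start(Bin, Pos) ->
    case binary:at(Bin, Pos - 1) of
        $\n -> Pos;
        _ -> line_start(Bin, Pos - 1)
    end.

%% Move End backwards until it is just after a newline or reaches Start.
last_line_end(_Bin, Start, End) when End =< Start -> Start;
last_line_end(Bin, Start, End) ->
    case binary:at(Bin, End - 1) of
        $\n -> End;
        _ -> last_line_end(Bin, Start, End - 1)
    end.
```

binary:part/3 returns a sub-binary that refers to the original data, so
for a large Bin this should not need a per-worker copy of the input.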

Running "vmstat 1" while running widefinder might show some
interesting information. [On the first run it might even do some swapping
(si/so) to make room for the file cache, but after that it should not be
necessary. On the second run there should be no I/O (bi/bo), as the input
file should fit in the file cache.]

3) The processors have internal caches as well. Some processors share this
cache, others do not. Linux (and other OS) kernels try to keep a
process/thread on the same processor cache - but does Erlang try to keep
an Erlang process on the same thread?

/RogerL

On fredag 02 november 2007, Anders Nygren wrote:
> Hi
>
> After spending too much time on the widefinder problem
> I thought I should make a summary of the situation and
> try to get some help from VM and CPU gurus
> to figure out why it does not scale up properly with
> increasing number of CPU cores.
>
> The discussion here will be about my wfinder8.erl, that
> is attached, for anyone to test it. I would love to get
> some results from machines with more than 2 cores, and
> preferably some that are not XEON 4 or 8 cores.
>
> The basic algorithm
> -------------------------------
> The main process just checks the number of scheduler
> threads, and the size of the file. It then starts
> one worker process for each scheduler thread assigning
> it a part of the file to process. Then it just waits
> for the results from the workers.
>
> Each worker process
> 1- opens the file
> 2- reads a block of 200 k bytes
> 3- starts a new sub worker process that processes the block
> 4- reads the next block
> 5- waits for the sub worker to terminate
> 6- goto 3
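
The steps above can be sketched roughly as follows. This is not wfinder8.erl
itself: the module name wf_worker_sketch is made up, the worker reads the
whole file rather than an assigned part, and process_block/1 just counts
newlines as a stand-in for the real search. Lines that span two blocks are
ignored here, which real matching code would have to handle.

```erlang
-module(wf_worker_sketch).
-export([run/1, run/2]).

-define(BLOCK_SIZE, (200 * 1024)).   % "200 k bytes", read here as 200 KiB

run(File) -> run(File, ?BLOCK_SIZE).

run(File, BlockSize) ->
    {ok, Fd} = file:open(File, [read, raw, binary]),       % 1- open the file
    try
        loop(Fd, BlockSize, file:read(Fd, BlockSize), 0)   % 2- read a block
    after
        file:close(Fd)
    end.

loop(_Fd, _BlockSize, eof, Total) ->
    Total;
loop(Fd, BlockSize, {ok, Block}, Total) ->
    %% 3- start a sub worker; it reports its result as its exit reason
    {Pid, Ref} = spawn_monitor(fun() -> exit({done, process_block(Block)}) end),
    %% 4- read the next block while the sub worker is busy
    Next = file:read(Fd, BlockSize),
    %% 5- wait for the sub worker to terminate, then 6- go to 3
    receive
        {'DOWN', Ref, process, Pid, {done, N}} ->
            loop(Fd, BlockSize, Next, Total + N);
        {'DOWN', Ref, process, Pid, Reason} ->
            exit({sub_worker_failed, Reason})
    end.

%% Placeholder for the real per-block work.
process_block(Block) -> count_newlines(Block, 0).

count_newlines(<<$\n, Rest/binary>>, N) -> count_newlines(Rest, N + 1);
count_newlines(<<_, Rest/binary>>, N) -> count_newlines(Rest, N);
count_newlines(<<>>, N) -> N.
```

The point of the structure is that reading block N+1 overlaps with
processing block N, so each worker keeps at most two blocks in flight.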
>
> Intuitively this feels like a reasonable way to solve
> the problem. But the results do not fit my intuition :)
>
> Results
> -------
> I have tested this on 3 different machines
> 1- my 1.66 GHz dual core centrino laptop
> 2- Caoyuan's 4-core XEON
> 3- Steve Vinoski's 8-core XEON
>
> An earlier version has been tried by Tim Bray on a
> Sun T2150. I have asked him to test the latest one
> too but I do not have any results yet.
>
>
> My laptop, 1.66GHz dual core centrino
> -------------------------------------
>
> 1 million lines
> schedulers real   user   sys    total    total/real
> 1          1.731  1.588  0.132  1.720    0.994 (1)
> 2          0.977  1.612  0.180  1.792    1.83  (2)
>
> (1) total/real ~1; I take that to mean that I get
>     100% CPU usage, on one core.
> (2) total/real 1.83, ideally this should be 2, so we
>     are not utilizing the CPU to the max. But I suppose
>     that it is OK since the OS and other stuff on the
>     machine must get some CPU time also.
>     I run OpenSuSE 10.3/KDE and it seems to idle at
>     ~5% cpu load.
>
> Speedup from 1 to 2 schedulers 1.77, not too bad.
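
For reference, the derived columns in these tables can be recomputed as
below. A small sketch only; wf_metrics_sketch and its function names are
hypothetical, and the inputs are the real/user/sys figures from time(1).

```erlang
-module(wf_metrics_sketch).
-export([utilization/3, speedup/2, throughput/2]).

%% total/real: CPU seconds used per second of wall-clock time.
utilization(User, Sys, Real) -> (User + Sys) / Real.

%% Ratio of wall-clock times, e.g. 1 scheduler versus 2 schedulers.
speedup(RealBefore, RealAfter) -> RealBefore / RealAfter.

%% MB processed per second of wall-clock time.
throughput(MBytes, Real) -> MBytes / Real.
```

With the laptop numbers, speedup(1.731, 0.977) is about 1.77 and
utilization(1.612, 0.180, 0.977) is about 1.83.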
>
> Different file sizes
>
>                   Real Time (s)      Speedup    Throughput MB/s
> Lines  Bytes    1 sched   2 sched            1 sched   2 sched
> 10k      2M      0.169     0.167     1.01      11.83     11.97
> 100k    20M      0.305     0.240     1.27      65.57     83.33
> 1M     192M      1.643     0.999     1.64     116.86    192.19
> 2M     384M      3.235     1.888     1.71     118.70    203.39
> 4M     767M      6.118     3.269     1.87     125.37    234.63
>
> So it seems like the constant time for running this is ~ 0.165s
>
> Caoyuan's 4-CPU Intel Xeon 2.80GHz
> ----------------------------------
> 1 million lines
> schedulers real   user   sys    total    total/real
> 1          1.351  1.128  0.224  1.352    1.00
> 2          0.836  1.336  0.208  1.544    1.85 (1)
> 4          0.796  2.156  0.468  2.624    3.30 (2)
>
> (1) The CPU utilization should be closer to two, since
>     there are free cpus/cores for other tasks on the
>     machine.
> (2) Two things to take note of here. 1, The user and system
>     times have increased significantly. 2, The CPU
>     utilization is way too low.
>
> Speedup from
> 1 to 2 schedulers  1.62,  Similar to my laptop
> 2 to 4 schedulers  1.05,  Pathetic!!
> 1 to 4 schedulers  1.70   Not too impressive
>
> So why does it not perform better?
> We will see more of this below in the section about
> the 8 core XEON.
>
> Different file sizes
>
> File Size         Real Time (s)        Speedup       Throughput MB/s
> Lines Bytes  1 sched 2 sched 4 sched   1-2   2-4   1 sched 2 sched 4 sched
> 10k     2M    0.150   0.149   0.148    1.00  1.00   13.33   13.42   13.51
> 1M    192M    1.351   0.836   0.796    1.62  1.05  142.12  229.66  241.20
> 5M    926M    5.742   3.559   3.512    1.61  1.01  161.27  260.18  263.67
>
> Steve's 8-core XEON
> -------------------
> Dual 2.33 GHz Intel Xeon (8 cores total) with 8
> GB of RAM, a load average of 0.04, running RedHat Enterprise Linux 4.
> Erlang is R11B-5.
>
> 1,167,948 lines
> schedulers real   user   sys    total  total/real
> 1          1.277  1.049  0.243  1.292  1.01
> 2          1.124  1.795  0.334  2.129  1.89 (1)
> 4          0.810  1.936  0.759  2.695  3.33
> 8          0.724  2.429  1.402  3.831  5.29
>
> (1) Again a large increase in user and system times, this
>     time for each increase in schedulers.
>
> Does the fact that it is a dual CPU machine explain why
> the user and sys times increase so much when increasing
> the number of schedulers?
>
> Speedup from
> 1 to 2 schedulers  1.14
> 2 to 4 schedulers  1.38
> 4 to 8 schedulers  1.12
>
> 1 to 4 schedulers  1.58
> 1 to 8 schedulers  1.76
>
>
> Different file sizes
>
> File Size     Real Time (s)   Throughput MB/s
> Lines Bytes   8 sched         8 sched
> 10k     2M    0.135           14.81
> 1M    225M    0.64            351.56
> 2M    450M    1.017           442.48
> 4M    900M    1.732           519.63
> 8M   1800M    3.183           565.50
> 16M  3600M    5.998           600.20
>
> This machine really gets going when the files grow.
>
> Tim's Sun T2150
> ---------------
> This is based on Sun's new Niagara T2 processor
> 8 cores
> 2 integer units/core
> 8 threads/core
> 1 4 MB L2 cache
>
> Solaris and Erlang see this as a 64-processor machine.
>
> This data is for a different version of my program.
>
> The test was using a file with
> 4,625,236 lines
> 971,538,252 bytes
>
> real   user   system  total   total/real
> 6.46   34.07  8.02    42.09     6.52
>
> File size  Throughput MB/s
> 926M          143
>
> So again it seems like it is not able to use all available
> CPU cycles.
>
> Speculation
> -----------
> The speedup when going from 1 to 2 schedulers on a dual core
> machine was pretty good, 1.77.
>
> But on the 4 or 8 core machines the speedup from 1 to 4(8) cores
> was only about 1.7-1.78, which is not very impressive.
>
> I really do not know enough about modern CPUs, but I will
> not let that stop me from making a wild guess.
>
> The Boyer-Moore search algorithm used in this case only
> inspects, on average, every twentieth byte.
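
To get a feeling for how such a search can skip most of the input, here is
a small sketch of the Horspool simplification of Boyer-Moore that also
counts byte comparisons. It is not the search code from the attachment:
bmh_sketch and find/2 are hypothetical names, and it assumes a modern OTP
(maps, binary:at/2).

```erlang
-module(bmh_sketch).
-export([find/2]).

%% Returns {Position, Comparisons} for the first match of Pat in Bin,
%% or {nomatch, Comparisons} if there is none.
find(Bin, Pat) when is_binary(Bin), is_binary(Pat), byte_size(Pat) > 0 ->
    M = byte_size(Pat),
    scan(Bin, Pat, M, shift_table(Pat, M), 0, 0).

%% For every pattern byte except the last one: distance from its
%% right-most occurrence to the end of the pattern.
shift_table(Pat, M) ->
    lists:foldl(fun(I, Acc) -> maps:put(binary:at(Pat, I), M - 1 - I, Acc) end,
                #{}, lists:seq(0, M - 2)).

scan(Bin, _Pat, M, _Shifts, Pos, Seen) when Pos + M > byte_size(Bin) ->
    {nomatch, Seen};
scan(Bin, Pat, M, Shifts, Pos, Seen) ->
    case compare(Bin, Pat, Pos, M - 1, Seen) of
        {match, Seen1} ->
            {Pos, Seen1};
        {mismatch, Seen1} ->
            %% Shift by the table entry of the text byte under the last
            %% pattern position; bytes not in the pattern shift by M.
            Last = binary:at(Bin, Pos + M - 1),
            scan(Bin, Pat, M, Shifts, Pos + maps:get(Last, Shifts, M), Seen1)
    end.

%% Compare the window right to left, counting every byte comparison.
compare(_Bin, _Pat, _Pos, J, Seen) when J < 0 ->
    {match, Seen};
compare(Bin, Pat, Pos, J, Seen) ->
    case binary:at(Bin, Pos + J) =:= binary:at(Pat, J) of
        true -> compare(Bin, Pat, Pos, J - 1, Seen + 1);
        false -> {mismatch, Seen + 1}
    end.
```

For example, with Text = <<(binary:copy(<<"x">>, 24))/binary, "needle">>,
bmh_sketch:find(Text, <<"needle">>) returns {24, 10}: only 10 comparisons
for 30 bytes of input. Longer patterns skip further on each mismatch.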
>
> Is it possible that when there are 4, 8 or more cores
> sharing one cache, and all cores run code that only
> looks at every twentieth byte, we get very poor
> cache utilization?
>
> So to test that I modified wfinder8.erl to count
> space and newline characters. And on Steve's 8 core
> machine we got the following speedups with different
> number of schedulers.
>
> cores speedup
> 1-2    1.08
> 2-4    1.72
> 4-8    1.52
>
> 1-4    1.86
> 1-8    2.83
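
A counting pass of that kind could look like the sketch below. It is not
the modified wfinder8.erl (that code is not shown here); wf_count_sketch
and count/1 are hypothetical names. Unlike the Boyer-Moore search, it
touches every byte of the block.

```erlang
-module(wf_count_sketch).
-export([count/1]).

%% Returns {Spaces, Newlines} for the given binary.
count(Bin) when is_binary(Bin) -> count(Bin, 0, 0).

count(<<$\s, Rest/binary>>, S, N) -> count(Rest, S + 1, N);
count(<<$\n, Rest/binary>>, S, N) -> count(Rest, S, N + 1);
count(<<_, Rest/binary>>, S, N) -> count(Rest, S, N);
count(<<>>, S, N) -> {S, N}.
```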
>
> Are there any other things that can explain why it seems
> to scale so poorly?