[erlang-questions] How to quick calculation max Erlang's processes and scheduler can alive based on machine specs

Mon Jul 15 15:22:33 CEST 2019

On Sat, Jul 13, 2019 at 10:47 AM I Gusti Ngurah Oka Prinarjaya <
okaprinarjaya@REDACTED> wrote:

> Hi,
>
> I'm a super newbie, I had done very very simple parallel processing using
> erlang. I experimenting with my database containing about hundreds of
> thousands rows. I split the rows into different offsets then assign each
> worker-processes different rows based on offsets. For each row i doing
> simple similar text calculation using binary:longest_common_prefix/1
>
>
First, you need to recognize you have a parallelism problem, and not a
concurrency problem. So you are interested in what speedup you can get by
adding more cores, compared to a single-process solution. The key analysis
parameters are work, span and cost[0]. On top of that, you want to look at
the speedup factor (S = T_1 / T_p).
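
To make the speedup factor concrete, here is a minimal sketch. The module
name `speedup` and the `efficiency/3` helper (speedup divided by the number
of cores P) are my own additions for illustration, not something from a
library:

```erlang
-module(speedup).
-export([speedup/2, efficiency/3]).

%% S = T_1 / T_p, where T_1 is the single-process run time and
%% T_p the run time with P workers (same unit for both).
speedup(T1, Tp) when Tp > 0 ->
    T1 / Tp.

%% Speedup per core. 1.0 would be perfect linear scaling;
%% values well below 1.0 mean the extra cores are mostly overhead.
efficiency(T1, Tp, P) when P > 0 ->
    speedup(T1, Tp) / P.
```

For example, if one process takes 8000 ms and 4 workers take 2500 ms, then
`speedup:speedup(8000, 2500)` is about 3.2 and
`speedup:efficiency(8000, 2500, 4)` is about 0.8.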

> 1. How to quick calculation / dumb / simple calculation max Erlang's
> processes based on above machine specs?
>
>
This requires measurement. A single-core/process system has certain
advantages:

* It doesn't need to lock and latch.
* It doesn't need to distribute data (scatter) and recombine data (gather).

Adding more processes has an overhead, and at some point it will cease to
provide speedup. In fact, speedup might go down.

What I tend to do is napkin math on the cost of a process. I usually set
the process control block (PCB) at 2048 bytes. It is probably lower in
reality, but an upper bound is nice. If each process has to keep, say, 4096
bytes of data around, I set it at 2*4096 to account for the GC. So that is
2048 + 8192 = 10240 bytes, or around 10 kilobytes per process. If I have a
million processes, that is about 10 gigabytes of memory. If each process is
also doing network I/O, you need to account for the network buffers in the
kernel as well, etc. However, since you are looking at parallelism, this
matters less: you don't want to keep a process per row (the overhead tends
to be too big in that case, and the work is not concurrent anyway[1]).
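
The napkin math above can be sketched in a few lines. The module name
`napkin` and the 2048-byte constant are just the assumptions from this
mail, not measured values; `measured_bytes/0` asks the VM what an idle
process really costs on your system so you can compare:

```erlang
-module(napkin).
-export([per_process_bytes/1, estimate_bytes/2, measured_bytes/0]).

%% Upper-bound guess for the process control block (an assumption).
-define(PCB_BYTES, 2048).

%% PCB plus twice the live data, to leave headroom for the GC.
per_process_bytes(DataBytes) ->
    ?PCB_BYTES + 2 * DataBytes.

%% Total estimate for NumProcs processes holding DataBytes each.
estimate_bytes(NumProcs, DataBytes) ->
    NumProcs * per_process_bytes(DataBytes).

%% What the VM reports for a freshly spawned, idle process.
measured_bytes() ->
    Pid = spawn(fun() -> receive stop -> ok end end),
    {memory, Bytes} = erlang:process_info(Pid, memory),
    Pid ! stop,
    Bytes.
```

`napkin:estimate_bytes(1000000, 4096)` gives 10240000000, i.e. the roughly
10 gigabytes mentioned above.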

> 2. The running time when doing similar text processing with 10 worker, or
> 20 worker or 40 worker was very blazingly fast. So i cannot feel, i cannot
> see the difference. How to measure or something like printing total minutes
> out? So i can see the difference.
>
>
timer:tc/1 is a good start. eministat[2] is a shameless plug as well.
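
A minimal sketch of timing with timer:tc/1, which takes a fun of arity 0
and returns {Microseconds, Result}. The module name `bench` and the
"best of N runs" idea are my own choices here; a proper tool such as
eministat does real statistics instead:

```erlang
-module(bench).
-export([time_ms/1, best_of/2]).

%% Wall-clock time of one call to Fun, in milliseconds (a float).
time_ms(Fun) when is_function(Fun, 0) ->
    {Micros, _Result} = timer:tc(Fun),
    Micros / 1000.

%% Run Fun N times and keep the fastest run, to dampen noise
%% from the OS, the GC and a cold cache.
best_of(N, Fun) when is_integer(N), N > 0 ->
    lists:min([time_ms(Fun) || _ <- lists:seq(1, N)]).
```

Usage would look something like
`io:format("~.2f ms~n", [bench:best_of(5, fun() -> my_app:run(10) end)])`,
where `my_app:run/1` is a hypothetical stand-in for your own function that
starts the given number of workers. If the run is too fast to see a
difference, process more rows per run.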

> 3. How many scheduler need to active / available when i create 10
> processes? or 20 processes? 40 processes? and so on..
>
>
If your machine has 2 physical cores with two hyperthreads per core, a
good first ballpark is either 2 or 4 schedulers. Adding more just makes
them fight for the resources. The `+stbt` option might come in handy if
supported by your environment. Depending on your workload, the additional
hyperthread can change performance by anywhere from -30% to +50%. In
some cases it hurts performance:

* Cache contents can be evicted by the additional hyperthread
* If you don't have memory pressure to make a thread wait, there is little
additional power in the hyperthread
* In a laptop environment, the additional hyperthread will generate more
heat. This might make the CPU clock down, resulting in worse run times.
This is especially important on MacBooks. They have really miserable
thermals: far too powerful CPUs in a poor thermal design. That gives
them good peak performance when "sprinting" for short bursts, but bad
sustained performance, e.g., "marathons". Battery vs AC power also matters
a lot and will mess with run times.
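
To see what the VM detected and to experiment with scheduler counts, a
small sketch along these lines may help. The module name `sched` is made
up; the system_info and system_flag keys are the standard ones. At startup
the same thing can be set on the command line, e.g. `erl +S 4:2 +stbt db`
for 4 scheduler threads with 2 online and the default bind type:

```erlang
-module(sched).
-export([info/0, set_online/1]).

%% Snapshot of what the VM knows about cores and schedulers.
%% logical_processors may be the atom 'unknown' on some systems.
info() ->
    #{logical_processors => erlang:system_info(logical_processors),
      schedulers         => erlang:system_info(schedulers),
      schedulers_online  => erlang:system_info(schedulers_online),
      bind_type          => erlang:system_info(scheduler_bind_type)}.

%% Change the number of online schedulers at run time.
%% N must be between 1 and the number of scheduler threads.
%% Returns the previous value.
set_online(N) when is_integer(N), N > 0 ->
    erlang:system_flag(schedulers_online, N).
```

Run the same benchmark with 2 and then 4 schedulers online and compare the
numbers, rather than trusting the ballpark.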

As for how many processes: you want to have enough to keep all your
schedulers utilized, but not so many that your work is broken into tiny
pieces. Tiny pieces mean more scatter/gather I/O, which impedes
performance. And if that I/O is going across CPU cores, you are also
waiting on caches.
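
One way to get "a few big pieces" rather than a process per row is a
chunked parallel map. This is a sketch under my own assumptions (module
name `chunked`, rows already loaded into a list, one chunk per worker),
not a tuned implementation:

```erlang
-module(chunked).
-export([pmap_chunks/3, chunk/2]).

%% Apply Fun to every element of List using at most Workers processes,
%% each handling one contiguous chunk. Result order is preserved.
pmap_chunks(Fun, List, Workers) when is_integer(Workers), Workers > 0 ->
    Size = max(1, (length(List) + Workers - 1) div Workers),
    Parent = self(),
    %% Scatter: one linked worker per chunk, tagged with a unique ref.
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() ->
                               Parent ! {Ref, lists:map(Fun, Chunk)}
                           end),
                Ref
            end || Chunk <- chunk(List, Size)],
    %% Gather: collect the results in the order the chunks were made.
    lists:append([receive {Ref, Res} -> Res end || Ref <- Refs]).

%% Split List into chunks of at most Size elements.
chunk([], _Size) ->
    [];
chunk(List, Size) when is_integer(Size), Size > 0 ->
    {Head, Tail} = take(Size, List, []),
    [Head | chunk(Tail, Size)].

take(0, Rest, Acc)    -> {lists:reverse(Acc), Rest};
take(_, [], Acc)      -> {lists:reverse(Acc), []};
take(N, [H | T], Acc) -> take(N - 1, T, [H | Acc]).
```

A reasonable starting point for Workers is
`erlang:system_info(schedulers_online)`. Because the workers are linked, a
crash in any of them takes the caller down too, which is usually what you
want in a batch job.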

If you are really interested in parallel processing, it is probably better
to look at languages built for the problem space. Rust with its rayon
library, or something like https://futhark-lang.org/, might be better
suited. Or even look at TensorFlow. It has a really strong, optimized,
numerical core. Erlang, being bytecode interpreted, pays an overhead which
you have to balance out with either more productivity, ease of programming,
faster prototyping or the like. Erlang tends to be stronger at MIMD style
processing (and so does e.g., Go).

[0] https://en.wikipedia.org/wiki/Analysis_of_parallel_algorithms
[1] your work is classical SIMD rather than MIMD.
[2] https://github.com/jlouis/eministat

-- 
J.