[erlang-questions] Further SMP and distributed performance comments

Edwin Fine erlang-questions_efine@REDACTED
Sun Oct 12 21:20:47 CEST 2008


Sorry, this email got sent before it was complete, so if you get it twice,
my apologies. I never knew that, in Gmail, if you reply to an email while
busy with another, it seems to send the one you were busy with first. I
would have expected it to save to draft, but you live and learn.

>
> It is not at all surprising that the SMP version runs much slower than
> the non-SMP version.
> I looked at the program source and what I find there is an
> implementation that does not allow very much
> parallel execution.
> The broker process is clearly a bottleneck since it is involved in
> everything. Every other process must wait for the broker process
> before it can continue its execution.


Agreed.


>
> The other processes are also doing so little useful work that the
> task-switching and locking around the run queue become the
> dominating cost.


I also thought so - but more about that later.

> When you have benchmarks with parallel processes that hardly perform
> any work and all processes are highly dependent on
> other processes, you can expect results like this; there is nothing
> wrong or bad with the Erlang SMP implementation because of that.


You are absolutely right that the benchmark is not representative of most
real-world situations. It does, however, expose some weaknesses in the
current Erlang SMP implementation. Other languages managed to run the same
benchmark in a reasonable amount of time, so the benchmark itself
cannot be totally invalid. Two orders of magnitude difference between
the non-SMP and SMP runs is excessive under any circumstances.

I tried some further experiments and got surprising results. I'd like to
hear your opinion on this.

I first hypothesized the following:

   1. If locking of the shared run queue is the problem, then running
   multiple VMs (nodes), each with its own non-shared run queue (i.e.
   non-SMP), should make the program run faster.
   2. If the broker is the bottleneck, then running multiple brokers
   should yield a large improvement in performance.

Initially, I modified the chameneos benchmark to allow more flexibility.

   - I changed it into a distributed system that lets you choose how many
   separate Erlang nodes to run on. The nodes have to be pre-started
   manually, unfortunately, but I haven't gotten around to trying to start
   them from within the program.
   - I increased the number of brokers to one broker per node.
   - I changed the benchmark code to spawn processes evenly across the
   chosen number of nodes, with an even number of "chameneos" processes
   per node, because one of the scenarios I wanted to test has no
   cross-node communication - the broker on a given node only
   communicates with chameneos on the same node. This is to test how
   inter-node communication compares to intra-node communication.
   - The benchmark now has an extra step that gathers the intermediate
   results from all the brokers to present the final result.
   - I put lots of print statements in the code to show progress.
   - Finally, I removed the initial test that runs with only 3 chameneos
   because I only wanted to test the worst-case scenario (10 chameneos). The
   benchmark figures are therefore no longer directly comparable to the results
   on the alioth web site, but may be compared to each other only.
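For the curious, the cross-node spawning described above can be sketched roughly like this. This is illustrative only, not the attached code; chameneos:start/2 here is a stand-in for whatever the real entry point in chameneos.erl is.

```erlang
%% Rough sketch only - not the attached code. Spawns one chameneos per
%% color, round-robin across a list of pre-started nodes, so that each
%% process lands on a predictable node.
-module(spawn_sketch).
-export([spawn_across/3]).

spawn_across(Colors, Nodes, Broker) ->
    spawn_across(Colors, Nodes, Nodes, Broker, []).

%% No colors left: return the pids in spawn order.
spawn_across([], _AllNodes, _Left, _Broker, Acc) ->
    lists:reverse(Acc);
%% Exhausted the node list: wrap around and keep going.
spawn_across(Colors, AllNodes, [], Broker, Acc) ->
    spawn_across(Colors, AllNodes, AllNodes, Broker, Acc);
spawn_across([Color | Colors], AllNodes, [Node | Rest], Broker, Acc) ->
    %% spawn/4 with a node as the first argument runs the process on
    %% that remote node.
    Pid = spawn(Node, chameneos, start, [Broker, Color]),
    spawn_across(Colors, AllNodes, Rest, Broker, [Pid | Acc]).
```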

The code I modified is now a bit of a mess because (a) I am not a highly
experienced Erlang programmer and (b) I was rushing to try a whole lot of
different things, so I was hacking it, but it does the intended job.

I started four Erlang shells (nodes), all on the same 4-core Q6600 system.
Initially I started the nodes with SMP enabled then I re-ran the tests with
SMP disabled.

I decided to test the following scenarios (parameters are [number of
iterations, number of nodes]):

   1. One Erlang SMP node, 1 broker, 10 chameneos - timer:tc(broker,
   start, [6000000, 1]). [353.18 secs]
   2. Four Erlang SMP nodes, 4 brokers, 10 chameneos - timer:tc(broker,
   start, [6000000, 4]). [12.42 secs]
   3. One Erlang non-SMP node, 1 broker, 10 chameneos - timer:tc(broker,
   start, [6000000, 1]). [6.08 secs]
   4. Four Erlang non-SMP nodes, 4 brokers, 10 chameneos - timer:tc(broker,
   start, [6000000, 4]). [1.55 secs]

Please note that the above results are for brokers that are constrained to
be on the *same* node as the chameneos with which they are communicating.
When this constraint is removed, that is, when brokers communicate with
chameneos on different nodes, it is much slower, even in smp-disabled mode.
To quantify that:

Four Erlang non-SMP nodes, 4 brokers, 10 chameneos, intra-node
communications - timer:tc(broker, start, [6000000, 4]). [1.55 secs]
Four Erlang non-SMP nodes, 4 brokers, 10 chameneos, inter-node
communications - timer:tc(broker, start, [6000000, 4]). [192.6 secs]

Maybe this is to be expected, but if so, why? Should communication
between Erlang nodes *on the same physical system* really be 124 times
slower than communication kept within each node?
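To isolate the distribution overhead from the benchmark itself, one could use a crude round-trip micro-benchmark along these lines. This is a sketch I did not use in the measurements above; the module and function names are made up.

```erlang
%% Sketch (not part of the measured code): time N request/reply round
%% trips against an echo process on the local node vs. one spawned on a
%% remote node, to see the cost of the distribution layer directly.
-module(rtt).
-export([run/2, echo/0, ping/2]).

echo() ->
    receive
        {From, Msg} -> From ! Msg, echo();
        stop        -> ok
    end.

ping(_Pid, 0) -> ok;
ping(Pid, N) ->
    Pid ! {self(), ping},
    receive ping -> ok end,
    ping(Pid, N - 1).

run(Node, N) ->
    Local  = spawn(?MODULE, echo, []),
    Remote = spawn(Node, ?MODULE, echo, []),  % e.g. Node = cwork_2@myhost
    {LocalUs, ok}  = timer:tc(?MODULE, ping, [Local, N]),
    {RemoteUs, ok} = timer:tc(?MODULE, ping, [Remote, N]),
    Local ! stop, Remote ! stop,
    {local_microsecs, LocalUs, remote_microsecs, RemoteUs}.
```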

Sample output of intra-node communication:
Started broker <6364.44.0> on cwork_4@REDACTED expecting 1500000 messages;
collector pid = <0.39.0>
Started broker <6363.44.0> on cwork_3@REDACTED expecting 1500000 messages;
collector pid = <0.39.0>
Started broker <6362.44.0> on cwork_2@REDACTED expecting 1500000 messages;
collector pid = <0.39.0>
Started broker <0.64.0> on cwork_1@REDACTED expecting 1500000 messages;
collector pid = <0.39.0>
Started chameneos <0.65.0> for color blue on cwork_1@REDACTED for broker
<0.64.0>
Started chameneos <0.66.0> for color red on cwork_1@REDACTED for broker
<0.64.0>
Started chameneos <6362.45.0> for color yellow on cwork_2@REDACTED for broker
<6362.44.0>
Started chameneos <6362.46.0> for color red on cwork_2@REDACTED for broker
<6362.44.0>
Started chameneos <6363.45.0> for color yellow on cwork_3@REDACTED for broker
<6363.44.0>
Started chameneos <6363.46.0> for color blue on cwork_3@REDACTED for broker
<6363.44.0>
Started chameneos <6364.45.0> for color red on cwork_4@REDACTED for broker
<6364.44.0>
Started chameneos <6364.46.0> for color yellow on cwork_4@REDACTED for broker
<6364.44.0>
Started chameneos <0.67.0> for color red on cwork_1@REDACTED for broker
<0.64.0>
Started chameneos <0.68.0> for color blue on cwork_1@REDACTED for broker
<0.64.0>

Sample output showing inter-node communication:
Started broker <0.41.0> on cwork_1@REDACTED expecting 1500000 messages;
collector pid = <0.39.0>
Started broker <6346.44.0> on cwork_2@REDACTED expecting 1500000 messages;
collector pid = <0.39.0>
Started broker <6347.44.0> on cwork_3@REDACTED expecting 1500000 messages;
collector pid = <0.39.0>
Started broker <6348.44.0> on cwork_4@REDACTED expecting 1500000 messages;
collector pid = <0.39.0>
Started chameneos <0.53.0> for color blue on cwork_1@REDACTED for broker
<6348.44.0>
Started chameneos <0.54.0> for color red on cwork_1@REDACTED for broker
<6348.44.0>
Started chameneos <6346.45.0> for color yellow on cwork_2@REDACTED for broker
<6347.44.0>
Started chameneos <6346.49.0> for color red on cwork_2@REDACTED for broker
<6347.44.0>
Started chameneos <6347.45.0> for color yellow on cwork_3@REDACTED for broker
<6346.44.0>
Started chameneos <6347.49.0> for color blue on cwork_3@REDACTED for broker
<6346.44.0>
Started chameneos <6348.45.0> for color red on cwork_4@REDACTED for broker
<0.41.0>
Started chameneos <6348.46.0> for color yellow on cwork_4@REDACTED for broker
<0.41.0>
Started chameneos <0.55.0> for color red on cwork_1@REDACTED for broker
<6348.44.0>
Started chameneos <0.56.0> for color blue on cwork_1@REDACTED for broker
<6348.44.0>

Again, please excuse the ugly code.

To compile, just erlc broker.erl and chameneos.erl. No HiPE used in these
measurements.
To run, start nodes with the sname cwork_1, cwork_2, ... e.g.

$ erl +K true -sname cwork_1 -smp disable
- OR -
$ erl +K true -sname cwork_1 # to enable SMP

Then in cwork_1, enter:
> timer:tc(broker, start, [6000000,4]).

The 4 is the number of nodes expected (cwork_1 .. cwork_4).
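Incidentally, regarding having to pre-start the nodes by hand: the OTP slave module could probably be used to start them from within the program. Something like the following untested sketch, which assumes the calling node was itself started with -sname and that all nodes live on the local host:

```erlang
%% Untested sketch: start cwork_1 .. cwork_N on the local host using the
%% OTP slave module instead of starting shells by hand.
start_nodes(N) ->
    {ok, HostStr} = inet:gethostname(),
    Host = list_to_atom(HostStr),
    [begin
         Name = list_to_atom("cwork_" ++ integer_to_list(I)),
         %% slave:start/3 accepts extra command-line flags as a string,
         %% e.g. "+K true -smp disable" to match the non-SMP runs.
         {ok, Node} = slave:start(Host, Name, "+K true"),
         Node
     end || I <- lists:seq(1, N)].
```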

Regards,
Edwin Fine
-------------- next part --------------
A non-text attachment was scrubbed...
Name: broker.erl
Type: text/x-erlang
Size: 6084 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20081012/9fa25e1e/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: chameneos.erl
Type: text/x-erlang
Size: 2004 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20081012/9fa25e1e/attachment-0001.bin>
