[erlang-questions] My frustration with Erlang

Edwin Fine erlang-questions_efine@REDACTED
Sun Sep 14 10:39:40 CEST 2008


I think you have misunderstood my reasoning here. If you have 8 Erlang VMs
running, each with +S 1, how exactly does this defeat the purpose of having a
multi-core machine? Long before threading models came into vogue, multiple
processes were taking advantage of multi-CPU systems simply by letting the
OS scheduler choose which CPU to run the next runnable process on.
Since Erlang processes are "green" threads, they don't individually use the
threading model of the underlying operating system anyway. Apart from I/O
operations on files (controlled by +A, I believe), each VM uses, AFAIK,
one OS thread per scheduler, so +S 8 will use 8 OS threads. When you have
8 threads sharing something (which they will when running SMP), there is a
risk of contention slowing things down.
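
Concretely, the setup I have in mind is one single-scheduler VM per core on
an 8-core box, letting the OS spread them across the cores. Something like
this (node names are just illustrative):

    erl -sname n1 +S 1 -detached
    erl -sname n2 +S 1 -detached
    ...
    erl -sname n8 +S 1 -detached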

I wonder what happens when Erlang processes running on the scheduler thread
of one core send lots of messages to Erlang processes running on a scheduler
on a different core? There HAS to be a lock there somewhere while the
message moves from the memory space of the first process to that of the
second, or, if shared memory is being used (the more likely scenario), there
is going to be some locking of shared data structures. Now, if there are
thousands of processes sending hundreds of thousands of messages in
aggregate, I believe this will not scale well when the messages cross
scheduler boundaries. I could be wrong, but it fits the anecdotal data. It
would be very interesting to see the effect of designing an application so
that processes that send lots of messages to each other run on the same
scheduler, "clustering" the messages so that they stay within the same VM's
memory space. Using multiple VMs with +S 1 would force this to happen
because there IS only one scheduler thread, so hopefully the VM doesn't take
any SMP locks/mutexes under those conditions. This is in contrast to
assuming a uniform model in which the cost of sending a message is the same
even though it may be going to a different process space and incurring the
cost of a lock. The locks used are POSIX semaphores, IIRC, and they are not
cheap.
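
Here is a minimal sketch of the kind of measurement I have in mind (my own
throwaway test code, nothing from OTP): time N message round trips between
two processes, then run it once under "erl +S 1" and once under an SMP VM
with several schedulers, and compare:

    -module(pingpong).
    -export([run/1, ping/2]).

    %% Time N message round trips between this process and an echo process.
    run(N) ->
        Echo = spawn(fun loop/0),
        {Micros, ok} = timer:tc(?MODULE, ping, [Echo, N]),
        Echo ! stop,
        io:format("~p round trips took ~p us (~.2f us each)~n",
                  [N, Micros, Micros / N]).

    ping(_Echo, 0) ->
        ok;
    ping(Echo, N) ->
        Echo ! {self(), ping},
        receive
            {Echo, pong} -> ping(Echo, N - 1)
        end.

    loop() ->
        receive
            {From, ping} -> From ! {self(), pong}, loop();
            stop -> ok
        end.

If the cross-scheduler locking is significant, pingpong:run(100000) should
show a visibly higher per-message cost on the SMP VM even though only two
processes are involved.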

I guess what I am trying to say is that the basic assumption on which Erlang
applications seem to have been designed up to now, namely that the cost of
sending a message is extremely low, is perhaps not as true as it used to be
when using SMP. I would be very interested in hearing from Joe Armstrong
about this. I seem to recall that he wrote that, for a system to scale, the
cost of the work a process does must be much greater than the cost of
starting the process or sending a message to it. This is true of any IPC,
even cheap IPC like Erlang's. On top of that, things that use shared
structures like ETS heavily, for example Mnesia transactions, are possibly
going to suffer in the SMP context. All of this is just conjecture based on
my own work and anecdotal evidence presented by some others, but I feel in
my bones that there is something to this. Look at it this way: Erlang got
much of its performance from eliminating locks. Running on a 1024-processor
system in SMP mode and treating it like one big uniform processor is going
to backfire badly. I haven't seen much discussion about this. Maybe it's too
obvious to mention.
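
To make the granularity point concrete, here is a hypothetical sketch (not
code from any project mentioned here): hand each worker one big chunk
rather than one message per item, so that the work per message dwarfs the
messaging overhead:

    -module(granularity).
    -export([coarse/2]).

    %% Apply F to Items in parallel, but hand each worker a large chunk so
    %% the computation per message dominates the cost of the message itself.
    coarse(Items, F) ->
        Parent = self(),
        Pids = [spawn(fun() -> Parent ! {self(), lists:map(F, Chunk)} end)
                || Chunk <- chunk(Items, 1000)],
        lists:append([receive {Pid, Result} -> Result end || Pid <- Pids]).

    %% Split a list into chunks of at most N elements.
    chunk([], _N) -> [];
    chunk(Items, N) when length(Items) =< N -> [Items];
    chunk(Items, N) ->
        {Chunk, Rest} = lists:split(N, Items),
        [Chunk | chunk(Rest, N)].

Something like granularity:coarse(lists:seq(1, 100000), fun(X) -> X * X end)
then sends a couple of hundred messages instead of a hundred thousand.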

There is an interesting discussion (
http://www.erlang.org/pipermail/erlang-questions/2008-January/032273.html)
about assigning individual +S 1 Erlang VMs to given CPUs using processor
affinity. This could help considerably by keeping more of a VM's running
state in the processor's code and data caches than if the VM's thread were
switched between CPUs by the OS.
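
On Linux, for example, the pinning could be done with taskset (an untested
sketch; node names again illustrative):

    taskset -c 0 erl -sname n1 +S 1 -detached
    taskset -c 1 erl -sname n2 +S 1 -detached
    ...
    taskset -c 7 erl -sname n8 +S 1 -detached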

I plan Real Soon Now ;-) to do some extensive research into this to see if
there is merit to it. Just as soon as I get over the hump in a current
project.

Anyway, I hope this clarifies my thinking and leaves you with less of an
issue with it.


On Sun, Sep 14, 2008 at 3:21 AM, Valentin Micic <valentin@REDACTED> wrote:

> I haven't read the whole correspondence (it seems to have been going on
> for way too long), but I'd like to add my 2c worth...
>
> While the SMP (+S 1) approach may solve some problems, it defeats the purpose of
> having a multi-core machine. Please note that multi-core machines have lower
> clock speeds, and thus generally run slower per CPU core. IMHO, if
> +S 1 solves your problem, maybe you should revisit your code -- I think
> it is wrong to expect that the same code would work better on SMP just
> because you had such expectations. For example, it is a known fact that ETS
> works slower in an SMP environment.
> Also, one should not forget to use +A in addition to +S -- although you do
> not have any disk I/O, I think this parameter is relevant for PORT
> scheduling, thereby improving the performance of your I/O.
>
> V.
>
>
> ----- Original Message ----- From: "Kevin Scaldeferri" <kevin@REDACTED>
> To: "Edwin Fine" <erlang-questions_efine@REDACTED>
> Cc: "erlang-questions Questions" <erlang-questions@REDACTED>
> Sent: Sunday, September 14, 2008 12:07 AM
> Subject: Re: [erlang-questions] My frustration with Erlang
>
>
>
>> On Sep 13, 2008, at 1:56 PM, Edwin Fine wrote:
>>
>>  You'd probably have to partition the load to round-robin across the
>>> individual VMs, possibly using some front-end load-balancing
>>> hardware. This is why I keep harping on this: some time ago I put
>>> the system I am working on under heavy load to test the maximum
>>> possible throughput. There was no appreciable disk I/O. The kicker
>>> is that I did not see an even distribution of load across the 4
>>> cores of my box. In fact, it looked as if one or maybe two cores
>>> were being used at 100% and the rest were idle. When I re-ran the
>>> test on a whim, using only 1 non-SMP (+S 1) node, I actually got
>>> better performance.
>>>
>>> This seemed counter-intuitive and against the "Erlang SMP scales
>>> linearly for CPU-intensive loads" idea. I have not done a lot of
>>> investigation into this because I have other fish to fry right now,
>>> but the folks over at LShift (RabbitMQ) - assuming I did not
>>> misunderstand them - wrote that they had seen similar behavior when
>>> running clustered Rabbit nodes (i.e. better performance from N
>>> single-CPU nodes than N +S N nodes). However, they, like me, are not
>>> ready to come out and state this bluntly as a fact because (I
>>> believe) they feel not enough investigation has been done to make
>>> this a conclusive case.
>>>
>>
>> I've also been seeing similar behavior trying to parallelize the
>> alioth shootout code, FWIW.  I'd also say it's premature to draw any
>> concrete conclusions, but it's another anecdotal data point.
>>
>> (Also, on the particular OS & hardware the benchmarks run on, the
>> total CPU usage nearly doubles for the parallel implementations.  On
>> my 2-core mac, though, I see no more than a 10% increase in total CPU
>> usage, and a near 100% improvement in the wall-time, as one should
>> expect on the embarrassingly parallel problems.  Dunno if this is
>> related to the OS, the chip (Core 2 Duo vs Core 2 Quad), HiPE, or what.)
>>
>>
>> -kevin
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://www.erlang.org/mailman/listinfo/erlang-questions
>>
>
>
>