[erlang-questions] My frustration with Erlang

Thomas Lindgren thomasl_erlang@REDACTED
Sun Sep 14 16:58:12 CEST 2008


For what it's worth, I basically agree with Edwin's reasoning: letting the cores run nodes rather than SMP/threads in essence means there will be "minimal" contention (mediated by the kernel) and "minimal" synchronization (mediated by sockets).

That said, the reasons for running threads rather than nodes might be:
- easier to set up and change (porting to 16 cores is "+S 16", or perhaps "+S 20")
- easier to load-balance
- less costly messages and cheaper table access than remote operations (see the sketch after this list)
- less memory used due to sharing inside the VM
- perhaps better memory performance, again due to sharing
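
On the message-cost point, here is a rough, untested sketch of the comparison I have in mind (module and node names are made up; it assumes a second connected node on which someone has done register(echo, spawn(msgcost, echo, []))):

    -module(msgcost).
    -export([bench/1, echo/0, pingpong/2]).

    %% Echo server: bounce every message straight back to the sender.
    echo() ->
        receive {From, Msg} -> From ! Msg, echo() end.

    %% Time N round trips against a local pid and against a
    %% registered process on a remote node.
    bench(N) ->
        Local = spawn(fun echo/0),
        {LocalUs, ok} = timer:tc(?MODULE, pingpong, [Local, N]),
        Remote = {echo, 'other@somehost'},  %% hypothetical node name
        {RemoteUs, ok} = timer:tc(?MODULE, pingpong, [Remote, N]),
        io:format("local: ~p us/trip, remote: ~p us/trip~n",
                  [LocalUs / N, RemoteUs / N]).

    pingpong(_To, 0) -> ok;
    pingpong(To, N) ->
        To ! {self(), ping},
        receive ping -> ok end,
        pingpong(To, N - 1).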

A further technical limitation of nodes, once pointed out to me by Sean Hinde, is that sockets can't be sent between nodes today.

Best,
Thomas

--- On Sun, 9/14/08, Edwin Fine <erlang-questions_efine@REDACTED> wrote:

> From: Edwin Fine <erlang-questions_efine@REDACTED>
> Subject: Re: [erlang-questions] My frustration with Erlang
> To: "Valentin Micic" <valentin@REDACTED>
> Cc: "erlang-questions Questions" <erlang-questions@REDACTED>
> Date: Sunday, September 14, 2008, 10:39 AM
> I think you have misunderstood my reasoning here. If you
> have 8 Erlang VMs
> going, each with +S 1, how exactly does this defeat the
> purpose of having a
> multi-core machine? Long before threading models came into
> vogue, multiple
> processes were taking advantage of multi-CPU systems simply
> by letting the
> OS scheduler choose the CPU on which to run the next
> runnable process.
> Since Erlang processes are "green" threads, they
> don't individually map onto
> threads of the underlying operating system anyway.
> Each VM, other
> than for I/O operations on files (controlled by +A, I
> believe), uses AFAIK
> one O/S thread per scheduler. So +S 8 will use 8 O/S
> threads. When you have
> 8 threads sharing something (which they will when running
> SMP), there is a
> risk of contention slowing things down.
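> 
> (As a quick sanity check that the scheduler count is what one
> thinks, the shell can report it; the output below is illustrative
> for an 8-core box:
> 
>     1> erlang:system_info(smp_support).
>     true
>     2> erlang:system_info(schedulers).
>     8
> 
> i.e., +S 8 really does mean 8 scheduler threads.)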
> 
> I wonder what happens when Erlang processes running on the
> scheduler thread
> of one core send lots of messages to Erlang processes
> running on a scheduler
> on a different core? There HAS to be a lock there somewhere
> while the
> message moves from the memory space of the first process to
> the second one,
> or if shared memory is being used (more likely scenario),
> there is going to
> be some locking of shared data structures. Now if there are
> thousands of
> processes sending in aggregate hundreds of thousands of
> messages, I believe
> this will not scale well if the messages cross scheduler
> boundaries. I could
> be wrong, but it fits the anecdotal data. It would be very
> interesting to
> see the effect of designing an application so that
> processes that send lots
> of messages to each other run on the same scheduler, sort
> of "clustering"
> the messages so that they stay within the same VM's
> memory space. Using
> multiple VMs with +S 1 would force this to happen because
> there IS only one
> scheduler thread, so hopefully the VM doesn't use any
> SMP locks/mutexes
> under those conditions. This is in contrast to assuming a
> uniform model
> where the cost of sending messages is the same regardless
> of the fact that
> it may be going to a different process space and incurring
> the cost of a
> lock. The locks used are Posix semaphores, IIRC, and they
> are not cheap.
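> 
> To make that measurable, here is an untested sketch one could run
> unchanged under "erl +S 1" and then "erl +S 4" on the same box and
> compare wall-clock times (all names are made up):
> 
>     -module(flood).
>     -export([run/2, collect/2]).
> 
>     %% Senders processes each fire MsgsEach messages at a single
>     %% collector; print the total wall-clock time in microseconds.
>     run(Senders, MsgsEach) ->
>         Total = Senders * MsgsEach,
>         Collector = spawn(?MODULE, collect, [self(), Total]),
>         Start = erlang:now(),
>         [spawn(fun() -> send_n(Collector, MsgsEach) end)
>          || _ <- lists:seq(1, Senders)],
>         receive done -> ok end,
>         io:format("~p msgs took ~p us~n",
>                   [Total, timer:now_diff(erlang:now(), Start)]).
> 
>     collect(Parent, 0) -> Parent ! done;
>     collect(Parent, N) ->
>         receive _ -> collect(Parent, N - 1) end.
> 
>     send_n(_To, 0) -> ok;
>     send_n(To, N) -> To ! msg, send_n(To, N - 1).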
> 
> I guess what I am trying to say is that the basic
> assumption on top of which Erlang
> applications seem to have been designed up to
> now -- that the cost of
> sending a message is extremely low -- is perhaps
> not as true as it
> used to be when using SMP. I would be very interested in
> hearing from Joe
> Armstrong about this. I seem to recall that he wrote
> something about the
> cost of what a process does needing to be much greater than
> the cost of
> starting a process or sending a message to it in order to
> scale. This is
> true of any IPC, even cheap IPC like Erlang. On top of
> that, things that use
> shared structures like ETS heavily, for example Mnesia
> transactions, are
> possibly going to suffer in the SMP context. All of this is
> just conjecture
> based on my own work and anecdotal evidence presented by
> some others, but I
> feel in my bones that there is something to this. Look at
> it this way:
> Erlang got its major performance from eliminating locks.
> Running on a 1024
> processor system in SMP mode and treating it like it's
> one big uniform
> processor is going to backfire badly. I haven't seen
> much discussion about
> this. Maybe it's too obvious to mention.
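> 
> For the ETS side, a minimal (again untested) probe is to let many
> processes hammer one public table with ets:update_counter/3 and
> compare +S 1 against +S N (all names illustrative):
> 
>     -module(etshammer).
>     -export([run/2, worker/3]).
> 
>     run(Procs, OpsEach) ->
>         T = ets:new(hammer, [public, set]),
>         ets:insert(T, {hits, 0}),
>         Start = erlang:now(),
>         [spawn(?MODULE, worker, [self(), T, OpsEach])
>          || _ <- lists:seq(1, Procs)],
>         [receive done -> ok end || _ <- lists:seq(1, Procs)],
>         io:format("~p ops took ~p us~n",
>                   [Procs * OpsEach,
>                    timer:now_diff(erlang:now(), Start)]),
>         ets:delete(T).
> 
>     %% Each worker bumps the shared counter OpsEach times.
>     worker(Parent, _T, 0) -> Parent ! done;
>     worker(Parent, T, N) ->
>         ets:update_counter(T, hits, 1),
>         worker(Parent, T, N - 1).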
> 
> There is an interesting discussion (
> http://www.erlang.org/pipermail/erlang-questions/2008-January/032273.html)
> about assigning individual +S 1 Erlang VMs to a given CPU
> using processor
> affinity. This could help considerably by leaving more of a
> VM's running
> state cached in the processor code and data caches than if
> the VM's thread
> were to be switched between CPUs by the OS.
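> 
> (On Linux, something like "taskset -c 0 erl +S 1" should pin one
> such VM to core 0 -- exact flag usage from memory, so check the
> man page.)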
> 
> I plan Real Soon Now ;-) to do some extensive research into
> this to see if
> there is merit to it. Just as soon as I get over the hump
> in a current
> project.
> 
> Anyway, hopefully this clarifies my thinking to you and you
> have less of an
> issue with it.
> 
> 
> On Sun, Sep 14, 2008 at 3:21 AM, Valentin Micic
> <valentin@REDACTED>wrote:
> 
> > I haven't read the whole correspondence (it seems to have
> > been going on for way
> > too long), but I'd like to add my 2c worth...
> >
> > While the SMP (+S 1) approach may solve some problems, it
> > defeats the purpose of
> > having a multi-core machine. Please note that
> > multi-core machines have lower
> > clock speeds, and thus generally run slower per
> > CPU core. IMHO, if
> > +S 1 solves your problem, maybe you should revisit
> > your code -- I think it
> > is wrong to expect that the same code would work
> > better on SMP just
> > because you had such expectations. For example, it is
> > a known fact that ETS
> > works more slowly in an SMP environment.
> > Also, one should not forget to use +A in addition to
> > +S -- although you do
> > not have any disk I/O, I think this parameter is
> > relevant for port
> > scheduling, and can therefore improve the performance
> > of your I/O.
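> >
> > For example (the numbers are purely illustrative):
> >
> >     erl +S 1 +A 8
> >
> > starts one scheduler thread plus a pool of 8 async threads
> > for drivers to offload blocking work onto.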
> >
> > V.
> >
> >
> > ----- Original Message ----- From: "Kevin
> Scaldeferri" <
> > kevin@REDACTED>
> > To: "Edwin Fine"
> <erlang-questions_efine@REDACTED>
> > Cc: "erlang-questions Questions"
> <erlang-questions@REDACTED>
> > Sent: Sunday, September 14, 2008 12:07 AM
> > Subject: Re: [erlang-questions] My frustration with
> Erlang
> >
> >
> >
> >> On Sep 13, 2008, at 1:56 PM, Edwin Fine wrote:
> >>
> >>  You'd probably have to partition the load to
> round-robin across the
> >>> individual VMs, possibly using some front-end
> load-balancing
> >>> hardware. This is why I keep harping on this:
> some time ago I put
> >>> the system I am working on under heavy load to
> test the maximum
> >>> possible throughput. There was no appreciable
> disk I/O. The kicker
> >>> is that I did not see an even distribution of
> load across the 4
> >>> cores of my box. In fact, it looked as if one
> or maybe two cores
> >>> were being used at 100% and the rest were
> idle. When I re-ran the
> >>> test on a whim, using only 1 non-SMP (+S 1)
> node, I actually got
> >>> better performance.
> >>>
> >>> This seemed counter-intuitive and against the
> "Erlang SMP scales
> >>> linearly for CPU-intensive loads" idea. I
> have not done a lot of
> >>> investigation into this because I have other
> fish to fry right now,
> >>> but the folks over at LShift (RabbitMQ) -
> assuming I did not
> >>> misunderstand them - wrote that they had seen
> similar behavior when
> >>> running clustered Rabbit nodes (i.e. better
> performance from N
> >>> single-CPU nodes than N +S N nodes). However,
> they, like me, are not
> >>> ready to come out and state this bluntly as a
> fact because (I
> >>> believe) they feel not enough investigation
> has been done to make
> >>> this a conclusive case.
> >>>
> >>
> >> I've also been seeing similar behavior trying
> to parallelize the
> >> alioth shootout code, fwiw.  I'd also say
> it's premature to draw any
> >> concrete conclusions, but this is another anecdotal data point.
> >>
> >> (Also, on the particular OS & hardware the
> benchmarks run on, the
> >> total CPU usage nearly doubles for the parallel
> implementations.  On
> >> my 2-core mac, though, I see no more than a 10%
> increase in total CPU
> >> usage, and a near 100% improvement in the
> wall-time, as one should
> >> expect on the embarrassingly parallel problems. 
> >> Dunno if this is
> >> related to the OS, the chip (Core 2 Duo vs Core 2
> Quad), HiPE, or what.)
> >>
> >>
> >> -kevin
> >> _______________________________________________
> >> erlang-questions mailing list
> >> erlang-questions@REDACTED
> >>
> http://www.erlang.org/mailman/listinfo/erlang-questions
> >>
> >
> >
> >
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://www.erlang.org/mailman/listinfo/erlang-questions