[erlang-questions] gen_server bottleneck

Fri Dec 14 20:21:02 CET 2012

Dear Garrett, Daniel and Olivier,

Thanks for your responses. I will try your suggestions.

@Olivier: Thanks for sending the Sim-Diasca sources and docs earlier. The
ns-3 simulator, which is the basis for NSIME, is discrete-event. Since
Sim-Diasca was discrete-time, I put it on the back burner and haven't had a
chance to check it out.

@All: Let me elaborate a little on the bottleneck I am seeing.

The simulation scenario consists of 10,000 UDP echo client-server pairs,
with a point-to-point connection between the two nodes of each pair. Each
UDP echo client sends 10 packets to a unique UDP server with an inter-packet
time of 1s. Each UDP server replies to every packet it receives. In terms of
the simulation, the 10k UDP echo clients each schedule a packet send event
using a gen_server:call on nsime_simulator. When the simulation is run, the
send events are executed, resulting in receive events being scheduled at the
UDP echo servers. The receive events are executed, resulting in reply events
being scheduled. Every event that is scheduled is a gen_server:call on the
nsime_simulator process.
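
To make the scheduling path concrete, here is a minimal sketch of a single-process event scheduler of the kind described above. It is not the NSIME code: the module name sched_sketch, the function names, and the use of gb_trees keyed by timestamp are all assumptions made for illustration. The point it shows is that every schedule/2 call from every client is serialized through one process.

```erlang
-module(sched_sketch).
-behaviour(gen_server).

%% Hypothetical API, not the nsime_simulator interface.
-export([start_link/0, schedule/2, take_earliest/0]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Each caller blocks here until the single server process replies.
schedule(Time, {_M, _F, _A} = MFA) ->
    gen_server:call(?MODULE, {schedule, Time, MFA}).

%% Returns all events sharing the smallest timestamp, or 'none'.
take_earliest() ->
    gen_server:call(?MODULE, take_earliest).

init([]) ->
    {ok, gb_trees:empty()}.

handle_call({schedule, Time, MFA}, _From, Tree) ->
    Events = case gb_trees:lookup(Time, Tree) of
                 {value, Existing} -> [MFA | Existing];
                 none -> [MFA]
             end,
    {reply, ok, gb_trees:enter(Time, Events, Tree)};
handle_call(take_earliest, _From, Tree) ->
    case gb_trees:is_empty(Tree) of
        true ->
            {reply, none, Tree};
        false ->
            {_Time, Events, Rest} = gb_trees:take_smallest(Tree),
            %% Reverse so events come back in the order they were scheduled.
            {reply, {events, lists:reverse(Events)}, Rest}
    end.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

With 10k clients calling schedule/2 at the same simulated instant, all 10k requests queue up in this one process's mailbox, however many cores are available.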

The parallelism I am trying to exploit right now is in the form of
simultaneous events (events with the same timestamp). All the 10k send
events are simultaneous, the 10k receives are simultaneous because the
point-to-point connections between each pair have the same delay and data
rate, and the 10k reply events are simultaneous.

The parallel execution of the simultaneous events is done using the
following code fragment from src/nsime_simulator.erl. The gen_server:call
returns a list of simultaneous events, and Stephen Marsh's plists module
(https://github.com/rmies/plists) is used to execute each event (an MFA
triple) in the list by dividing the list among 5 processes. The value 5 was
chosen for a quad-core machine; the plists code suggests using number of
cores + 1. Each time an event is executed, it might result in another event
being scheduled, which is itself a gen_server:call to nsime_simulator.

parallel_run() ->
    case gen_server:call(?MODULE, parallel_run) of
        {events, EventList} ->
            plists:foreach(
                fun(Event) ->
                    erlang:apply(
                        Event#nsime_event.module,
                        Event#nsime_event.function,
                        Event#nsime_event.arguments
                    )
                end,
                EventList,
                {processes, 5}
            ),
            ?MODULE:parallel_run();
        none ->
            simulation_complete
    end.
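
For readers unfamiliar with plists, the following is a rough sketch of what dividing a list among a fixed number of processes amounts to. It is not the plists implementation; the module chunk_sketch and its foreach/3 are hypothetical names, and the contiguous chunking strategy is an assumption.

```erlang
-module(chunk_sketch).
-export([foreach/3]).

%% Apply Fun to every element of List using at most NumProcs worker
%% processes, and return only after all workers have finished.
foreach(Fun, List, NumProcs) when NumProcs > 0 ->
    Size = max(1, (length(List) + NumProcs - 1) div NumProcs),
    Workers = [spawn_monitor(fun() -> lists:foreach(Fun, Chunk) end)
               || Chunk <- split(List, Size)],
    lists:foreach(fun wait/1, Workers).

%% Block until one worker exits; propagate abnormal exits to the caller.
wait({Pid, Ref}) ->
    receive
        {'DOWN', Ref, process, Pid, normal} -> ok;
        {'DOWN', Ref, process, Pid, Reason} -> exit({worker_failed, Reason})
    end.

%% Cut List into contiguous chunks of at most Size elements.
split([], _Size) ->
    [];
split(List, Size) when length(List) =< Size ->
    [List];
split(List, Size) ->
    {Chunk, Rest} = lists:split(Size, List),
    [Chunk | split(Rest, Size)].
```

Note that the workers only parallelize the body of Fun. If Fun itself makes a gen_server:call to one shared process, as the event handlers here do when they schedule follow-up events, those calls are still handled one at a time.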

On a quad-core machine, the ns-3 C++ code for this example runs in 59
seconds while occupying only one core. The NSIME code runs in 65 seconds
while occupying about 300% out of a maximum of 400%. That doesn't sound too
bad. Another non-trivial C++ vs Erlang benchmark, one can argue. But the
NSIME code which does not execute the simultaneous events in parallel runs
in 79 seconds while occupying only one core. So roughly three cores' worth
of CPU buys a speedup of only about 1.2x (79s down to 65s), which is far
from linear.
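
As a back-of-the-envelope check on these numbers, Amdahl's law can be inverted to estimate what fraction of the run is effectively parallel. This is only a sketch: it assumes the 79s and 65s runs do the same work, that 4 cores are available, and that Amdahl's simple model applies at all. The module and function names are made up for illustration.

```erlang
-module(amdahl_sketch).
-export([speedup/2, parallel_fraction/2]).

%% Observed speedup of the parallel run over the sequential run.
speedup(SequentialSecs, ParallelSecs) ->
    SequentialSecs / ParallelSecs.

%% Amdahl's law: S = 1 / ((1 - P) + P / N).
%% Solving for P gives P = (1 - 1/S) / (1 - 1/N).
parallel_fraction(Speedup, NumCores) when NumCores > 1 ->
    (1 - 1 / Speedup) / (1 - 1 / NumCores).
```

For example, amdahl_sketch:speedup(79, 65) is about 1.22, and feeding that into amdahl_sketch:parallel_fraction(S, 4) gives about 0.24. Under these assumptions, less than a quarter of the run behaves as if it were parallel, which is consistent with most of the time being spent in something serialized.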

Even more surprising is that on a 32-core machine parallel_run takes 68
seconds when the simultaneous event list is divided among 32 processes.
Since the 32-core and quad-core machines have different CPU speeds, I also
ran the simulation on the 32-core machine with the simultaneous event list
divided among 5 processes, and that took 66 seconds. So I actually saw a
slight slowdown when I used more cores. When I divided the event list among
32 processes, only 30-40% of each core was being utilized (as seen in
htop).

I hope my description makes some sense. My gut feeling is that the single
nsime_simulator process is the bottleneck. I will try to confirm it using
the Erlang profiling tools. I was hoping pg or pg2 could provide a solution
by distributing the message handling workload. I could schedule events in a
randomly chosen process in the process group and then collect the earliest
events from all the process group members. But to choose a random process I
may have to use pg2:get_closest_pid, which itself may become a bottleneck.
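
A rough sketch of that idea, without pg or pg2: start N scheduler processes, let each caller pick one by hashing its own pid (so no central lookup process is involved), and have the driver merge the earliest events from all of them. Everything here (the module shard_sketch, the function names, the hashing choice) is an assumption for illustration and has not been measured.

```erlang
-module(shard_sketch).
-behaviour(gen_server).

-export([start_shards/1, schedule/3, take_earliest/1]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Start N unregistered scheduler processes; return their pids as a tuple.
start_shards(N) when N > 0 ->
    list_to_tuple([begin
                       {ok, Pid} = gen_server:start_link(?MODULE, [], []),
                       Pid
                   end || _ <- lists:seq(1, N)]).

%% Pick a shard by hashing the caller's pid; this is computed locally.
schedule(Shards, Time, MFA) ->
    Index = erlang:phash2(self(), tuple_size(Shards)) + 1,
    gen_server:call(element(Index, Shards), {schedule, Time, MFA}).

%% Find the smallest timestamp across all shards, then drain it from each.
%% Assumes no events are being scheduled while this runs.
take_earliest(Shards) ->
    Pids = tuple_to_list(Shards),
    case [T || Pid <- Pids, {ok, T} <- [gen_server:call(Pid, peek)]] of
        [] ->
            none;
        Times ->
            Min = lists:min(Times),
            {events, lists:append([gen_server:call(Pid, {take, Min})
                                   || Pid <- Pids])}
    end.

init([]) ->
    {ok, gb_trees:empty()}.

handle_call({schedule, Time, MFA}, _From, Tree) ->
    Old = case gb_trees:lookup(Time, Tree) of
              {value, Events} -> Events;
              none -> []
          end,
    {reply, ok, gb_trees:enter(Time, [MFA | Old], Tree)};
handle_call(peek, _From, Tree) ->
    Reply = case gb_trees:is_empty(Tree) of
                true -> empty;
                false -> {Time, _} = gb_trees:smallest(Tree), {ok, Time}
            end,
    {reply, Reply, Tree};
handle_call({take, Time}, _From, Tree) ->
    case gb_trees:lookup(Time, Tree) of
        {value, Events} -> {reply, Events, gb_trees:delete(Time, Tree)};
        none -> {reply, [], Tree}
    end.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

The collection step in take_earliest/1 is still sequential, so this spreads the 10k schedule calls over N mailboxes but does not remove coordination entirely. Whether it helps would need to be confirmed by profiling.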

sarva

On Fri, Dec 14, 2012 at 9:35 PM, Daniel Luna <daniel@REDACTED> wrote:

> On 14 December 2012 09:51, Garrett Smith <g@REDACTED> wrote:
> > On Fri, Dec 14, 2012 at 8:47 AM, Garrett Smith <g@REDACTED> wrote:
> >> Hi Saravanan,
> >> If you're bottlenecking on CPU (all your cores are fully utilized at
> >> peak load) then you need either a faster machine or you'll need to
> >> distribute your application to multiple machines.
> >
> > I should add there are a number of ways you can improve efficiency, short
> > of adding hardware resources. The big win, once you understand what to
> > target, is C ports (or NIFs).
>
> But long before you start looking at either NIFs or new hardware, look
> into the complexity of the code itself.  Does that expensive function
> even have to be called in all cases, etc.
>
> I guess this is my pet peeve when it comes to optimizing anything.
>
> First you optimize for readability (often gaining speed or at least
> finding issues)
> Then you measure
> Then you optimize the hotspots you've discovered by measuring
> *Then* you can start looking into hardware or non-Erlang solutions
>
> I've also seen situations where minor changes in the requirements have
> seen the possibility to speed up code by a factor 10 so that's also a
> possibility.
>
> Cheers,
>
> Daniel
>