[erlang-questions] gen_server bottleneck

Saravanan Vijayakumaran sarva.v@REDACTED
Sat Dec 15 03:22:12 CET 2012

Dear Max,
                There are in fact 10,000 client and server processes. Each
of them is a gen_server [1][2]. These are just simulation entities which
are connected by point-to-point channels [3] which are also a simulation

The purpose of network simulation is to guide decision making with respect
to choice of network protocols and parameters in a particular scenario. The
scenarios are typically too expensive to try out in real networks. Imagine
a thousand node network placed in different cities.

Regarding discrete event simulation, suppose we want to figure out how many
runways to have in a new airport. The more the runways the more costly it
is but it is also will result in less waiting time for arriving/departing
flights. One way to characterize the runway count vs waiting time tradeoff
is to use queueing theory. An alternative is to do simulate flight arrivals
and departures. At any point of time, the simulated world will consists of
flights in transit, flights waiting to depart, flights waiting to land,
flights departing and flights landing. In discrete time simulation, the
state of the simulated world is updated at a regular time step (say every
minute). Note that this is simulated time and not wall-clock time. An
alternative which is much faster to execute is discrete event simulation.
Here there is a queue of simulation events with the earliest events at the
head of the queue. The events are executed one by one by picking the event
at the head of the  queue. When a flight departure event is executed, a
waiting to land event is inserted into the queue having a timestamp equal
to current time + the transit time of the flight. When the waiting to land
event is executed, if a runway is free a land event is inserted into the
queue having timestamp equal to the current time + landing time. If a
runway is not free, a circle the airport and check again event is inserted
into the queue having timestamp equal to the current time + circling time.
The purpose of the simulation queue is to provide causal ordering of the
simulation events.

In my case, I think it is the simulation queue which is becoming the
bottleneck. I must add that network simulation falls under the category of
number-crunching applications which Erlang is not supposed to be suited
for. But people have been trying to accelerate it in C++-based simulators
using pthreads. I thought it was worth a try in Erlang given the amount of
work which has already gone into Erlang SMP.


[1] https://github.com/avras/nsime/blob/master/src/nsime_udp_echo_client.erl
[2] https://github.com/avras/nsime/blob/master/src/nsime_udp_echo_server.erl
[3] https://github.com/avras/nsime/blob/master/src/nsime_ptp_channel.erl

On Sat, Dec 15, 2012 at 1:42 AM, Max Bourinov <bourinov@REDACTED> wrote:

> Hi guys,
> Saravanan, maybe I am missing the point, but why not to create 10.000
> clients and servers processes? In the code you provided I don't see much
> process creation...
> Described load is relatively small and you should not see any performance
> issues with this amount of UDP traffic.
> p.s.: I am not familiar with discrete system simulation, so maybe my view
> is absolutely wrong))
> Best regards,
> Max
> On Fri, Dec 14, 2012 at 11:21 PM, Saravanan Vijayakumaran <
> sarva.v@REDACTED> wrote:
>> Dear Garrett, Daniel and Olivier,
>>                                               Thanks for your responses.
>> I will try your suggestions.
>> @Olivier: Thanks for sending the Sim-Diasca sources and docs earlier. The
>> ns-3 simulator which is the basis for NSIME is discrete event. Since
>> Sim-Diasca was discrete time, I put it on the backburner and haven't got a
>> chance to check it out.
>> @All: Let me elaborate a little on the bottleneck I am seeing.
>> The simulation scenario consists of 10,000 UDP echo client server pairs
>> with a pairwise point to point connection between them. Each UDP echo
>> client sends 10 packets to a unique UDP server with a inter-packet time of
>> 1s. Each UDP server replies to every packet it receives. In terms of the
>> simulation, the 10k UDP echo clients each schedule a packet send event
>> using a gen_server:call on nsime_simulator. When the simulation is run, the
>> send events are executed resulting in receive events being scheduled at the
>> UDP echo servers. The receive events are executed resulting in reply events
>> being scheduled. Every event being scheduled is a gen_server:call on the
>> nsime_simulator process.
>> The parallelism I am trying to exploit right now is in the form of
>> simultaneous events (events having same time stamp). All the 10k send
>> events are simultaneous, the 10k receives are simultaneous as the point to
>> point connections between each pair has the same delay and data rate and
>> the 10k reply events are simultaneous.
>> The parallel execution of the simultaneous events is done using the
>> following code fragment from src/nsime_simulator.erl. The gen_server:call
>> returns a list of simultaneous events and Stephen Marsh's plists module (
>> https://github.com/rmies/plists) is used to execute each event (an MFA
>> triple) in the list by dividing the list among 5 processes. This was for a
>> quad core machine. The plists code suggests using  number of cores + 1.
>> Each time an event is executed it might result in another event being
>> scheduled which itself is a gen_server:call to nsime_simulator.
>> parallel_run() ->
>>     case gen_server:call(?MODULE, parallel_run) of
>>         {events, EventList} ->
>>             plists:foreach(
>>                 fun(Event) ->
>>                     erlang:apply(
>>                         Event#nsime_event.module,
>>                         Event#nsime_event.function,
>>                         Event#nsime_event.arguments
>>                     )
>>                 end,
>>                 EventList,
>>                 {processes, 5}
>>             ),
>>             ?MODULE:parallel_run();
>>         none ->
>>             simulation_complete
>>     end.
>> On a quad core machine, the ns-3 C++ code for this example runs in 59
>> seconds while occupying only one core. The NSIME code runs in 65 seconds
>> while occupying about 300% out of a maximum of 400%. That doesn't sound too
>> bad. Another non-trivial C++ vs Erlang benchmark, one can argue. But the
>> NSIME code which does not execute the simultaneous events in parallel runs
>> in 79 seconds while occupying only one core . So I am seeing sublinear
>> speedups.
>> The even more surprising thing is that on a 32 core machine the
>> parallel_run runs in 68 seconds when the simultaneous event list is divided
>> among 32 processes. Since the 32 core and quad core had different CPU
>> speeds I ran the simulation on the 32 core machine with the simultaneous
>> event list divided among 5 processes and that took 66 seconds. So I
>> actually saw a slowdown when I use more cores. When I divided the event
>> list among 32 processes, only 30-40% of each core was being utilized (as
>> seen in htop).
>> I hope my description makes some sense. My gut feeling is that the single
>> nsime_simulator is the bottleneck. I will try to confirm it using the
>> Erlang profiling tools. I was hoping pg or pg2 can provide a solution by
>> distributing the message handling workload. I could schedule events in a
>> randomly chosen process in the process group and then collect the earliest
>> events from all the process group members. But to choose a random process I
>> may have to use pg2:get_closest_pid which itself may cause a bottleneck.
>> sarva
>> On Fri, Dec 14, 2012 at 9:35 PM, Daniel Luna <daniel@REDACTED> wrote:
>>> On 14 December 2012 09:51, Garrett Smith <g@REDACTED> wrote:
>>> > On Fri, Dec 14, 2012 at 8:47 AM, Garrett Smith <g@REDACTED> wrote:
>>> >> Hi Saravanan,
>>> >> If you're bottlenecking on CPU (all your cores are fully utilized at
>>> >> peak load) then you need either a faster machine or you'll need to
>>> >> distribute your application to multiple machines.
>>> >
>>> > I should add there a number of ways you can improve efficiency, short
>>> > of adding hardware resources. The big win, once you understand what to
>>> > target, is C ports (or NIFs).
>>> But long before you start looking at either NIFs or new hardware, look
>>> into the complexity of the code itself.  Does that expensive function
>>> even have to be called in all cases, etc.
>>> I guess this is my pet peeve when it comes to optimizing anything.
>>> First you optimize for readability (often gaining speed or at least
>>> finding issues)
>>> Then you measure
>>> Then you optimize the hotspots you've discovered by measuring
>>> *Then* you can start looking into hardware or non-Erlang solutions
>>> I've also seen situations where minor changes in the requirements have
>>> seen the possibility to speed up code by a factor 10 so that's also a
>>> possibility.
>>> Cheers,
>>> Daniel
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20121215/377d8c06/attachment.htm>

More information about the erlang-questions mailing list