[erlang-questions] gen_server bottleneck

Max Bourinov bourinov@REDACTED
Sat Dec 15 11:07:46 CET 2012


Then I don't see the advantage of Erlang here. Is it possible to distribute
the work that nsime_simulator does over several nodes?

If not, I'm afraid there is no reason to use Erlang. If yes, the load must
be distributed.


On Sat, Dec 15, 2012 at 1:45 PM, Saravanan Vijayakumaran
<sarva.v@REDACTED>wrote:

> Yes. That is correct. The nsime_simulator gen_server is the central
> authority which decides the event ordering.
>
>
> On Sat, Dec 15, 2012 at 12:00 PM, Max Bourinov <bourinov@REDACTED> wrote:
>
>> Hi Saravanan,
>>
>> Thank you for the explanations. Please correct me if I am wrong: setting
>> aside the fact that you have many processes, you have to execute events
>> one by one on one gen_server because that is required by the simulation
>> approach. Is that correct?
>>
>> Best regards,
>> Max
>>
>>
>> On Sat, Dec 15, 2012 at 6:22 AM, Saravanan Vijayakumaran <
>> sarva.v@REDACTED> wrote:
>>
>>> Dear Max,
>>>                 There are in fact 10,000 client and server processes.
>>> Each of them is a gen_server [1][2]. These are just simulation entities
>>> which are connected by point-to-point channels [3], which are also
>>> simulation entities.
>>>
>>> The purpose of network simulation is to guide decision making with
>>> respect to the choice of network protocols and parameters in a particular
>>> scenario. The scenarios are typically too expensive to try out in real
>>> networks. Imagine a thousand-node network with nodes placed in different
>>> cities.
>>>
>>> Regarding discrete event simulation, suppose we want to figure out how
>>> many runways to have in a new airport. The more runways there are, the
>>> more costly it is, but it will also result in less waiting time for
>>> arriving/departing flights. One way to characterize the runway count vs
>>> waiting time tradeoff is to use queueing theory. An alternative is to
>>> simulate flight arrivals and departures. At any point in time, the
>>> simulated world will consist of flights in transit, flights waiting to
>>> depart, flights waiting to land, flights departing and flights landing. In
>>> discrete time simulation, the state of the simulated world is updated at a
>>> regular time step (say every minute). Note that this is simulated time and
>>> not wall-clock time. An alternative which is much faster to execute is
>>> discrete event simulation. Here there is a queue of simulation events with
>>> the earliest events at the head of the queue. The events are executed one
>>> by one by picking the event at the head of the queue. When a flight
>>> departure event is executed, a waiting-to-land event is inserted into the
>>> queue with a timestamp equal to the current time + the transit time of the
>>> flight. When the waiting-to-land event is executed, if a runway is free a
>>> land event is inserted into the queue with a timestamp equal to the
>>> current time + landing time. If a runway is not free, a "circle the
>>> airport and check again" event is inserted into the queue with a timestamp
>>> equal to the current time + circling time. The purpose of the simulation
>>> queue is to provide causal ordering of the simulation events.
>>>
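>>> For concreteness, here is a minimal sketch of such an event queue in
>>> Erlang (illustrative only; it assumes events are plain {Module, Function,
>>> Arguments} triples, that each callback receives the current simulated time
>>> and returns a list of {Time, MFA} pairs to schedule next, and none of this
>>> is the actual nsime code):
>>>
>>> -module(des_queue).
>>> -export([new/0, schedule/3, run/1]).
>>>
>>> %% The queue is a gb_tree keyed on {Timestamp, unique ref}, so the
>>> %% earliest event sits at the smallest key and equal timestamps do not
>>> %% collide.
>>> new() ->
>>>     gb_trees:empty().
>>>
>>> schedule(Time, MFA = {_M, _F, _A}, Queue) ->
>>>     gb_trees:insert({Time, make_ref()}, MFA, Queue).
>>>
>>> %% Pop and execute events in timestamp order. Each executed event returns
>>> %% new {Time, MFA} pairs, e.g. a departure schedules a landing attempt at
>>> %% current time + transit time.
>>> run(Queue) ->
>>>     case gb_trees:is_empty(Queue) of
>>>         true ->
>>>             simulation_complete;
>>>         false ->
>>>             {{Time, _Ref}, {M, F, A}, Rest} = gb_trees:take_smallest(Queue),
>>>             NewEvents = erlang:apply(M, F, [Time | A]),
>>>             run(lists:foldl(
>>>                   fun({T, NewMFA}, Q) -> schedule(T, NewMFA, Q) end,
>>>                   Rest,
>>>                   NewEvents))
>>>     end.
>>>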
>>> In my case, I think it is the simulation queue which is becoming the
>>> bottleneck. I must add that network simulation falls under the category of
>>> number-crunching applications which Erlang is not supposed to be suited
>>> for. But people have been trying to accelerate it in C++-based simulators
>>> using pthreads. I thought it was worth a try in Erlang given the amount of
>>> work which has already gone into Erlang SMP.
>>>
>>> Regards
>>> sarva
>>>
>>> [1]
>>> https://github.com/avras/nsime/blob/master/src/nsime_udp_echo_client.erl
>>> [2]
>>> https://github.com/avras/nsime/blob/master/src/nsime_udp_echo_server.erl
>>> [3] https://github.com/avras/nsime/blob/master/src/nsime_ptp_channel.erl
>>>
>>>
>>>
>>> On Sat, Dec 15, 2012 at 1:42 AM, Max Bourinov <bourinov@REDACTED>wrote:
>>>
>>>> Hi guys,
>>>>
>>>> Saravanan, maybe I am missing the point, but why not create 10,000
>>>> client and server processes? In the code you provided I don't see much
>>>> process creation...
>>>>
>>>> The described load is relatively small and you should not see any
>>>> performance issues with this amount of UDP traffic.
>>>>
>>>> p.s.: I am not familiar with discrete event simulation, so maybe my
>>>> view is completely wrong))
>>>>
>>>> Best regards,
>>>> Max
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Dec 14, 2012 at 11:21 PM, Saravanan Vijayakumaran <
>>>> sarva.v@REDACTED> wrote:
>>>>
>>>>> Dear Garrett, Daniel and Olivier,
>>>>>                                               Thanks for your
>>>>> responses. I will try your suggestions.
>>>>>
>>>>> @Olivier: Thanks for sending the Sim-Diasca sources and docs earlier.
>>>>> The ns-3 simulator, which is the basis for NSIME, is discrete event.
>>>>> Since Sim-Diasca was discrete time, I put it on the back burner and
>>>>> haven't had a chance to check it out.
>>>>>
>>>>> @All: Let me elaborate a little on the bottleneck I am seeing.
>>>>>
>>>>> The simulation scenario consists of 10,000 UDP echo client-server
>>>>> pairs with a pairwise point-to-point connection between them. Each UDP
>>>>> echo client sends 10 packets to a unique UDP server with an inter-packet
>>>>> time of 1s. Each UDP server replies to every packet it receives. In terms
>>>>> of the simulation, the 10k UDP echo clients each schedule a packet send
>>>>> event using a gen_server:call on nsime_simulator. When the simulation is
>>>>> run, the send events are executed, resulting in receive events being
>>>>> scheduled at the UDP echo servers. The receive events are executed,
>>>>> resulting in reply events being scheduled. Every event that is scheduled
>>>>> is a gen_server:call on the nsime_simulator process.
>>>>>
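>>>>> Roughly, each such scheduling call looks like the sketch below
>>>>> (illustrative only: the module/function/arguments record fields match the
>>>>> code further down, but the time field, the call tag and the callback name
>>>>> here are only indicative, not the exact nsime API):
>>>>>
>>>>> %% Sketch: a send event is an MFA triple wrapped in an #nsime_event{}
>>>>> %% record and handed to the single central simulator process.
>>>>> schedule_send(ClientPid, Time) ->
>>>>>     Event = #nsime_event{
>>>>>         time = Time,                  % simulated time (assumed field)
>>>>>         module = nsime_udp_echo_client,
>>>>>         function = send_packet,       % assumed callback name
>>>>>         arguments = [ClientPid]
>>>>>     },
>>>>>     gen_server:call(nsime_simulator, {schedule_event, Event}).
>>>>>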
>>>>> The parallelism I am trying to exploit right now is in the form of
>>>>> simultaneous events (events having the same timestamp). All the 10k send
>>>>> events are simultaneous, the 10k receives are simultaneous since the
>>>>> point-to-point connections between each pair have the same delay and data
>>>>> rate, and the 10k reply events are simultaneous.
>>>>>
>>>>> The parallel execution of the simultaneous events is done using the
>>>>> following code fragment from src/nsime_simulator.erl. The gen_server:call
>>>>> returns a list of simultaneous events and Stephen Marsh's plists module
>>>>> (https://github.com/rmies/plists) is used to execute each event (an MFA
>>>>> triple) in the list by dividing the list among 5 processes. This was for
>>>>> a quad-core machine. The plists code suggests using the number of cores +
>>>>> 1. Each time an event is executed, it might result in another event being
>>>>> scheduled, which is itself a gen_server:call to nsime_simulator.
>>>>>
>>>>> parallel_run() ->
>>>>>     case gen_server:call(?MODULE, parallel_run) of
>>>>>         {events, EventList} ->
>>>>>             plists:foreach(
>>>>>                 fun(Event) ->
>>>>>                     erlang:apply(
>>>>>                         Event#nsime_event.module,
>>>>>                         Event#nsime_event.function,
>>>>>                         Event#nsime_event.arguments
>>>>>                     )
>>>>>                 end,
>>>>>                 EventList,
>>>>>                 {processes, 5}
>>>>>             ),
>>>>>             ?MODULE:parallel_run();
>>>>>         none ->
>>>>>             simulation_complete
>>>>>     end.
>>>>>
>>>>>
>>>>> On a quad-core machine, the ns-3 C++ code for this example runs in 59
>>>>> seconds while occupying only one core. The NSIME code runs in 65 seconds
>>>>> while occupying about 300% out of a maximum of 400%. That doesn't sound
>>>>> too bad. Another non-trivial C++ vs Erlang benchmark, one can argue. But
>>>>> the NSIME code which does not execute the simultaneous events in parallel
>>>>> runs in 79 seconds while occupying only one core. So I am seeing sublinear
>>>>> speedups.
>>>>>
>>>>> The even more surprising thing is that on a 32-core machine the
>>>>> parallel_run runs in 68 seconds when the simultaneous event list is
>>>>> divided among 32 processes. Since the 32-core and quad-core machines had
>>>>> different CPU speeds, I ran the simulation on the 32-core machine with the
>>>>> simultaneous event list divided among 5 processes and that took 66
>>>>> seconds. So I actually saw a slowdown when I used more cores. When I
>>>>> divided the event list among 32 processes, only 30-40% of each core was
>>>>> being utilized (as seen in htop).
>>>>>
>>>>> I hope my description makes some sense. My gut feeling is that the
>>>>> single nsime_simulator is the bottleneck. I will try to confirm it using
>>>>> the Erlang profiling tools. I was hoping pg or pg2 could provide a
>>>>> solution by distributing the message-handling workload. I could schedule
>>>>> events in a randomly chosen process in the process group and then collect
>>>>> the earliest events from all the process group members. But to choose a
>>>>> random process I may have to use pg2:get_closest_pid, which may itself
>>>>> become a bottleneck.
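>>>>>
>>>>> Something along these lines is what I have in mind (a sketch only,
>>>>> assuming each group member is a gen_server holding its own event queue;
>>>>> the group name and the call tags are made up):
>>>>>
>>>>> -define(GROUP, nsime_event_queues).
>>>>>
>>>>> schedule(Event) ->
>>>>>     %% pg2:get_closest_pid/1 prefers a random local member, but this
>>>>>     %% lookup could itself become the new point of contention.
>>>>>     Pid = pg2:get_closest_pid(?GROUP),
>>>>>     gen_server:call(Pid, {schedule, Event}).
>>>>>
>>>>> earliest_events() ->
>>>>>     %% Ask every member for its earliest timestamp, then gather the
>>>>>     %% events sharing the globally smallest one.
>>>>>     Members = pg2:get_members(?GROUP),
>>>>>     Times = [gen_server:call(P, next_time) || P <- Members],
>>>>>     case lists:min(Times) of
>>>>>         infinity -> none;
>>>>>         T -> lists:append(
>>>>>                  [gen_server:call(P, {take_events, T}) || P <- Members])
>>>>>     end.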
>>>>>
>>>>> sarva
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Dec 14, 2012 at 9:35 PM, Daniel Luna <daniel@REDACTED> wrote:
>>>>>
>>>>>> On 14 December 2012 09:51, Garrett Smith <g@REDACTED> wrote:
>>>>>> > On Fri, Dec 14, 2012 at 8:47 AM, Garrett Smith <g@REDACTED> wrote:
>>>>>> >> Hi Saravanan,
>>>>>> >> If you're bottlenecking on CPU (all your cores are fully utilized at
>>>>>> >> peak load) then you need either a faster machine or you'll need to
>>>>>> >> distribute your application to multiple machines.
>>>>>> >
>>>>>> > I should add there are a number of ways you can improve efficiency,
>>>>>> > short of adding hardware resources. The big win, once you understand
>>>>>> > what to target, is C ports (or NIFs).
>>>>>>
>>>>>> But long before you start looking at either NIFs or new hardware, look
>>>>>> into the complexity of the code itself.  Does that expensive function
>>>>>> even have to be called in all cases, etc.
>>>>>>
>>>>>> I guess this is my pet peeve when it comes to optimizing anything.
>>>>>>
>>>>>> First you optimize for readability (often gaining speed or at least
>>>>>> finding issues)
>>>>>> Then you measure
>>>>>> Then you optimize the hotspots you've discovered by measuring
>>>>>> *Then* you can start looking into hardware or non-Erlang solutions
>>>>>>
>>>>>> I've also seen situations where minor changes in the requirements have
>>>>>> made it possible to speed up the code by a factor of 10, so that's also
>>>>>> a possibility.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Daniel
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>