[erlang-questions] gen_server bottleneck

Sat Dec 15 07:30:57 CET 2012

Hi Saravanan,

Thank you for explanations. Please correct me if I am wrong: neglecting the
fact that you have many processes you have to execute event one by one on
one gen_server because ir requires by simulation approach. Is that correct?

Best regards,
Max

On Sat, Dec 15, 2012 at 6:22 AM, Saravanan Vijayakumaran
<sarva.v@REDACTED>wrote:

> Dear Max,
>                 There are in fact 10,000 client and server processes. Each
> of them is a gen_server [1][2]. These are just simulation entities which
> are connected by point-to-point channels [3] which are also a simulation
> entities.
>
> The purpose of network simulation is to guide decision making with respect
> to choice of network protocols and parameters in a particular scenario. The
> scenarios are typically too expensive to try out in real networks. Imagine
> a thousand node network placed in different cities.
>
> Regarding discrete event simulation, suppose we want to figure out how
> many runways to have in a new airport. The more the runways the more costly
> it is but it is also will result in less waiting time for
> arriving/departing flights. One way to characterize the runway count vs
> waiting time tradeoff is to use queueing theory. An alternative is to do
> simulate flight arrivals and departures. At any point of time, the
> simulated world will consists of flights in transit, flights waiting to
> depart, flights waiting to land, flights departing and flights landing. In
> discrete time simulation, the state of the simulated world is updated at a
> regular time step (say every minute). Note that this is simulated time and
> not wall-clock time. An alternative which is much faster to execute is
> discrete event simulation. Here there is a queue of simulation events with
> the earliest events at the head of the queue. The events are executed one
> by one by picking the event at the head of the  queue. When a flight
> departure event is executed, a waiting to land event is inserted into the
> queue having a timestamp equal to current time + the transit time of the
> flight. When the waiting to land event is executed, if a runway is free a
> land event is inserted into the queue having timestamp equal to the current
> time + landing time. If a runway is not free, a circle the airport and
> check again event is inserted into the queue having timestamp equal to the
> current time + circling time. The purpose of the simulation queue is to
> provide causal ordering of the simulation events.
>
> In my case, I think it is the simulation queue which is becoming the
> bottleneck. I must add that network simulation falls under the category of
> number-crunching applications which Erlang is not supposed to be suited
> for. But people have been trying to accelerate it in C++-based simulators
> using pthreads. I thought it was worth a try in Erlang given the amount of
> work which has already gone into Erlang SMP.
>
> Regards
> sarva
>
> [1]
> https://github.com/avras/nsime/blob/master/src/nsime_udp_echo_client.erl
> [2]
> https://github.com/avras/nsime/blob/master/src/nsime_udp_echo_server.erl
> [3] https://github.com/avras/nsime/blob/master/src/nsime_ptp_channel.erl
>
>
>
> On Sat, Dec 15, 2012 at 1:42 AM, Max Bourinov <bourinov@REDACTED> wrote:
>
>> Hi guys,
>>
>> Saravanan, maybe I am missing the point, but why not to create 10.000
>> clients and servers processes? In the code you provided I don't see much
>> process creation...
>>
>> Described load is relatively small and you should not see any performance
>> issues with this amount of UDP traffic.
>>
>> p.s.: I am not familiar with discrete system simulation, so maybe my view
>> is absolutely wrong))
>>
>> Best regards,
>> Max
>>
>>
>>
>>
>> On Fri, Dec 14, 2012 at 11:21 PM, Saravanan Vijayakumaran <
>> sarva.v@REDACTED> wrote:
>>
>>> Dear Garrett, Daniel and Olivier,
>>>                                               Thanks for your responses.
>>> I will try your suggestions.
>>>
>>> @Olivier: Thanks for sending the Sim-Diasca sources and docs earlier.
>>> The ns-3 simulator which is the basis for NSIME is discrete event. Since
>>> Sim-Diasca was discrete time, I put it on the backburner and haven't got a
>>> chance to check it out.
>>>
>>> @All: Let me elaborate a little on the bottleneck I am seeing.
>>>
>>> The simulation scenario consists of 10,000 UDP echo client server pairs
>>> with a pairwise point to point connection between them. Each UDP echo
>>> client sends 10 packets to a unique UDP server with a inter-packet time of
>>> 1s. Each UDP server replies to every packet it receives. In terms of the
>>> simulation, the 10k UDP echo clients each schedule a packet send event
>>> using a gen_server:call on nsime_simulator. When the simulation is run, the
>>> send events are executed resulting in receive events being scheduled at the
>>> UDP echo servers. The receive events are executed resulting in reply events
>>> being scheduled. Every event being scheduled is a gen_server:call on the
>>> nsime_simulator process.
>>>
>>> The parallelism I am trying to exploit right now is in the form of
>>> simultaneous events (events having same time stamp). All the 10k send
>>> events are simultaneous, the 10k receives are simultaneous as the point to
>>> point connections between each pair has the same delay and data rate and
>>> the 10k reply events are simultaneous.
>>>
>>> The parallel execution of the simultaneous events is done using the
>>> following code fragment from src/nsime_simulator.erl. The gen_server:call
>>> returns a list of simultaneous events and Stephen Marsh's plists module (
>>> https://github.com/rmies/plists) is used to execute each event (an MFA
>>> triple) in the list by dividing the list among 5 processes. This was for a
>>> quad core machine. The plists code suggests using  number of cores + 1.
>>> Each time an event is executed it might result in another event being
>>> scheduled which itself is a gen_server:call to nsime_simulator.
>>>
>>> parallel_run() ->
>>>     case gen_server:call(?MODULE, parallel_run) of
>>>         {events, EventList} ->
>>>             plists:foreach(
>>>                 fun(Event) ->
>>>                     erlang:apply(
>>>                         Event#nsime_event.module,
>>>                         Event#nsime_event.function,
>>>                         Event#nsime_event.arguments
>>>                     )
>>>
>>>                 end,
>>>                 EventList,
>>>                 {processes, 5}
>>>
>>>
>>>
>>>
>>>
>>>             ),
>>>             ?MODULE:parallel_run();
>>>         none ->
>>>             simulation_complete
>>>     end.
>>>
>>>
>>> On a quad core machine, the ns-3 C++ code for this example runs in 59
>>> seconds while occupying only one core. The NSIME code runs in 65 seconds
>>> while occupying about 300% out of a maximum of 400%. That doesn't sound too
>>> bad. Another non-trivial C++ vs Erlang benchmark, one can argue. But the
>>> NSIME code which does not execute the simultaneous events in parallel runs
>>> in 79 seconds while occupying only one core . So I am seeing sublinear
>>> speedups.
>>>
>>> The even more surprising thing is that on a 32 core machine the
>>> parallel_run runs in 68 seconds when the simultaneous event list is divided
>>> among 32 processes. Since the 32 core and quad core had different CPU
>>> speeds I ran the simulation on the 32 core machine with the simultaneous
>>> event list divided among 5 processes and that took 66 seconds. So I
>>> actually saw a slowdown when I use more cores. When I divided the event
>>> list among 32 processes, only 30-40% of each core was being utilized (as
>>> seen in htop).
>>>
>>> I hope my description makes some sense. My gut feeling is that the
>>> single nsime_simulator is the bottleneck. I will try to confirm it using
>>> the Erlang profiling tools. I was hoping pg or pg2 can provide a solution
>>> by distributing the message handling workload. I could schedule events in a
>>> randomly chosen process in the process group and then collect the earliest
>>> events from all the process group members. But to choose a random process I
>>> may have to use pg2:get_closest_pid which itself may cause a bottleneck.
>>>
>>> sarva
>>>
>>>
>>>
>>>
>>> On Fri, Dec 14, 2012 at 9:35 PM, Daniel Luna <daniel@REDACTED> wrote:
>>>
>>>> On 14 December 2012 09:51, Garrett Smith <g@REDACTED> wrote:
>>>> > On Fri, Dec 14, 2012 at 8:47 AM, Garrett Smith <g@REDACTED> wrote:
>>>> >> Hi Saravanan,
>>>> >> If you're bottlenecking on CPU (all your cores are fully utilized at
>>>> >> peak load) then you need either a faster machine or you'll need to
>>>> >> distribute your application to multiple machines.
>>>> >
>>>> > I should add there a number of ways you can improve efficiency, short
>>>> > of adding hardware resources. The big win, once you understand what to
>>>> > target, is C ports (or NIFs).
>>>>
>>>> But long before you start looking at either NIFs or new hardware, look
>>>> into the complexity of the code itself.  Does that expensive function
>>>> even have to be called in all cases, etc.
>>>>
>>>> I guess this is my pet peeve when it comes to optimizing anything.
>>>>
>>>> First you optimize for readability (often gaining speed or at least
>>>> finding issues)
>>>> Then you measure
>>>> Then you optimize the hotspots you've discovered by measuring
>>>> *Then* you can start looking into hardware or non-Erlang solutions
>>>>
>>>> I've also seen situations where minor changes in the requirements have
>>>> seen the possibility to speed up code by a factor 10 so that's also a
>>>> possibility.
>>>>
>>>> Cheers,
>>>>
>>>> Daniel
>>>>
>>>
>>>
>>> _______________________________________________
>>> erlang-questions mailing list
>>> erlang-questions@REDACTED
>>> http://erlang.org/mailman/listinfo/erlang-questions
>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20121215/5e9a587e/attachment.htm>