[erlang-questions] gen_server bottleneck

Sat Dec 15 12:08:57 CET 2012

Surely the expressive power of erlang is a justification if the
performance is 'good enough'?

One way to serialise inserts into the work queue is to use an
ordered_set ets table with a unique index such as the return value of
now(). Also look into ets:select_delete to take matching tasks out of
the queue. I haven't looked at the code in great detail, but I suspect
these might be useful to you Saravanan.

cheers
Chandru

On 15 December 2012 10:07, Max Bourinov <bourinov@REDACTED> wrote:
> Then I don't see advantage of Erlang here. Is it possible to distribute work
> that nsime_simulator does over several nodes?
>
> If not - I afraid there is no reason to use Erlang. If yes - load must be
> distributed.
>
>
> On Sat, Dec 15, 2012 at 1:45 PM, Saravanan Vijayakumaran <sarva.v@REDACTED>
> wrote:
>>
>> Yes. That is correct. The nsime_simulator gen_server is the central
>> authority which decides the event ordering.
>>
>>
>> On Sat, Dec 15, 2012 at 12:00 PM, Max Bourinov <bourinov@REDACTED> wrote:
>>>
>>> Hi Saravanan,
>>>
>>> Thank you for explanations. Please correct me if I am wrong: neglecting
>>> the fact that you have many processes you have to execute event one by one
>>> on one gen_server because ir requires by simulation approach. Is that
>>> correct?
>>>
>>> Best regards,
>>> Max
>>>
>>>
>>> On Sat, Dec 15, 2012 at 6:22 AM, Saravanan Vijayakumaran
>>> <sarva.v@REDACTED> wrote:
>>>>
>>>> Dear Max,
>>>>                 There are in fact 10,000 client and server processes.
>>>> Each of them is a gen_server [1][2]. These are just simulation entities
>>>> which are connected by point-to-point channels [3] which are also a
>>>> simulation entities.
>>>>
>>>> The purpose of network simulation is to guide decision making with
>>>> respect to choice of network protocols and parameters in a particular
>>>> scenario. The scenarios are typically too expensive to try out in real
>>>> networks. Imagine a thousand node network placed in different cities.
>>>>
>>>> Regarding discrete event simulation, suppose we want to figure out how
>>>> many runways to have in a new airport. The more the runways the more costly
>>>> it is but it is also will result in less waiting time for arriving/departing
>>>> flights. One way to characterize the runway count vs waiting time tradeoff
>>>> is to use queueing theory. An alternative is to do simulate flight arrivals
>>>> and departures. At any point of time, the simulated world will consists of
>>>> flights in transit, flights waiting to depart, flights waiting to land,
>>>> flights departing and flights landing. In discrete time simulation, the
>>>> state of the simulated world is updated at a regular time step (say every
>>>> minute). Note that this is simulated time and not wall-clock time. An
>>>> alternative which is much faster to execute is discrete event simulation.
>>>> Here there is a queue of simulation events with the earliest events at the
>>>> head of the queue. The events are executed one by one by picking the event
>>>> at the head of the  queue. When a flight departure event is executed, a
>>>> waiting to land event is inserted into the queue having a timestamp equal to
>>>> current time + the transit time of the flight. When the waiting to land
>>>> event is executed, if a runway is free a land event is inserted into the
>>>> queue having timestamp equal to the current time + landing time. If a runway
>>>> is not free, a circle the airport and check again event is inserted into the
>>>> queue having timestamp equal to the current time + circling time. The
>>>> purpose of the simulation queue is to provide causal ordering of the
>>>> simulation events.
>>>>
>>>> In my case, I think it is the simulation queue which is becoming the
>>>> bottleneck. I must add that network simulation falls under the category of
>>>> number-crunching applications which Erlang is not supposed to be suited for.
>>>> But people have been trying to accelerate it in C++-based simulators using
>>>> pthreads. I thought it was worth a try in Erlang given the amount of work
>>>> which has already gone into Erlang SMP.
>>>>
>>>> Regards
>>>> sarva
>>>>
>>>> [1]
>>>> https://github.com/avras/nsime/blob/master/src/nsime_udp_echo_client.erl
>>>> [2]
>>>> https://github.com/avras/nsime/blob/master/src/nsime_udp_echo_server.erl
>>>> [3] https://github.com/avras/nsime/blob/master/src/nsime_ptp_channel.erl
>>>>
>>>>
>>>>
>>>> On Sat, Dec 15, 2012 at 1:42 AM, Max Bourinov <bourinov@REDACTED>
>>>> wrote:
>>>>>
>>>>> Hi guys,
>>>>>
>>>>> Saravanan, maybe I am missing the point, but why not to create 10.000
>>>>> clients and servers processes? In the code you provided I don't see much
>>>>> process creation...
>>>>>
>>>>> Described load is relatively small and you should not see any
>>>>> performance issues with this amount of UDP traffic.
>>>>>
>>>>> p.s.: I am not familiar with discrete system simulation, so maybe my
>>>>> view is absolutely wrong))
>>>>>
>>>>> Best regards,
>>>>> Max
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Dec 14, 2012 at 11:21 PM, Saravanan Vijayakumaran
>>>>> <sarva.v@REDACTED> wrote:
>>>>>>
>>>>>> Dear Garrett, Daniel and Olivier,
>>>>>>                                               Thanks for your
>>>>>> responses. I will try your suggestions.
>>>>>>
>>>>>> @Olivier: Thanks for sending the Sim-Diasca sources and docs earlier.
>>>>>> The ns-3 simulator which is the basis for NSIME is discrete event. Since
>>>>>> Sim-Diasca was discrete time, I put it on the backburner and haven't got a
>>>>>> chance to check it out.
>>>>>>
>>>>>> @All: Let me elaborate a little on the bottleneck I am seeing.
>>>>>>
>>>>>> The simulation scenario consists of 10,000 UDP echo client server
>>>>>> pairs with a pairwise point to point connection between them. Each UDP echo
>>>>>> client sends 10 packets to a unique UDP server with a inter-packet time of
>>>>>> 1s. Each UDP server replies to every packet it receives. In terms of the
>>>>>> simulation, the 10k UDP echo clients each schedule a packet send event using
>>>>>> a gen_server:call on nsime_simulator. When the simulation is run, the send
>>>>>> events are executed resulting in receive events being scheduled at the UDP
>>>>>> echo servers. The receive events are executed resulting in reply events
>>>>>> being scheduled. Every event being scheduled is a gen_server:call on the
>>>>>> nsime_simulator process.
>>>>>>
>>>>>> The parallelism I am trying to exploit right now is in the form of
>>>>>> simultaneous events (events having same time stamp). All the 10k send events
>>>>>> are simultaneous, the 10k receives are simultaneous as the point to point
>>>>>> connections between each pair has the same delay and data rate and the 10k
>>>>>> reply events are simultaneous.
>>>>>>
>>>>>> The parallel execution of the simultaneous events is done using the
>>>>>> following code fragment from src/nsime_simulator.erl. The gen_server:call
>>>>>> returns a list of simultaneous events and Stephen Marsh's plists module
>>>>>> (https://github.com/rmies/plists) is used to execute each event (an MFA
>>>>>> triple) in the list by dividing the list among 5 processes. This was for a
>>>>>> quad core machine. The plists code suggests using  number of cores + 1. Each
>>>>>> time an event is executed it might result in another event being scheduled
>>>>>> which itself is a gen_server:call to nsime_simulator.
>>>>>>
>>>>>> parallel_run() ->
>>>>>>     case gen_server:call(?MODULE, parallel_run) of
>>>>>>         {events, EventList} ->
>>>>>>             plists:foreach(
>>>>>>                 fun(Event) ->
>>>>>>                     erlang:apply(
>>>>>>                         Event#nsime_event.module,
>>>>>>                         Event#nsime_event.function,
>>>>>>                         Event#nsime_event.arguments
>>>>>>                     )
>>>>>>                 end,
>>>>>>                 EventList,
>>>>>>                 {processes, 5}
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             ),
>>>>>>             ?MODULE:parallel_run();
>>>>>>         none ->
>>>>>>             simulation_complete
>>>>>>     end.
>>>>>>
>>>>>>
>>>>>> On a quad core machine, the ns-3 C++ code for this example runs in 59
>>>>>> seconds while occupying only one core. The NSIME code runs in 65 seconds
>>>>>> while occupying about 300% out of a maximum of 400%. That doesn't sound too
>>>>>> bad. Another non-trivial C++ vs Erlang benchmark, one can argue. But the
>>>>>> NSIME code which does not execute the simultaneous events in parallel runs
>>>>>> in 79 seconds while occupying only one core . So I am seeing sublinear
>>>>>> speedups.
>>>>>>
>>>>>> The even more surprising thing is that on a 32 core machine the
>>>>>> parallel_run runs in 68 seconds when the simultaneous event list is divided
>>>>>> among 32 processes. Since the 32 core and quad core had different CPU speeds
>>>>>> I ran the simulation on the 32 core machine with the simultaneous event list
>>>>>> divided among 5 processes and that took 66 seconds. So I actually saw a
>>>>>> slowdown when I use more cores. When I divided the event list among 32
>>>>>> processes, only 30-40% of each core was being utilized (as seen in htop).
>>>>>>
>>>>>> I hope my description makes some sense. My gut feeling is that the
>>>>>> single nsime_simulator is the bottleneck. I will try to confirm it using the
>>>>>> Erlang profiling tools. I was hoping pg or pg2 can provide a solution by
>>>>>> distributing the message handling workload. I could schedule events in a
>>>>>> randomly chosen process in the process group and then collect the earliest
>>>>>> events from all the process group members. But to choose a random process I
>>>>>> may have to use pg2:get_closest_pid which itself may cause a bottleneck.
>>>>>>
>>>>>> sarva
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Dec 14, 2012 at 9:35 PM, Daniel Luna <daniel@REDACTED> wrote:
>>>>>>>
>>>>>>> On 14 December 2012 09:51, Garrett Smith <g@REDACTED> wrote:
>>>>>>> > On Fri, Dec 14, 2012 at 8:47 AM, Garrett Smith <g@REDACTED> wrote:
>>>>>>> >> Hi Saravanan,
>>>>>>> >> If you're bottlenecking on CPU (all your cores are fully utilized
>>>>>>> >> at
>>>>>>> >> peak load) then you need either a faster machine or you'll need to
>>>>>>> >> distribute your application to multiple machines.
>>>>>>> >
>>>>>>> > I should add there a number of ways you can improve efficiency,
>>>>>>> > short
>>>>>>> > of adding hardware resources. The big win, once you understand what
>>>>>>> > to
>>>>>>> > target, is C ports (or NIFs).
>>>>>>>
>>>>>>> But long before you start looking at either NIFs or new hardware,
>>>>>>> look
>>>>>>> into the complexity of the code itself.  Does that expensive function
>>>>>>> even have to be called in all cases, etc.
>>>>>>>
>>>>>>> I guess this is my pet peeve when it comes to optimizing anything.
>>>>>>>
>>>>>>> First you optimize for readability (often gaining speed or at least
>>>>>>> finding issues)
>>>>>>> Then you measure
>>>>>>> Then you optimize the hotspots you've discovered by measuring
>>>>>>> *Then* you can start looking into hardware or non-Erlang solutions
>>>>>>>
>>>>>>> I've also seen situations where minor changes in the requirements
>>>>>>> have
>>>>>>> seen the possibility to speed up code by a factor 10 so that's also a
>>>>>>> possibility.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Daniel
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> erlang-questions mailing list
>>>>>> erlang-questions@REDACTED
>>>>>> http://erlang.org/mailman/listinfo/erlang-questions
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
>