[erlang-questions] gen_server bottleneck
Saravanan Vijayakumaran
sarva.v@REDACTED
Sat Dec 15 10:45:28 CET 2012
Yes. That is correct. The nsime_simulator gen_server is the central
authority which decides the event ordering.
On Sat, Dec 15, 2012 at 12:00 PM, Max Bourinov <bourinov@REDACTED> wrote:
> Hi Saravanan,
>
> Thank you for explanations. Please correct me if I am wrong: neglecting
> the fact that you have many processes you have to execute event one by one
> on one gen_server because ir requires by simulation approach. Is that
> correct?
>
> Best regards,
> Max
>
>
> On Sat, Dec 15, 2012 at 6:22 AM, Saravanan Vijayakumaran <
> sarva.v@REDACTED> wrote:
>
>> Dear Max,
>> There are in fact 10,000 client and server processes.
>> Each of them is a gen_server [1][2]. These are just simulation entities
>> which are connected by point-to-point channels [3] which are also a
>> simulation entities.
>>
>> The purpose of network simulation is to guide decision making with
>> respect to choice of network protocols and parameters in a particular
>> scenario. The scenarios are typically too expensive to try out in real
>> networks. Imagine a thousand node network placed in different cities.
>>
>> Regarding discrete event simulation, suppose we want to figure out how
>> many runways to have in a new airport. The more the runways the more costly
>> it is but it is also will result in less waiting time for
>> arriving/departing flights. One way to characterize the runway count vs
>> waiting time tradeoff is to use queueing theory. An alternative is to do
>> simulate flight arrivals and departures. At any point of time, the
>> simulated world will consists of flights in transit, flights waiting to
>> depart, flights waiting to land, flights departing and flights landing. In
>> discrete time simulation, the state of the simulated world is updated at a
>> regular time step (say every minute). Note that this is simulated time and
>> not wall-clock time. An alternative which is much faster to execute is
>> discrete event simulation. Here there is a queue of simulation events with
>> the earliest events at the head of the queue. The events are executed one
>> by one by picking the event at the head of the queue. When a flight
>> departure event is executed, a waiting to land event is inserted into the
>> queue having a timestamp equal to current time + the transit time of the
>> flight. When the waiting to land event is executed, if a runway is free a
>> land event is inserted into the queue having timestamp equal to the current
>> time + landing time. If a runway is not free, a circle the airport and
>> check again event is inserted into the queue having timestamp equal to the
>> current time + circling time. The purpose of the simulation queue is to
>> provide causal ordering of the simulation events.
>>
>> In my case, I think it is the simulation queue which is becoming the
>> bottleneck. I must add that network simulation falls under the category of
>> number-crunching applications which Erlang is not supposed to be suited
>> for. But people have been trying to accelerate it in C++-based simulators
>> using pthreads. I thought it was worth a try in Erlang given the amount of
>> work which has already gone into Erlang SMP.
>>
>> Regards
>> sarva
>>
>> [1]
>> https://github.com/avras/nsime/blob/master/src/nsime_udp_echo_client.erl
>> [2]
>> https://github.com/avras/nsime/blob/master/src/nsime_udp_echo_server.erl
>> [3] https://github.com/avras/nsime/blob/master/src/nsime_ptp_channel.erl
>>
>>
>>
>> On Sat, Dec 15, 2012 at 1:42 AM, Max Bourinov <bourinov@REDACTED> wrote:
>>
>>> Hi guys,
>>>
>>> Saravanan, maybe I am missing the point, but why not to create 10.000
>>> clients and servers processes? In the code you provided I don't see much
>>> process creation...
>>>
>>> Described load is relatively small and you should not see any
>>> performance issues with this amount of UDP traffic.
>>>
>>> p.s.: I am not familiar with discrete system simulation, so maybe my
>>> view is absolutely wrong))
>>>
>>> Best regards,
>>> Max
>>>
>>>
>>>
>>>
>>> On Fri, Dec 14, 2012 at 11:21 PM, Saravanan Vijayakumaran <
>>> sarva.v@REDACTED> wrote:
>>>
>>>> Dear Garrett, Daniel and Olivier,
>>>> Thanks for your
>>>> responses. I will try your suggestions.
>>>>
>>>> @Olivier: Thanks for sending the Sim-Diasca sources and docs earlier.
>>>> The ns-3 simulator which is the basis for NSIME is discrete event. Since
>>>> Sim-Diasca was discrete time, I put it on the backburner and haven't got a
>>>> chance to check it out.
>>>>
>>>> @All: Let me elaborate a little on the bottleneck I am seeing.
>>>>
>>>> The simulation scenario consists of 10,000 UDP echo client server pairs
>>>> with a pairwise point to point connection between them. Each UDP echo
>>>> client sends 10 packets to a unique UDP server with a inter-packet time of
>>>> 1s. Each UDP server replies to every packet it receives. In terms of the
>>>> simulation, the 10k UDP echo clients each schedule a packet send event
>>>> using a gen_server:call on nsime_simulator. When the simulation is run, the
>>>> send events are executed resulting in receive events being scheduled at the
>>>> UDP echo servers. The receive events are executed resulting in reply events
>>>> being scheduled. Every event being scheduled is a gen_server:call on the
>>>> nsime_simulator process.
>>>>
>>>> The parallelism I am trying to exploit right now is in the form of
>>>> simultaneous events (events having same time stamp). All the 10k send
>>>> events are simultaneous, the 10k receives are simultaneous as the point to
>>>> point connections between each pair has the same delay and data rate and
>>>> the 10k reply events are simultaneous.
>>>>
>>>> The parallel execution of the simultaneous events is done using the
>>>> following code fragment from src/nsime_simulator.erl. The gen_server:call
>>>> returns a list of simultaneous events and Stephen Marsh's plists module (
>>>> https://github.com/rmies/plists) is used to execute each event (an MFA
>>>> triple) in the list by dividing the list among 5 processes. This was for a
>>>> quad core machine. The plists code suggests using number of cores + 1.
>>>> Each time an event is executed it might result in another event being
>>>> scheduled which itself is a gen_server:call to nsime_simulator.
>>>>
>>>> parallel_run() ->
>>>> case gen_server:call(?MODULE, parallel_run) of
>>>> {events, EventList} ->
>>>> plists:foreach(
>>>> fun(Event) ->
>>>> erlang:apply(
>>>> Event#nsime_event.module,
>>>> Event#nsime_event.function,
>>>> Event#nsime_event.arguments
>>>> )
>>>>
>>>> end,
>>>> EventList,
>>>> {processes, 5}
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ),
>>>> ?MODULE:parallel_run();
>>>> none ->
>>>> simulation_complete
>>>> end.
>>>>
>>>>
>>>> On a quad core machine, the ns-3 C++ code for this example runs in 59
>>>> seconds while occupying only one core. The NSIME code runs in 65 seconds
>>>> while occupying about 300% out of a maximum of 400%. That doesn't sound too
>>>> bad. Another non-trivial C++ vs Erlang benchmark, one can argue. But the
>>>> NSIME code which does not execute the simultaneous events in parallel runs
>>>> in 79 seconds while occupying only one core . So I am seeing sublinear
>>>> speedups.
>>>>
>>>> The even more surprising thing is that on a 32 core machine the
>>>> parallel_run runs in 68 seconds when the simultaneous event list is divided
>>>> among 32 processes. Since the 32 core and quad core had different CPU
>>>> speeds I ran the simulation on the 32 core machine with the simultaneous
>>>> event list divided among 5 processes and that took 66 seconds. So I
>>>> actually saw a slowdown when I use more cores. When I divided the event
>>>> list among 32 processes, only 30-40% of each core was being utilized (as
>>>> seen in htop).
>>>>
>>>> I hope my description makes some sense. My gut feeling is that the
>>>> single nsime_simulator is the bottleneck. I will try to confirm it using
>>>> the Erlang profiling tools. I was hoping pg or pg2 can provide a solution
>>>> by distributing the message handling workload. I could schedule events in a
>>>> randomly chosen process in the process group and then collect the earliest
>>>> events from all the process group members. But to choose a random process I
>>>> may have to use pg2:get_closest_pid which itself may cause a
>>>> bottleneck.
>>>>
>>>> sarva
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Dec 14, 2012 at 9:35 PM, Daniel Luna <daniel@REDACTED> wrote:
>>>>
>>>>> On 14 December 2012 09:51, Garrett Smith <g@REDACTED> wrote:
>>>>> > On Fri, Dec 14, 2012 at 8:47 AM, Garrett Smith <g@REDACTED> wrote:
>>>>> >> Hi Saravanan,
>>>>> >> If you're bottlenecking on CPU (all your cores are fully utilized at
>>>>> >> peak load) then you need either a faster machine or you'll need to
>>>>> >> distribute your application to multiple machines.
>>>>> >
>>>>> > I should add there a number of ways you can improve efficiency, short
>>>>> > of adding hardware resources. The big win, once you understand what
>>>>> to
>>>>> > target, is C ports (or NIFs).
>>>>>
>>>>> But long before you start looking at either NIFs or new hardware, look
>>>>> into the complexity of the code itself. Does that expensive function
>>>>> even have to be called in all cases, etc.
>>>>>
>>>>> I guess this is my pet peeve when it comes to optimizing anything.
>>>>>
>>>>> First you optimize for readability (often gaining speed or at least
>>>>> finding issues)
>>>>> Then you measure
>>>>> Then you optimize the hotspots you've discovered by measuring
>>>>> *Then* you can start looking into hardware or non-Erlang solutions
>>>>>
>>>>> I've also seen situations where minor changes in the requirements have
>>>>> seen the possibility to speed up code by a factor 10 so that's also a
>>>>> possibility.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Daniel
>>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> erlang-questions mailing list
>>>> erlang-questions@REDACTED
>>>> http://erlang.org/mailman/listinfo/erlang-questions
>>>>
>>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20121215/45c6ecc5/attachment.htm>
More information about the erlang-questions
mailing list