[erlang-questions] gen_server bottleneck

Sat Dec 15 14:02:12 CET 2012

Max, you are right about the application not having enough decoupling
typical of Erlang applications. But as Chandrasekhar has pointed out, the
dynamic typing and higher order functions made it much easier to implement
the functionality in Erlang which was buck ugly in C++ (the ns-3 simulator
I was porting is in C++). For example, consider the following code fragment
which defines callbacks in ns-3 [1]. In NSIME, a callback is simply an MFA
triple where varying number of arguments is not an issue [2],[3]. If not
for performance, Erlang wins in terms of readability.

template <typename R>
 Callback<R> MakeCallback (R (*fnPtr)()) {
  return Callback<R> (fnPtr, true, true);
}
template <typename R, typename T1>
 Callback<R,T1> MakeCallback (R (*fnPtr)(T1)) {
  return Callback<R,T1> (fnPtr, true, true);
}
template <typename R, typename T1, typename T2>
 Callback<R,T1,T2> MakeCallback (R (*fnPtr)(T1,T2)) {
  return Callback<R,T1,T2> (fnPtr, true, true);
}
template <typename R, typename T1, typename T2,typename T3>
 Callback<R,T1,T2,T3> MakeCallback (R (*fnPtr)(T1,T2,T3)) {
  return Callback<R,T1,T2,T3> (fnPtr, true, true);
}
.... and so on till nine arguments are supported!

@Chandrasekhar Thanks for the ets pointer. I will check it out.

[1] http://www.nsnam.org/docs/release/3.15/doxygen/callback_8h_source.html
[2] https://github.com/avras/nsime/blob/master/src/nsime_callback.erl
[3] https://github.com/avras/nsime/blob/master/include/nsime_types.hrl#L40

On Sat, Dec 15, 2012 at 4:38 PM, Chandru <
chandrashekhar.mullaparthi@REDACTED> wrote:

> Surely the expressive power of erlang is a justification if the
> performance is 'good enough'?
>
> One way to serialise inserts into the work queue is to use an
> ordered_set ets table with a unique index such as the return value of
> now(). Also look into ets:select_delete to take matching tasks out of
> the queue. I haven't looked at the code in great detail, but I suspect
> these might be useful to you Saravanan.
>
> cheers
> Chandru
>
> On 15 December 2012 10:07, Max Bourinov <bourinov@REDACTED> wrote:
> > Then I don't see advantage of Erlang here. Is it possible to distribute
> work
> > that nsime_simulator does over several nodes?
> >
> > If not - I afraid there is no reason to use Erlang. If yes - load must be
> > distributed.
> >
> >
> > On Sat, Dec 15, 2012 at 1:45 PM, Saravanan Vijayakumaran <
> sarva.v@REDACTED>
> > wrote:
> >>
> >> Yes. That is correct. The nsime_simulator gen_server is the central
> >> authority which decides the event ordering.
> >>
> >>
> >> On Sat, Dec 15, 2012 at 12:00 PM, Max Bourinov <bourinov@REDACTED>
> wrote:
> >>>
> >>> Hi Saravanan,
> >>>
> >>> Thank you for explanations. Please correct me if I am wrong: neglecting
> >>> the fact that you have many processes you have to execute event one by
> one
> >>> on one gen_server because ir requires by simulation approach. Is that
> >>> correct?
> >>>
> >>> Best regards,
> >>> Max
> >>>
> >>>
> >>> On Sat, Dec 15, 2012 at 6:22 AM, Saravanan Vijayakumaran
> >>> <sarva.v@REDACTED> wrote:
> >>>>
> >>>> Dear Max,
> >>>>                 There are in fact 10,000 client and server processes.
> >>>> Each of them is a gen_server [1][2]. These are just simulation
> entities
> >>>> which are connected by point-to-point channels [3] which are also a
> >>>> simulation entities.
> >>>>
> >>>> The purpose of network simulation is to guide decision making with
> >>>> respect to choice of network protocols and parameters in a particular
> >>>> scenario. The scenarios are typically too expensive to try out in real
> >>>> networks. Imagine a thousand node network placed in different cities.
> >>>>
> >>>> Regarding discrete event simulation, suppose we want to figure out how
> >>>> many runways to have in a new airport. The more the runways the more
> costly
> >>>> it is but it is also will result in less waiting time for
> arriving/departing
> >>>> flights. One way to characterize the runway count vs waiting time
> tradeoff
> >>>> is to use queueing theory. An alternative is to do simulate flight
> arrivals
> >>>> and departures. At any point of time, the simulated world will
> consists of
> >>>> flights in transit, flights waiting to depart, flights waiting to
> land,
> >>>> flights departing and flights landing. In discrete time simulation,
> the
> >>>> state of the simulated world is updated at a regular time step (say
> every
> >>>> minute). Note that this is simulated time and not wall-clock time. An
> >>>> alternative which is much faster to execute is discrete event
> simulation.
> >>>> Here there is a queue of simulation events with the earliest events
> at the
> >>>> head of the queue. The events are executed one by one by picking the
> event
> >>>> at the head of the  queue. When a flight departure event is executed,
> a
> >>>> waiting to land event is inserted into the queue having a timestamp
> equal to
> >>>> current time + the transit time of the flight. When the waiting to
> land
> >>>> event is executed, if a runway is free a land event is inserted into
> the
> >>>> queue having timestamp equal to the current time + landing time. If a
> runway
> >>>> is not free, a circle the airport and check again event is inserted
> into the
> >>>> queue having timestamp equal to the current time + circling time. The
> >>>> purpose of the simulation queue is to provide causal ordering of the
> >>>> simulation events.
> >>>>
> >>>> In my case, I think it is the simulation queue which is becoming the
> >>>> bottleneck. I must add that network simulation falls under the
> category of
> >>>> number-crunching applications which Erlang is not supposed to be
> suited for.
> >>>> But people have been trying to accelerate it in C++-based simulators
> using
> >>>> pthreads. I thought it was worth a try in Erlang given the amount of
> work
> >>>> which has already gone into Erlang SMP.
> >>>>
> >>>> Regards
> >>>> sarva
> >>>>
> >>>> [1]
> >>>>
> https://github.com/avras/nsime/blob/master/src/nsime_udp_echo_client.erl
> >>>> [2]
> >>>>
> https://github.com/avras/nsime/blob/master/src/nsime_udp_echo_server.erl
> >>>> [3]
> https://github.com/avras/nsime/blob/master/src/nsime_ptp_channel.erl
> >>>>
> >>>>
> >>>>
> >>>> On Sat, Dec 15, 2012 at 1:42 AM, Max Bourinov <bourinov@REDACTED>
> >>>> wrote:
> >>>>>
> >>>>> Hi guys,
> >>>>>
> >>>>> Saravanan, maybe I am missing the point, but why not to create 10.000
> >>>>> clients and servers processes? In the code you provided I don't see
> much
> >>>>> process creation...
> >>>>>
> >>>>> Described load is relatively small and you should not see any
> >>>>> performance issues with this amount of UDP traffic.
> >>>>>
> >>>>> p.s.: I am not familiar with discrete system simulation, so maybe my
> >>>>> view is absolutely wrong))
> >>>>>
> >>>>> Best regards,
> >>>>> Max
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Fri, Dec 14, 2012 at 11:21 PM, Saravanan Vijayakumaran
> >>>>> <sarva.v@REDACTED> wrote:
> >>>>>>
> >>>>>> Dear Garrett, Daniel and Olivier,
> >>>>>>                                               Thanks for your
> >>>>>> responses. I will try your suggestions.
> >>>>>>
> >>>>>> @Olivier: Thanks for sending the Sim-Diasca sources and docs
> earlier.
> >>>>>> The ns-3 simulator which is the basis for NSIME is discrete event.
> Since
> >>>>>> Sim-Diasca was discrete time, I put it on the backburner and
> haven't got a
> >>>>>> chance to check it out.
> >>>>>>
> >>>>>> @All: Let me elaborate a little on the bottleneck I am seeing.
> >>>>>>
> >>>>>> The simulation scenario consists of 10,000 UDP echo client server
> >>>>>> pairs with a pairwise point to point connection between them. Each
> UDP echo
> >>>>>> client sends 10 packets to a unique UDP server with a inter-packet
> time of
> >>>>>> 1s. Each UDP server replies to every packet it receives. In terms
> of the
> >>>>>> simulation, the 10k UDP echo clients each schedule a packet send
> event using
> >>>>>> a gen_server:call on nsime_simulator. When the simulation is run,
> the send
> >>>>>> events are executed resulting in receive events being scheduled at
> the UDP
> >>>>>> echo servers. The receive events are executed resulting in reply
> events
> >>>>>> being scheduled. Every event being scheduled is a gen_server:call
> on the
> >>>>>> nsime_simulator process.
> >>>>>>
> >>>>>> The parallelism I am trying to exploit right now is in the form of
> >>>>>> simultaneous events (events having same time stamp). All the 10k
> send events
> >>>>>> are simultaneous, the 10k receives are simultaneous as the point to
> point
> >>>>>> connections between each pair has the same delay and data rate and
> the 10k
> >>>>>> reply events are simultaneous.
> >>>>>>
> >>>>>> The parallel execution of the simultaneous events is done using the
> >>>>>> following code fragment from src/nsime_simulator.erl. The
> gen_server:call
> >>>>>> returns a list of simultaneous events and Stephen Marsh's plists
> module
> >>>>>> (https://github.com/rmies/plists) is used to execute each event
> (an MFA
> >>>>>> triple) in the list by dividing the list among 5 processes. This
> was for a
> >>>>>> quad core machine. The plists code suggests using  number of cores
> + 1. Each
> >>>>>> time an event is executed it might result in another event being
> scheduled
> >>>>>> which itself is a gen_server:call to nsime_simulator.
> >>>>>>
> >>>>>> parallel_run() ->
> >>>>>>     case gen_server:call(?MODULE, parallel_run) of
> >>>>>>         {events, EventList} ->
> >>>>>>             plists:foreach(
> >>>>>>                 fun(Event) ->
> >>>>>>                     erlang:apply(
> >>>>>>                         Event#nsime_event.module,
> >>>>>>                         Event#nsime_event.function,
> >>>>>>                         Event#nsime_event.arguments
> >>>>>>                     )
> >>>>>>                 end,
> >>>>>>                 EventList,
> >>>>>>                 {processes, 5}
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>             ),
> >>>>>>             ?MODULE:parallel_run();
> >>>>>>         none ->
> >>>>>>             simulation_complete
> >>>>>>     end.
> >>>>>>
> >>>>>>
> >>>>>> On a quad core machine, the ns-3 C++ code for this example runs in
> 59
> >>>>>> seconds while occupying only one core. The NSIME code runs in 65
> seconds
> >>>>>> while occupying about 300% out of a maximum of 400%. That doesn't
> sound too
> >>>>>> bad. Another non-trivial C++ vs Erlang benchmark, one can argue.
> But the
> >>>>>> NSIME code which does not execute the simultaneous events in
> parallel runs
> >>>>>> in 79 seconds while occupying only one core . So I am seeing
> sublinear
> >>>>>> speedups.
> >>>>>>
> >>>>>> The even more surprising thing is that on a 32 core machine the
> >>>>>> parallel_run runs in 68 seconds when the simultaneous event list is
> divided
> >>>>>> among 32 processes. Since the 32 core and quad core had different
> CPU speeds
> >>>>>> I ran the simulation on the 32 core machine with the simultaneous
> event list
> >>>>>> divided among 5 processes and that took 66 seconds. So I actually
> saw a
> >>>>>> slowdown when I use more cores. When I divided the event list among
> 32
> >>>>>> processes, only 30-40% of each core was being utilized (as seen in
> htop).
> >>>>>>
> >>>>>> I hope my description makes some sense. My gut feeling is that the
> >>>>>> single nsime_simulator is the bottleneck. I will try to confirm it
> using the
> >>>>>> Erlang profiling tools. I was hoping pg or pg2 can provide a
> solution by
> >>>>>> distributing the message handling workload. I could schedule events
> in a
> >>>>>> randomly chosen process in the process group and then collect the
> earliest
> >>>>>> events from all the process group members. But to choose a random
> process I
> >>>>>> may have to use pg2:get_closest_pid which itself may cause a
> bottleneck.
> >>>>>>
> >>>>>> sarva
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Fri, Dec 14, 2012 at 9:35 PM, Daniel Luna <daniel@REDACTED>
> wrote:
> >>>>>>>
> >>>>>>> On 14 December 2012 09:51, Garrett Smith <g@REDACTED> wrote:
> >>>>>>> > On Fri, Dec 14, 2012 at 8:47 AM, Garrett Smith <g@REDACTED> wrote:
> >>>>>>> >> Hi Saravanan,
> >>>>>>> >> If you're bottlenecking on CPU (all your cores are fully
> utilized
> >>>>>>> >> at
> >>>>>>> >> peak load) then you need either a faster machine or you'll need
> to
> >>>>>>> >> distribute your application to multiple machines.
> >>>>>>> >
> >>>>>>> > I should add there a number of ways you can improve efficiency,
> >>>>>>> > short
> >>>>>>> > of adding hardware resources. The big win, once you understand
> what
> >>>>>>> > to
> >>>>>>> > target, is C ports (or NIFs).
> >>>>>>>
> >>>>>>> But long before you start looking at either NIFs or new hardware,
> >>>>>>> look
> >>>>>>> into the complexity of the code itself.  Does that expensive
> function
> >>>>>>> even have to be called in all cases, etc.
> >>>>>>>
> >>>>>>> I guess this is my pet peeve when it comes to optimizing anything.
> >>>>>>>
> >>>>>>> First you optimize for readability (often gaining speed or at least
> >>>>>>> finding issues)
> >>>>>>> Then you measure
> >>>>>>> Then you optimize the hotspots you've discovered by measuring
> >>>>>>> *Then* you can start looking into hardware or non-Erlang solutions
> >>>>>>>
> >>>>>>> I've also seen situations where minor changes in the requirements
> >>>>>>> have
> >>>>>>> seen the possibility to speed up code by a factor 10 so that's
> also a
> >>>>>>> possibility.
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>>
> >>>>>>> Daniel
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> erlang-questions mailing list
> >>>>>> erlang-questions@REDACTED
> >>>>>> http://erlang.org/mailman/listinfo/erlang-questions
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
> >
> > _______________________________________________
> > erlang-questions mailing list
> > erlang-questions@REDACTED
> > http://erlang.org/mailman/listinfo/erlang-questions
> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20121215/84f09c72/attachment.htm>