[erlang-questions] gen_server bottleneck

Thu Dec 20 17:38:45 CET 2012

A couple of quick comments:

Perhaps you don't really have much parallel execution if everyone is waiting on a go on the main simulator process. This could explain the small msg queue. A call is synchronous, so it incurs lots and lots of waiting.

io:wait_mon_reply suggests you are outputting debug information. Don't. It is slowing you down. Use tracing instead if you need to dig into code.

If the storage in gb_trees are very small values, consider storing them outside the process heap in ETS and use ets:lookup_element/3 to pick out the value. This avoids GC'ing and allocation in the simulator and it is often faster.

do try to run eprof on the simulator to verify it uses all its time in gb_trees functions.

The plists could have been passed a very large data structure it then keeps in memory. eprof it, or ask about its current state. Also consider not passing around data if it can mostly be stored in ETS or you can keep a process alive with the data in it. Passing around large structures is rather expensive when copying is involved, so it is often better to have a process play the role of the data.

Jesper Louis Andersen
  Erlang Solutions Ltd., Copenhagen

On Dec 20, 2012, at 11:34 AM, Saravanan Vijayakumaran <sarva.v@REDACTED> wrote:

> I tried running etop to profile the simulation. The commands I used and a snapshot of the etop data are below. The nsime_simulator process is undergoing a lot of reductions. But the message queue is zero. I was expecting to see a lot of messages there because 10,000 event schedule calls each result in a gen_server:call to nsime_simulator. The nsime_simulator contains the event list gb_trees and hence it occupies a lot of memory. So that is not surprising. The plists module is used to run the simultaneous events in parallel. It is occupying the maximum memory. I don't understand how it can occupy more memory than the nsime_simulator itself. I will have to look at the code again to figure this out. I will keep you posted. Thanks for all your suggestions.
> 
> Simulation command
> erl -sname try -noshell +P 1000000 -pa ebin/ -pa examples/ebin/ -run udp_cs_pairs start 10000 -run init stop
> 
> etop command:
> /usr/lib/erlang/lib/observer-0.9.10/priv/bin/getop -node try@REDACTED -tracing off
> 
> 
> ========================================================================================
>  'try@REDACTED'                                                           10:26:35
>  Load:  cpu       177               Memory:  total      784544    binary     105148
>         procs  130041                        processes  664962    code         1888
>         runq        4                        atom          143    ets           144
> 
> Pid            Name or Initial Func    Time    Reds  Memory    MsgQ Current Function
> ----------------------------------------------------------------------------------------
> <4696.2.0>     erlang:apply/2           '-'   70800 13799852       0 plists:cluster_runmany
> <4696.35.0>    nsime_simulator          '-' 2315977 10471532       0 gen_server:loop/6   
> <4696.31801.3> erlang:apply/2           '-'   70568   556908       1 gen:do_call/4       
> <4696.31804.3> erlang:apply/2           '-'   77260   344376       1 io:wait_io_mon_reply
> <4696.31800.3> erlang:apply/2           '-'   72686   344360       0 io:wait_io_mon_reply
> <4696.31802.3> erlang:apply/2           '-'   75552   344360       0 io:wait_io_mon_reply
> <4696.31803.3> erlang:apply/2           '-'   74720   344360       0 io:wait_io_mon_reply
> <4696.37.0>    nsime_node_list          '-'       0   185928       0 gen_server:loop/6   
> <4696.25.0>    code_server              '-'       0   142164       0 code_server:loop/1  
> <4696.38.0>    nsime_channel_list       '-'       0   142144       0 gen_server:loop/6   
> ========================================================================================
> 
> I also get a bunch of messages in the etop window if I don't set "-tracing off".
> 
> Erlang top dropped data 151
> Erlang top got garbage {trace_ts,<4695.18979.3>,out,
>                                  {gen_server,loop,6},
>                                  {1355,998744,703693}}
> Erlang top got garbage {trace_ts,<4695.8399.1>,out,
>                                  {gen,do_call,4},
>                                  {1355,998744,703731}}
> 
> 
> 
> On Sun, Dec 16, 2012 at 7:46 PM, Jesper Louis Andersen <jesper.louis.andersen@REDACTED> wrote:
> 
> On Dec 15, 2012, at 9:19 PM, Garrett Smith <g@REDACTED> wrote:
> 
> > I've lost track of this thread -- are you still guessing, or have you
> > spent any time measuring? I looked back and couldn't see any
> > discussion about e.g. specific processes with high reduction counts,
> > growing queue sizes, etc.
> 
> Sometimes, a well-placed eprof measurement does wonders to tell you what is slow. We had a system that was spending 40% of its time in a specific regular expression once, but this is hard to see unless you profile. I'd say you should look at processes with high reduction counts, or groups of those and then look into why they are reducing all the time.
> 
> But I have a hunch that this has more to do with the fact that the simulation is serial more than parallel. If you want to utilize more than one core then, you need to figure out how the problem can be split up.
> 
> Jesper Louis Andersen
>   Erlang Solutions Ltd., Copenhagen
> 
> 
> 
>