[erlang-questions] Timeout Erlang GenServer Crash Loop

Matthew Evans mattevans123@REDACTED
Fri Oct 12 18:40:22 CEST 2012


It's hard to answer without knowing what your code is doing (e.g. maybe there is an inefficiency somewhere). However, a common design pattern, if your gen_server is doing complex work, is to spawn another process to do that task. If you are running on multiple cores the work will be distributed over the different cores, e.g.:
handle_call({some_operation, Data}, From, State) ->
    spawn(fun() ->
              Rsp = do_lots_of_work(Data),
              gen_server:reply(From, Rsp)
          end),
    {noreply, State};
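
For illustration, the calling side could look like the sketch below (the server
name my_server and the 10 second timeout are placeholders, not from the
original post). Note that gen_server:call/3 still enforces its timeout while it
waits for the gen_server:reply/2 sent from the worker process:

    %% Hypothetical caller; my_server and the 10000 ms timeout are made up.
    %% The call blocks until the spawned worker replies via gen_server:reply/2,
    %% or until the timeout fires, whichever comes first.
    request_work(Data) ->
        gen_server:call(my_server, {some_operation, Data}, 10000).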

Date: Fri, 12 Oct 2012 00:31:13 -0700
From: mjtruog@REDACTED
To: codeithere@REDACTED
CC: erlang-questions@REDACTED
Subject: Re: [erlang-questions] Timeout Erlang GenServer Crash Loop

    OK, if you are experiencing latency with file I/O, make sure you have
    async thread pool threads enabled on the Erlang VM, with something like:

    erl +A 5

    If it is related to the async thread pool, keep in mind that the job
    queue is not shared between the threads; each thread has its own queue,
    so the size of the async thread pool can impact the wait time. That
    means file I/O can take longer if the async thread pool is smaller,
    though you normally don't need to start a large number of async threads.
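
    As a quick check (a sketch, not from the original reply), a running node
    will report the size of its async thread pool via erlang:system_info/1;
    a value of 0 means no async threads were started:

        %% Number of async threads the VM was started with (the +A flag).
        %% Returns 0 when no async thread pool exists.
        PoolSize = erlang:system_info(thread_pool_size),
        io:format("async thread pool size: ~p~n", [PoolSize]).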

    If it is socket stuff, it might be related to the encoding, but
    there are many possibilities down that road.

    On 10/12/2012 12:15 AM, Code Box wrote:
      Thanks for your reply, I really appreciate it. I do have a lot of
      load on my server, a few thousand requests per second. But the
      process getting the timeout is not waiting on any other process;
      that call just does a . So I definitely think the cause is that the
      process is overloaded and the other requests sitting in its queue
      are timing out. I am trying to prove this by looking at the server
      metrics around CPU, memory, and I/O stats. Speaking of I/O stats, I
      do see a big spike there. That could be the reason other processes
      are blocked while the I/O is happening, which can cause CPU
      contention.

      On Thu, Oct 11, 2012 at 10:18 PM, Michael Truog <mjtruog@REDACTED> wrote:

        Well, a common problem is that the process is itself blocked on its
        own synchronous call. That can keep the CPU usage low, since the
        process spends most of its time idle, waiting for one or more
        responses from other processes. The best way I have seen to deal
        with this type of timeout problem is to always pass the timeout in
        the message, like this:

            gen_server:call(<process>, {<message>, Timeout - DELTA}, Timeout)

        Where DELTA can be, say, 100 milliseconds. The (Timeout - DELTA)
        value that handle_call sees can then be used for any internal
        synchronous calls. However, the problem then becomes understanding
        what the cumulative delay might be if there are multiple synchronous
        calls within the process. Ideally, the process is kept simple enough
        that it doesn't need to track many synchronous calls.
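
        A minimal sketch of that pattern (the names outer_op, inner_op,
        inner_server and the DELTA macro are illustrative, not from the
        thread): the caller shrinks the timeout it embeds in the message,
        and the handle_call clause reuses that value for its own internal
        synchronous call, so the inner call gives up before the outer
        caller does.

            -define(DELTA, 100).  %% milliseconds shaved off at each hop

            %% Caller side: embed the reduced timeout in the message itself.
            outer_call(Pid, Request, Timeout) ->
                gen_server:call(Pid, {Request, Timeout - ?DELTA}, Timeout).

            %% Server side: use the timeout carried in the message for any
            %% internal synchronous call.
            handle_call({{outer_op, Data}, Timeout}, _From, State) ->
                Reply = gen_server:call(inner_server, {inner_op, Data}, Timeout),
                {reply, Reply, State}.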

        I am not entirely sure this is your problem, since it could also be
        latency from long-running function calls blocking schedulers, or
        something strange like code loading locking the schedulers. Usually,
        though, those issues aren't as common a concern.

        On 10/11/2012 10:05 PM, Code Box wrote:
          Wouldn't the process being overloaded show up in the CPU and
          memory stats of my host? I see CPU usage of just 50%.

          On Thu, Oct 11, 2012 at 9:14 PM, Michael Truog <mjtruog@REDACTED> wrote:

            On 10/11/2012 09:03 PM, Code Box wrote:
              ** Reason for termination ==
              ** {timeout,{gen_server,call,[thetime,gettime]}}

              =CRASH REPORT==== 2012-10-09 05:37:04 UTC ===
                crasher:
                  initial call: process_listener:-init/1-fun-2-/0
                  pid: <0.12376.513>
                  registered_name: []
                  exception exit: {timeout,{gen_server,call,[thetime,gettime]}}
                    in function  gen_server:terminate/6
                  ancestors: [incoming_req_processor,incoming_sup,top_process_sup,<0.52.0>]
                  messages: []
                  links: []
                  dictionary: [{random_seed,{23375,22820,17046}}]
                  trap_exit: true
                  status: running
                  heap_size: 6765
                  stack_size: 24
                  reductions: 1646842
                neighbours:

              I am seeing a lot of these messages in my crash reports. Once
              it gets to this point, it goes into a crash loop for quite a
              while. I am not sure how to debug this error, and these
              timeouts are really annoying. Can someone help me understand
              the root cause of this?

              Why are my gen_server calls facing timeouts? Is it that my
              Erlang VM is slow, and if so, why? How can I debug this issue
              to get to the root cause of it?

            If you look at gen_server:call/2 at
            http://www.erlang.org/doc/man/gen_server.html it shows the
            default Timeout is 5000 milliseconds (5 seconds). Your
            gen_server process must have been busy for longer than 5 seconds
            while a gen_server:call/2 message was waiting in its message
            queue, which caused the timeout exception. So it isn't the
            Erlang VM being slow; it is just an Erlang process that is
            overloaded (i.e., the "thetime" registered process).

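            One way to confirm that (a sketch, not from the original reply)
            is to watch the message queue of the registered process while
            the system is under load; a steadily growing message_queue_len
            for thetime is a strong sign it cannot keep up:

                %% Inspect the suspected overloaded process; the name
                %% thetime is taken from the crash report above.
                case whereis(thetime) of
                    undefined ->
                        io:format("thetime is not registered on this node~n");
                    Pid ->
                        {message_queue_len, Len} =
                            erlang:process_info(Pid, message_queue_len),
                        io:format("thetime message queue length: ~p~n", [Len])
                end.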


_______________________________________________
erlang-questions mailing list
erlang-questions@REDACTED
http://erlang.org/mailman/listinfo/erlang-questions