[erlang-questions] Slow when using chained gen_server:call's, redesign or optimize?

Jachym Holecek freza@REDACTED
Sun Jan 29 21:59:03 CET 2012


# Jesper Louis Andersen 2012-01-29:
> On 1/29/12 12:13 AM, Jachym Holecek wrote:
> >>-export([write/1, read/1]).
> >>
> >>write(Obj) ->
> >>   call({write, Obj}).
> >>
> >>call(M) ->
> >>   gen_server:call(?SERVER, M, infinity).
> >
> >   3. No infinite timeouts without very good justification! You're
> >      sacrificing a good default protective measure for no good
> >      reason...
> >
> The question here is really one of what you want to do if a write
> takes more than the default timeout of 5 seconds. You can either
> decide to tell the writer that the cache is clogged up at the moment
> and then abstain from caching the result, you can crash, or you can
> wait around like the above code does. What the right thing to do in
> a given situation is depends a bit on what semantics you want.

True.

> Perhaps my reaction for a cache was too knee-jerky. You might want a
> call to fail if it takes too long to process because it will uncover
> another problem in the code: namely overload of the cache.

And it will also release the resources held by the waiting process; that's
my main concern. What came to mind immediately when I saw the infinite
timeout was a typical (IMO) scenario in which one might be tempted to use
it:

  %% This could be a load-balancer or failover manager process that
  %% acts as entry point to a protocol stack, or somesuch thing.

  send_req(Pid, Req, Timeout) ->
      gen_server:call(Pid, {send_req, Req, Timeout}, infinity).

  handle_call({send_req, Req, Timeout}, From, #state{parties = Ps} = State) ->
      party:send_req(choose_party(Ps), {send_req, Req, From, Timeout}),
      {noreply, State};

Here the party module would perhaps do some more delegation of its own,
and eventually the request ends up sent out to an external system; a
timeout is scheduled and its reference recorded, along with From, in an
ETS table. When either the timeout fires or a response arrives, it gets
correlated against the ETS table and gen_server:reply/2 is called.
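
A minimal sketch of what that delegating party process might look like,
assuming party:send_req/2 boils down to a gen_server:cast (the names here
-- pending_reqs, transport -- are made up for illustration, not taken from
any real system):

  %% Record From and a timer reference in ETS, forward the request.
  handle_cast({send_req, Req, From, Timeout}, State) ->
      Ref = erlang:start_timer(Timeout, self(), request_timeout),
      ets:insert(pending_reqs, {Ref, From}),
      transport:send(Req, Ref),               %% hypothetical external send
      {noreply, State}.

  %% Response arrived: correlate against ETS, reply to the original caller.
  handle_info({response, Ref, Result}, State) ->
      case ets:lookup(pending_reqs, Ref) of
          [{Ref, From}] ->
              ets:delete(pending_reqs, Ref),
              erlang:cancel_timer(Ref),
              gen_server:reply(From, {ok, Result});
          [] ->
              ok                               %% already timed out
      end,
      {noreply, State};

  %% Timeout fired before any response: fault the pending request.
  handle_info({timeout, Ref, request_timeout}, State) ->
      case ets:lookup(pending_reqs, Ref) of
          [{Ref, From}] ->
              ets:delete(pending_reqs, Ref),
              gen_server:reply(From, {error, timeout});
          [] ->
              ok
      end,
      {noreply, State}.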

Now if something goes wrong and that ETS table evaporates, or a low-level
process explodes and the error, by mistake, isn't propagated correctly
(which would involve faulting all pending requests immediately), one is
left with the client process sitting there forever. Sure, gen_server:call/X
isn't stupid and monitors the server process -- but given the amount of
delegation we have going on, that one may very well still be alive and
doing well. Over time these zombie processes could add up, and the whole
node crashes.

This is a somewhat elaborate scenario and depends on a suitable bug already
being present somewhere in the system, or perhaps just some unfortunate
timing in an otherwise reasonably designed system. But let's consider a
trivial change, everything else being the same:

  send_req(Pid, Req, Timeout) ->
      gen_server:call(Pid, {send_req, Req, scale_down(Timeout)}, Timeout).

  scale_down(N) ->
      %% This could involve both low and high internal processing overhead
      %% allowance cutoff if one wanted to be super-correct about this.
      round(N * 0.90).
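
For what it's worth, a sketch of that "super-correct" variant might look
something like the following; the slack constants are entirely arbitrary
and chosen only for illustration:

  %% Sketch only: reserve some slack for our own internal processing,
  %% bounded below and above, and never hand out a negative downstream
  %% timeout.
  -define(MIN_SLACK,  50).   %% ms
  -define(MAX_SLACK, 500).   %% ms

  scale_down(N) ->
      Slack = min(?MAX_SLACK, max(?MIN_SLACK, round(N * 0.10))),
      max(0, N - Slack).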

Bugs or not, timing or not, we now have a hard deadline on resource
release and can sleep a bit more peacefully at night without nightmares
of a zombie apocalypse. :-) Furthermore, this valuable behavioral contract
is immediately apparent during code inspection, making things easier to
reason about.
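
And on the calling side that deadline turns into an exit which can be
handled locally if one so wishes; again just an illustrative sketch,
wrapping the send_req/3 above:

  %% Sketch only: gen_server:call/3 exits with a {timeout, ...} reason
  %% when the deadline passes; convert that into an ordinary error
  %% return so the caller can decide whether to retry, degrade or fail.
  request(Pid, Req, Timeout) ->
      try
          send_req(Pid, Req, Timeout)
      catch
          exit:{timeout, {gen_server, call, _}} ->
              {error, timeout}
      end.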

Hopefully this sort of context clarifies my (possibly overly terse)
response.

BR,
	-- Jachym


