[erlang-questions] time consuming operations inside gen_server
Wed Dec 12 14:23:26 CET 2012
I gave a quick read to this thread and there are a few things I think
should be mentioned in order to make a decision. I'm writing this as
some kind of general guide I follow mentally, so please do not feel
patronized if you find I approach things at a basic level that's too
simple for your level of expertise. I'm writing it for you, but also for
myself (or anyone else finding it over google or whatever).
I believe you won't solve this problem by leaving things as they are,
but there are properties to figure out regarding the kind of work you're
doing before picking a solution:
1. Is this queue build-up related to temporary overflow? Does it happen
at peak time, in bursts, or is it a continuous overflow?
2. Are the tasks you're running in any way bound by time? What I mean
here is to ask how long you're allowed to wait. Is it milliseconds,
seconds, or hours, before a cast is a problem?
3. Are you in charge of producing the events in-system, or is it
something triggered by user actions, outside of your control?
4. Why does it take long to process? Is it CPU-bound work, a
dependency on other workers, or I/O-bound work (disk, network)
slowing your server down?
5. What's the nature of events you're handling?
Answering each of these questions will be the first step to being able
to pick an adequate solution. Here are a few possibilities:
- If you're in charge of producing events (your system creates them from
some static data source, for example) and can regulate them, do so by
having a fixed number of producers make synchronous calls, putting
back-pressure from your server onto the producers. They won't do more
work than the consuming part of your system can handle.
In general terms, applying back-pressure this way is the most
effective way to solve and survive overload issues. It's a bit
tricky because it means you're pushing the problem up a level in your
stack, until at some point either the pressure gets absorbed somewhere,
or you end up pushing the back-pressure onto your users, and that's
sometimes not acceptable. Pushing it back to some load-balancing
mechanism that dispatches across more instances is often acceptable
as an alternative.
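As a minimal sketch of that first option (names like `consumer` and
`next_event/0` are made up for illustration):

```erlang
%% Each producer blocks in gen_server:call/3 until the consumer has
%% actually processed the event, so a fixed pool of producers can
%% never outrun the consuming side of the system.
producer_loop(Consumer) ->
    Event = next_event(),  %% pull the next event from the static source
    ok = gen_server:call(Consumer, {event, Event}, infinity),
    producer_loop(Consumer).
```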
- You may expect tasks to take long to run, but to be quick to
acknowledge. In this case, moving to an asynchronous model makes
sense. This can be done by spawning workers to do the tasks while the
server simply accepts the queries, responds to them, and queues up
the answers that have yet to come.
This form of concurrency is different from adding processes for the
sake of parallelism. We don't expect to handle more requests (or at
least no more than Original*NumberOfCores); we just want to make the
accepting/answering of events and their handling disjoint, not
happening in the same timeline, to avoid blocking.
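Concretely, the usual trick is to return {noreply, State} from
handle_call/3 and let the worker answer the caller itself with
gen_server:reply/2 (a sketch only; `do_long_task/1` is hypothetical):

```erlang
%% Accept the request immediately, hand the slow part to a worker,
%% and let the worker send the real answer when it's done.
handle_call({task, Task}, From, State) ->
    spawn_link(fun() ->
        Result = do_long_task(Task),       %% the time-consuming bit
        gen_server:reply(From, Result)     %% answer the original caller
    end),
    {noreply, State}.                      %% the server is free again
```

The same shape works for casts, except there's no From to reply to, so
the worker would notify the server (or whoever cares) by message instead.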
- If requests are to be very short-lived, it may be interesting to just
not handle them when the system is very busy, and fail. This is a bit
more tricky to put in place, and I believe it's rarely a good solution
given the nature of the problems being solved. Systems where you're
allowed to give up and not handle things at all are a bit rare, I believe.
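If you do go that way, one cheap heuristic is to look at your own
message queue length and refuse work past some threshold (a sketch; the
threshold value is arbitrary and `do_task/1` is hypothetical):

```erlang
%% Refuse new work when the mailbox is already too deep, instead of
%% letting callers pile up behind a busy server.
handle_call({task, Task}, _From, State) ->
    {message_queue_len, Len} = erlang:process_info(self(), message_queue_len),
    case Len > 1000 of                     %% arbitrary overload threshold
        true  -> {reply, {error, overloaded}, State};
        false -> {reply, do_task(Task), State}
    end.
```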
- If your handling of events is slow due to the task simply being long
to handle, you could try to figure the ideal rate at which you process
data and the overload you could handle in peak hours. This means
figuring out how many requests per second (or whatever time slice) you
can handle, how much you receive, and then finding a way to raise this
value through optimization, or adding more processes or more machines to
handle it. This is often a good way to proceed, as far as I know, when
you deal with predictable levels of overload or a constant level of load.
If the problem is CPU-bound, then there's an upper
limit to how much parallelism will help you. A better choice of algorithm
or data structure, going down to HiPE or C, or finally using more
computers to do the work can all be considered.
If it's network or disk bound, then you have plenty of different
options to try. SSDs, compressing data, buffering before pushing it
around, possibly merging events or dropping non-vital ones, adding
more end-points (similar to sharding) may all help reduce that cost.
- It's possible you have different kinds of events, either from different
sources or to different endpoints. If that happens, it may be
interesting to quickly dispatch events from your central process to
workers dedicated to a source and/or an endpoint. This will naturally
divide the workload done and may solve your problem to some extent. If
what you get is extremely uneven distribution of events (for example,
95% of them are from one source to one endpoint), then dividing your
dispatching and handling is likely to only help about 5% of requests.
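The dispatching itself can stay tiny; something like this (the #state{}
record and the one-worker-per-source layout are assumptions of mine):

```erlang
%% The central process only routes; each source gets its own worker,
%% so slow handling for one source no longer blocks the others.
handle_cast({event, Source, Event}, State = #state{workers = Workers}) ->
    Worker = maps:get(Source, Workers),    %% one dedicated worker per source
    gen_server:cast(Worker, {event, Event}),
    {noreply, State}.
```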
- Dropping out of OTP is a possibility, as Loïc suggested, but I would
personally only do it once you know for a fact that OTP is one of the
bottlenecks causing problems. I think OTP-by-default is a sane thing to
have, and while dropping down is always an option, I think you should
be able to debate and prove why it made sense before actually doing it.
- If you manage to fix your sequential bottleneck, you'll possibly find
out you're creating a new one further down the system, up until you
either get rid of all of them, or you reach a point where you need to
apply back-pressure at a hard limit. This one is particularly painful
because it may mean parts of your hard work need to be undone to start
bubbling the back-pressure mechanisms up to a higher level.
It may be interesting to make sure you know the true underlying cause
of your problem there, to avoid optimizing towards a wall that way.
That's about what I can manage to think of this morning. I hope this
proves helpful to anybody out there and that I didn't insert too many
typos or errors.
On 12/12, Martin Dimitrov wrote:
> Hi all,
> In our application, we have a gen_server that does a time consuming
> operation. The message is of type cast thus the caller process doesn't
> sit and wait for the operation to finish. But while the gen_server is
> busy with the cast message, it doesn't serve any other call, right?
> So, would it be appropriate to create a process that will do the time
> consuming operation and then notify the gen_server?
> Thanks for looking into this.
> Best regards,