[erlang-questions] time consuming operations inside gen_server
Wed Dec 12 14:23:26 CET 2012
I gave a quick read to this thread and there are a few things I think
should be mentioned in order to make a decision. I'm writing this as
some kind of general guide I follow mentally, so please do not feel
patronized if you find I approach things at a basic level that's too
simple for your level of expertise. I'm writing it for you, but also for
myself (or anyone else finding it over google or whatever).
I believe you won't solve this problem by leaving things as they are,
but there are properties to figure out regarding the kind of work you're
doing before picking a solution:
1. Is this queue build-up related to temporary overflow? Does it happen
at peak time, in bursts, or is it a continuous overflow?
2. Are the tasks you're running in any way bound by time? What I mean
here is to ask how long you're allowed to wait. Is it milliseconds,
seconds, or hours, before a cast is a problem?
3. Are you in charge of producing the events in-system, or is it
something triggered by user actions, outside of your control?
4. Why does it take long to process? Is it CPU-bound work, a
dependency on other workers, or I/O-bound work (disk, network)
slowing your server down?
5. What's the nature of events you're handling?
Answering each of these questions will be the first step to being able
to pick an adequate solution. Here are a few possibilities:
- If you're in charge of producing events (your system creates them from
some static data source, for example) and can regulate them, do so by
having a fixed number of producers make synchronous calls, putting
back-pressure from your server onto the producers. They won't do more
work than the consuming part of your system can handle.
In general terms, applying back-pressure this way is the most
effective way to solve and survive overload issues. It's a bit
tricky because it means you're pushing the problem up a level in your
stack, until at some point either the pressure gets absorbed somewhere,
or you end up pushing the back-pressure onto your users, and that's
sometimes not acceptable. Pushing it back to some load-balancing
mechanism that dispatches across more instances is often acceptable
as an alternative.
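As a minimal sketch of that first option (names like `consumer` and
`next_event/0` are made up for illustration):

```erlang
%% Each producer blocks in gen_server:call/3 until the consumer has
%% actually processed the event, so a fixed pool of producers can
%% never outrun the consuming side of the system.
producer_loop(Consumer) ->
    Event = next_event(),  %% pull the next event from the static source
    ok = gen_server:call(Consumer, {event, Event}, infinity),
    producer_loop(Consumer).
```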
- You may expect tasks to take long to run, but to be quick to
acknowledge. In this case, moving to an asynchronous model makes
sense. This can be done by spawning workers to do the tasks while the
server simply accepts the queries, responds to them, and queues up
the answers that have yet to come.
This form of concurrency is different from adding processes for the
sake of parallelism. We don't expect to handle more requests (or at
least no more than Original*NumberOfCores); we just want to make the
accepting/answering of events and their handling disjoint, not
happening in the same timeline, to avoid blocking.
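Concretely, the usual trick is to return {noreply, State} from
handle_call/3 and let the worker answer the caller itself with
gen_server:reply/2 (a sketch only; `do_long_task/1` is hypothetical):

```erlang
%% Accept the request immediately, hand the slow part to a worker,
%% and let the worker send the real answer when it's done.
handle_call({task, Task}, From, State) ->
    spawn_link(fun() ->
        Result = do_long_task(Task),       %% the time-consuming bit
        gen_server:reply(From, Result)     %% answer the original caller
    end),
    {noreply, State}.                      %% the server is free again
```

The same shape works for casts, except there's no From to reply to, so
the worker would notify the server (or whoever cares) by message instead.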
- If requests are to be very short-lived, it may be interesting to just
not handle them when the system is very busy, and fail. This is a bit
more tricky to put in place, and I believe it's rarely a good solution
given the nature of the problems being solved. Systems where you're
allowed to give up and not handle things at all are a bit rare, I believe.
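If you do go that way, one cheap heuristic is to look at your own
message queue length and refuse work past some threshold (a sketch; the
threshold value is arbitrary and `do_task/1` is hypothetical):

```erlang
%% Refuse new work when the mailbox is already too deep, instead of
%% letting callers pile up behind a busy server.
handle_call({task, Task}, _From, State) ->
    {message_queue_len, Len} = erlang:process_info(self(), message_queue_len),
    case Len > 1000 of                     %% arbitrary overload threshold
        true  -> {reply, {error, overloaded}, State};
        false -> {reply, do_task(Task), State}
    end.
```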
- If your handling of events is slow due to the task simply being long
to handle, you could try to figure the ideal rate at which you process
data and the overload you could handle in peak hours. This means
figuring out how many requests per second (or whatever time slice) you
can handle, how much you receive, and then finding a way to raise this
value through optimization, or adding more processes or more machines to
handle it. This is often a good way to proceed, as far as I know, when
you deal with predictable levels of overload or a constant level of load.
If the problem is CPU-bound, then there's an upper
limit to how much parallelism will help you. A better choice of algorithm
or data structure, going down to HiPE or C, or finally using more
computers to do the work can all be considered.
If it's network or disk bound, then you have plenty of different
options to try. SSDs, compressing data, buffering before pushing it
around, possibly merging events or dropping non-vital ones, adding
more end-points (similar to sharding) may all help reduce that cost.
- It's possible you have different kinds of events, either from different
sources or to different endpoints. If that happens, it may be
interesting to quickly dispatch events from your central process to
workers dedicated to a source and/or an endpoint. This will naturally
divide the workload done and may solve your problem to some extent. If
what you get is extremely uneven distribution of events (for example,
95% of them are from one source to one endpoint), then dividing your
dispatching and handling is likely to only help about 5% of requests.
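The dispatching itself can stay tiny; something like this (the #state{}
record and the one-worker-per-source layout are assumptions of mine):

```erlang
%% The central process only routes; each source gets its own worker,
%% so slow handling for one source no longer blocks the others.
handle_cast({event, Source, Event}, State = #state{workers = Workers}) ->
    Worker = maps:get(Source, Workers),    %% one dedicated worker per source
    gen_server:cast(Worker, {event, Event}),
    {noreply, State}.
```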
- Dropping out of OTP is a possibility, as Loïc suggested, but I would
personally only do it once you know for a fact that OTP is one of the
bottlenecks causing problems. I think OTP-by-default is a sane thing to
have, and while dropping down is always an option, I think you should
be able to debate and prove why it made sense before actually doing it.
- If you manage to fix your sequential bottleneck, you'll possibly find
out you're creating a new one further down the system, up until you
either get rid of all of them, or you reach a point where you need to
apply back-pressure at a hard limit. This one is particularly painful
because it may mean parts of your hard work need to be undone to start
bubbling the back-pressure mechanisms up to a higher level.
It may be interesting to make sure you know the true underlying cause
of your problem there, to avoid optimizing towards a wall that way.
That's about what I can manage to think of this morning. I hope this
proves helpful to anybody out there and that I didn't insert too many
typos or errors.
On 12/12, Martin Dimitrov wrote:
> Hi all,
> In our application, we have a gen_server that does a time consuming
> operation. The message is of type cast thus the caller process doesn't
> sit and wait for the operation to finish. But while the gen_server is
> busy with the cast message, it doesn't serve any other call, right?
> So, would it be appropriate to create a process that will do the time
> consuming operation and then notify the gen_server?
> Thanks for looking into this.
> Best regards,