[erlang-patches] Add supervisor:start_child/3 to limit the number of children

Wed Apr 10 14:51:11 CEST 2013

On 04/10, Vance Shipley wrote:
>   
> In my case we are creating a child worker process to manage the lifecycle 
> of a transaction.  We process thousands of transactions per second.  A 
> transaction may take tens of milliseconds or tens of seconds.  We require
> a limit on the number of possible ongoing transactions.
> 

Have you considered using ETS counters, and possibly a monitor process?
The idea is that if you have thousands of connections, each of them
tries to increment an ETS counter outside of the supervision structure.

In my experience with anything that ended up being high throughput or
low latency, what could kill you was not necessarily that the counter
was high, but how much contention there was on it.

If you're in the kind of position where you need to limit the number of
transactions to avoid falling over, enforcing the limit inside the
supervisor will *not* reduce the number of messages sent to the
supervisor, and if you start going over the top, you'll time out no
matter what, just because the supervisor won't be able to keep up with
the demand.

It takes a while before reaching that level, but in these cases, what I
end up doing most of the time is keeping an ETS counter that caps itself
at the max level given. Increment the counter as an atomic operation (a
write operation that also reads; since it takes the write path,
{write_concurrency,true} is the table option that applies). Assuming an
entry of the form {transactions, N}:

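    -spec can_start(ets:tab()) -> boolean().
    can_start(Table) ->
        %% the counter should start at 0 when initiating things
        %% get_env/2 returns {ok, Value}, so unwrap it
        {ok, MaxValue} = application:get_env(your_app, max_trans),
        %% {2, 0} reads the value before the increment; the second op
        %% increments and caps at MaxValue. Both run as one atomic call.
        [Prev, _New] = ets:update_counter(Table,
                                          transactions,
                                          [{2, 0}, {2, 1, MaxValue, MaxValue}]),
        %% accept only if a slot was free before this call
        Prev < MaxValue.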

With that call, the max value is easily configurable, the counter keeps
a ceiling at the max value, and it will be much, much faster to deny
(and accept) requests while keeping your supervisor less loaded.
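
For what it's worth, the plumbing around that check could look something
like the sketch below. The module name trans_counter and the functions
new/0, acquire/2 and release/1 are made up for the example, and the
limit is passed as an argument instead of being read from the
application env, so treat it as an illustration rather than a finished
design.

```erlang
-module(trans_counter).
-export([new/0, acquire/2, release/1]).

%% Hypothetical helper module; all names here are illustrative only.

%% Create a public table so that any process can bump the counter
%% directly, and seed the {transactions, N} entry at 0.
-spec new() -> ets:tab().
new() ->
    Table = ets:new(trans_counter, [set, public, {write_concurrency, true}]),
    true = ets:insert(Table, {transactions, 0}),
    Table.

%% Try to take a slot. {2, 0} returns the value before the increment,
%% the second op increments and caps at Max; the whole list is applied
%% atomically. A slot was taken only if the previous value was below Max.
-spec acquire(ets:tab(), pos_integer()) -> boolean().
acquire(Table, Max) ->
    [Prev, _New] = ets:update_counter(Table, transactions,
                                      [{2, 0}, {2, 1, Max, Max}]),
    Prev < Max.

%% Give a slot back. The counter is clamped so it never drops below 0.
%% Returns the new counter value.
-spec release(ets:tab()) -> non_neg_integer().
release(Table) ->
    ets:update_counter(Table, transactions, {2, -1, 0, 0}).
```

A caller would run acquire/2 first and only ask the supervisor for a
child when it returns true, so denied requests never reach the
supervisor at all. Note that the table goes away with the process that
created it, so new/0 should be called from something long-lived.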

Now what you'll need is a monitor process that will be able to decrement
the counter for you when you're done, but only for processes that
managed to get started. The management stuff can forget all about the
processes that couldn't get in there. In practice, it works very well,
and I've used a similar architecture for dispcount
(https://github.com/ferd/dispcount), which has been used in production
for over a year for low-latency scenarios. Now dispcount uses a fixed
pool size and *is* a pool, but the same mechanisms can be applied to a
more central system where one main counter is used.
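
To make the monitor idea a bit more concrete, here is a rough sketch of
such a process. This is not how dispcount does it; the module name
trans_monitor and the track/2 function are invented for the example,
and it assumes a public table that already holds the {transactions, N}
entry.

```erlang
-module(trans_monitor).
-behaviour(gen_server).

-export([start_link/1, track/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Hypothetical monitor process: it watches workers that were admitted
%% and gives their slot back in the {transactions, N} entry on exit.

%% Table must be public (or owned by this process) so it can be updated.
start_link(Table) ->
    gen_server:start_link(?MODULE, Table, []).

%% Call this only for workers that got in, i.e. after the counter was
%% incremented on their behalf. Rejected requests are never tracked.
track(Monitor, WorkerPid) ->
    gen_server:call(Monitor, {track, WorkerPid}).

init(Table) ->
    {ok, Table}.

handle_call({track, Pid}, _From, Table) ->
    %% If Pid is already dead, the 'DOWN' message is delivered right
    %% away, so the slot is still released.
    _Ref = erlang:monitor(process, Pid),
    {reply, ok, Table}.

handle_cast(_Msg, Table) ->
    {noreply, Table}.

handle_info({'DOWN', _Ref, process, _Pid, _Reason}, Table) ->
    %% Decrement, clamped at 0: {Pos, Incr, Threshold, SetValue}.
    _ = ets:update_counter(Table, transactions, {2, -1, 0, 0}),
    {noreply, Table};
handle_info(_Other, Table) ->
    {noreply, Table}.
```

The monitor only hears about processes that were admitted, so its load
follows the number of running transactions rather than the number of
incoming requests. track/2 is a synchronous call here to keep the
example easy to follow; a cast would do if the extra round trip matters.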

This will, in my experience, be more scalable as an approach than
modifying supervisors' internal state and relying on it. In the
benchmarks we ran at the job where I wrote dispcount, a single process
could chug on maybe 9000 messages a second before starting to get
swamped and using more resources than necessary (I can't remember what
hardware I used for the benchmark). Using the ETS approach on the same
hardware, I wasn't able to even get to the point where it was
problematic -- allocation of processes to generate contention and
gathering statistics turned out to be a bigger bottleneck.

That's without counting that, with ETS counters, getting a response back
was a matter of microseconds, with peak times under 5ms. Using
messages, it was very easy to see roundtrip times well above 70ms, and
those were with dedicated processes, not processes like supervisors that
also need to do a lot of other stuff.

As I said, it is more scalable and more performant. It is, however, not
available out of the box.

Regards,
Fred.