[erlang-questions] Kill process if message mailbox reaches a certain size (was discarding signals)
Thu Jun 16 07:09:04 CEST 2011
I sense a few misconceptions here.
First, I assume that the proposal for "kill and restart" is to kill and
restart the particular process that has an inbox that is "too full." This
does not mean that the client crashes -- it means that those messages are
lost. This is no different to how Erlang systems are robust in the face of
rare crashing bugs (as opposed to, say, C++, where generally the entire
system goes down because of a rare stray pointer bug, for example).
Basically, crashing and re-starting an Erlang worker process is just one way
of clearing out the message queue, and also making sure that any possible
state corruption goes away because the process re-starts "afresh." The
Erlang/OTP supervision tree is designed to work in this mode.
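As a minimal sketch of this idea (module and threshold names are made up for illustration): a watchdog process polls a worker's mailbox length and kills the worker when it grows past a limit, so that its supervisor restarts it with an empty queue and fresh state.

```erlang
%% Illustrative sketch only -- the module name, polling interval and
%% threshold are assumptions, not a recommended production design.
-module(mailbox_watchdog).
-export([start/2]).

%% Spawn a watchdog that keeps an eye on WorkerPid's mailbox.
start(WorkerPid, MaxQueueLen) ->
    spawn(fun() -> loop(WorkerPid, MaxQueueLen) end).

loop(WorkerPid, MaxQueueLen) ->
    case erlang:process_info(WorkerPid, message_queue_len) of
        {message_queue_len, Len} when Len > MaxQueueLen ->
            %% Kill unconditionally. The queued messages are lost --
            %% which is exactly the point: the backlog is discarded
            %% and the supervisor restarts the worker "afresh."
            exit(WorkerPid, kill);
        _ ->
            %% Worker is under the limit (or already dead); do nothing.
            ok
    end,
    timer:sleep(1000),
    loop(WorkerPid, MaxQueueLen).
```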
Second, when you have an amount of load that comes in, and you cannot
control it, then what is generally done is to simply model the load, model
the application, and provision enough server hardware that you can keep up
with the load. In an emergency (an unexpected surge that doubles load
compared to anything seen before), you'll additionally want capabilities to
reject some part of the incoming requests. For HTTP, this is where status
503 (Service Unavailable) comes in, for example. I'm assuming all your clients use
some common protocol, like TCP or HTTPS or whatever, and that you do
appropriate protection against un-trusted data at that layer.
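To make the load-shedding point concrete, here is a hedged sketch (the module, the run-queue threshold, and the `process/2` stub are all illustrative assumptions) of rejecting requests with a 503 when the node looks overloaded:

```erlang
%% Illustrative sketch -- not a real server; the overload heuristic
%% (run-queue length > 100) and the handler names are invented.
-module(load_shed).
-export([handle_request/2]).

handle_request(Socket, Request) ->
    case overloaded() of
        true ->
            %% 503 tells well-behaved HTTP clients to back off and retry.
            gen_tcp:send(Socket,
                         <<"HTTP/1.1 503 Service Unavailable\r\n"
                           "Retry-After: 5\r\n"
                           "Content-Length: 0\r\n\r\n">>);
        false ->
            process(Socket, Request)
    end.

%% Crude overload signal: total length of the scheduler run queues.
overloaded() ->
    erlang:statistics(run_queue) > 100.

%% Stand-in for the real request handler.
process(Socket, _Request) ->
    gen_tcp:send(Socket, <<"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n">>).
```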
When it comes to cluster sizes, we're running an 11-node cluster with
>100,000 users and it's running mostly idle on a gigabit switch. I would
consider this a "smallish cluster." We're planning on increasing our data
rates a lot in the future, though -- at some point, we'll need to provision
for 10 Gbps. We scale using a crossbar and consistent hashing.
I've heard of clusters that do a million users per node, and use broadcast
to all other nodes in a cluster of 50 nodes. That also scales on available
networking hardware, as long as most users are not generators of large or
frequent traffic.
I would advise against single-core nodes or cloud-based nodes that don't
have local networking, because these get much less work done per node (and
per network packet) than larger systems. Buying a single server from Dell
today, you get 12 cores and 24 hardware threads even on the low end. Next
year, that number will be 40, 80 or even 160 (for the higher end).
So, in your case, I would suggest making sure that you know what the
protocol is that clients use to connect to the server, and then making sure
that you have some way of reporting temporary capacity overload to the
clients, and then making sure that you have good metrics on the utilization
of the server cluster (CPU, memory, network bandwidth, etc) so that you can
put in more hardware when needed. If CPU stays at 100% for a long time (which
would be a precondition for a queue to fill up), start rejecting requests
and log an alert for the operator to buy more hardware.
I also recommend modeling the traffic across the backplane of the Erlang
nodes. How much data do you send per user "event" to other users, and how
many other users? Broadcast or point-to-point? Sum it all up, double it, and
see if you can still swing that on your current network backplane. If not,
buy a bigger network, or start working on ways to compress/reduce the data.
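The back-of-the-envelope calculation above can be written out with made-up numbers (100,000 users, 1 event per user per second, 500-byte payloads, a fanout of 10, point-to-point delivery, then doubled for headroom -- all assumptions for illustration):

```erlang
%% Illustrative traffic model -- every number here is an assumption.
-module(backplane_model).
-export([required_bps/0]).

required_bps() ->
    Users = 100000,
    EventsPerUserPerSec = 1,
    BytesPerEvent = 500,
    Fanout = 10,                 %% each event is relayed to 10 other users
    Bps = Users * EventsPerUserPerSec * BytesPerEvent * Fanout * 8,
    2 * Bps.                     %% double it for headroom, as suggested above
```

With these numbers the result is 8,000,000,000 bits/s -- comfortably over a gigabit backplane, so this hypothetical deployment would already need 10 GbE.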
On Wed, Jun 15, 2011 at 8:11 PM, József Bérces wrote:
> Thanks for all the thoughts and suggestions. If I got it right, there were
> two main branches:
> 1. Avoid the congestion situation
> 2. Detect and kill/restart the problematic process(es)
> The problem with these approaches is that the Erlang applications are not
> playing with themselves but receive input from other nodes. Those nodes can
> be very numerous and uncontrollable.
> As an example, just let's take the mobile network where the traffic is
> generated by millions of subscribers using mobile devices from many vendors.
> In this case we (1) cannot control the volume of the traffic and (2) cannot
> make sure that all the devices follow the protocol.
> So there can be situations when we cannot avoid congestion simply because
> the source of the traffic is beyond our reach.
> Killing and restarting is not the right way either:
> - A restart causes total outage for a while that is very unwelcome by the
> users (e.g. network operators) of our boxes
> - Erlang is advertised to be robust but killing and restarting is not a
> sign of robustness.
> So the user can easily call us liars: "You say your node is robust but it
> is restarting frequently!"
> So I still believe that a very quick discard of the signals is key to real
> robustness. Obviously, it shall be used *only* in the right circumstances,
> but in those cases that would be the only way to keep the node alive and
> minimize the traffic loss.
> Then the question is still open: Is discarding them 1-by-1 the best we can
> do, or is there something more efficient to get rid of the excess traffic?
> -----Original Message-----
> From: [mailto: ] On Behalf Of Max Lapshin
> Sent: Thursday, June 16, 2011 0:25
> To: Mihai Balea
> Cc: erlang-questions Questions
> Subject: Re: [erlang-questions] Kill process if message mailbox reaches a
> certain size (was discarding signals)
> On Wed, Jun 15, 2011 at 8:32 PM, Mihai Balea <> wrote:
> > Are you referring to this thread?
> > http://erlang.2086793.n4.nabble.com/Why-Beam-smp-crashes-when-memory-is-over-tt2118397.html#none
> erlang-questions mailing list