<div>I sense a few misconceptions here.</div><div> </div><div>First, I assume that the proposal for "kill and restart" is to kill and restart only the particular process whose inbox is "too full." This does not mean that the client crashes -- it means that those messages are lost. This is no different from how Erlang systems stay robust in the face of rare crashing bugs (as opposed to, say, C++, where generally the entire system goes down because of a rare stray pointer bug). Basically, crashing and restarting an Erlang worker process is just one way of clearing out the message queue, and it also makes sure that any possible state corruption goes away because the process restarts "afresh." The Erlang/OTP supervision tree is designed to work in this mode.</div>

<div> </div><div>Second, when load comes in that you cannot control, what is generally done is to model the load, model the application, and provision enough server hardware to keep up. In an emergency (an unexpected surge that doubles load compared to anything seen before), you'll additionally want the ability to reject some part of the incoming requests. For HTTP, this is where status 503 (Service Unavailable) comes in, for example. I'm assuming all your clients use some common protocol, like TCP or HTTPS or whatever, and that you do appropriate protection against untrusted data at that layer.</div>
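<div>To make the "reject some part of the incoming requests" idea concrete, here is a minimal sketch in Python (not Erlang, just for brevity). The class name, the 202 status for accepted work, and the <code>max_pending</code> limit are all assumptions made up for illustration; the only point is that the queue is bounded and the overflow gets a 503 right away instead of piling up.</div>

```python
from collections import deque


class AdmissionController:
    """Bounded work queue that sheds load instead of growing without limit."""

    def __init__(self, max_pending):
        # Hypothetical capacity limit; in practice it comes from load modeling.
        self.max_pending = max_pending
        self.pending = deque()

    def offer(self, request):
        """Return an HTTP-style status: 202 if queued, 503 if shed."""
        if len(self.pending) >= self.max_pending:
            return 503  # Service Unavailable: the client should back off
        self.pending.append(request)
        return 202

    def take(self):
        """Hand the oldest queued request to a worker, or None if idle."""
        return self.pending.popleft() if self.pending else None
```

<div>Rejecting at the front door is cheap: the request is never queued, so it costs no memory and no worker time.</div>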

<div> </div><div>When it comes to cluster sizes, we're running an 11-node cluster with more than 100,000 users and it's running mostly idle on a gigabit switch. I would consider this a "smallish cluster." We're planning on increasing our data rates a lot in the future, though -- at some point, we'll need to provision to 10 Gbps. We scale using a crossbar and consistent hashing.</div>
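<div>For readers who have not met consistent hashing, a generic sketch in Python follows. It is not our production code; the class name, the use of SHA-256, and the 64 virtual points per node are assumptions chosen for illustration. Each node owns many points on a ring, and a key goes to the node owning the next point clockwise from the key's hash.</div>

```python
import bisect
import hashlib


class HashRing:
    """Illustrative consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        # Place `vnodes` points per node on the ring, sorted by hash value.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(text):
        # First 8 bytes of SHA-256 as an integer position on the ring.
        return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")

    def node_for(self, key):
        # Next point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._keys, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

<div>The useful property is that removing a node only remaps the keys that lived on that node; every other key stays where it was.</div>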

<div>I've heard of clusters that do a million users per node, and use broadcast to all other nodes in a cluster of 50 nodes. That also scales on available networking hardware, as long as most users are not generators of large or frequent packets.</div>

<div>I would advise against single-core nodes or cloud-based nodes that don't have local networking, because these get much less work done per node (and per network packet) than larger systems. Buying a single server from Dell today, you get 12 cores and 24 hardware threads even on the low end. Next year, that number will be 40, 80 or even 160 (for the higher end).</div>

<div> </div><div>So, in your case, I would suggest three things: know what protocol the clients use to connect to the server; have some way of reporting temporary capacity overload to the clients; and have good metrics on the utilization of the server cluster (CPU, memory, network bandwidth, etc.) so that you can put in more hardware when needed. If CPU stays at 100% for a long time (which would be a precondition for a queue to fill up), start rejecting requests and log an alert for the operator to buy more hardware.</div>
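<div>One way to turn "CPU at 100% for a long time" into a trigger is a sliding window over utilization samples, sketched below in Python. The class name, the 0.95 threshold, and the 30-sample window are assumptions for illustration; where the samples come from (your metrics system) is left out.</div>

```python
from collections import deque


class OverloadDetector:
    """Flags overload only when every sample in the window is above threshold."""

    def __init__(self, threshold=0.95, window=30):
        self.threshold = threshold           # hypothetical "pegged CPU" level
        self.samples = deque(maxlen=window)  # keeps only the newest samples

    def record(self, cpu_utilization):
        """Add one utilization sample in the range 0.0 to 1.0."""
        self.samples.append(cpu_utilization)

    def overloaded(self):
        """True once the window is full and no sample dipped below threshold."""
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and min(self.samples) >= self.threshold
```

<div>Requiring a full window keeps a short spike from triggering rejections; only sustained saturation does.</div>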

<div>I also recommend modeling the traffic across the backplane of the Erlang nodes. How much data do you send per user "event" to other users, and how many other users? Broadcast or point-to-point? Sum it all up, double it, and see if you can still swing that on your current network backplane. If not, buy a bigger network, or start working on ways to compress/reduce the data stream :-)</div>
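<div>The back-of-the-envelope calculation can be written down in a few lines of Python. The function and parameter names are made up for illustration, and it assumes the worst case where every event crosses the backplane and ignores protocol overhead.</div>

```python
def backplane_bits_per_second(users, events_per_user_per_s, bytes_per_event,
                              recipients_per_event, headroom=2.0):
    """Estimate inter-node traffic: sum it all up, then double it."""
    payload_bytes = (users * events_per_user_per_s
                     * bytes_per_event * recipients_per_event)
    return payload_bytes * 8 * headroom


def fits(required_bps, link_bps):
    """True if the link can carry the estimated traffic."""
    return link_bps >= required_bps
```

<div>With invented numbers (100,000 users, one event every two seconds, 200 bytes per event, 10 recipients per event) this comes to 1.6 Gbps after doubling, which would not fit a gigabit backplane but would fit a 10 Gbps one.</div>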

<div> </div><div>Sincerely,</div><div> </div><div>jw</div><div><br clear="all"><br>--<br>Americans might object: there is no way we would sacrifice our living standards for the benefit of people in the rest of the world. Nevertheless, whether we get there willingly or not, we shall soon have lower consumption rates, because our present rates are unsustainable. <br>

<br>

<br><br></div><div class="gmail_quote">On Wed, Jun 15, 2011 at 8:11 PM, József Bérces <span dir="ltr">&lt;<a href="mailto:jozsef.berces@ericsson.com">jozsef.berces@ericsson.com</a>&gt;</span> wrote:<br><blockquote style="margin: 0px 0px 0px 0.8ex; padding-left: 1ex; border-left-color: rgb(204, 204, 204); border-left-width: 1px; border-left-style: solid;" class="gmail_quote">

Thanks for all the thoughts and suggestions. If I got it right, there were two main branches:<br>

<br>

1. Avoid the congestion situation<br>

2. Detect and kill/restart the problematic process(es)<br>

<br>

The problem with these approaches is that the Erlang applications are not just talking among themselves but also receive input from other nodes. Those nodes can be very numerous and uncontrollable.<br>

<br>

As an example, let's take the mobile network, where the traffic is generated by millions of subscribers using mobile devices from many vendors. In this case we (1) cannot control the volume of the traffic and (2) cannot make sure that all the devices follow the protocol.<br>


So there can be situations when we cannot avoid congestion simply because the source of the traffic is beyond our reach.<br>

<br>

Killing and restarting is not the right way either:<br>

- A restart causes a total outage for a while, which is very unwelcome to the users (e.g. network operators) of our boxes<br>

- Erlang is advertised to be robust but killing and restarting is not a sign of robustness.<br>

  So the user can easily call us liars: "You say your node is robust but it is restarting frequently!"<br>

<br>

So I still believe that discarding signals very quickly is key to real robustness. Obviously, it should be used *only* in the right circumstances, but in those cases it would be the only way to keep the node alive and minimize the traffic loss.<br>


<br>

Then the question is still open: is discarding one by one the best we can do, or is there something more efficient to get rid of the excess traffic?<br>

<div><div></div><div class="h5"><br>

<br>

-----Original Message-----<br>

From: <a href="mailto:erlang-questions-bounces@erlang.org">erlang-questions-bounces@erlang.org</a> [mailto:<a href="mailto:erlang-questions-bounces@erlang.org">erlang-questions-bounces@erlang.org</a>] On Behalf Of Max Lapshin<br>


Sent: Thursday, June 16, 2011 0:25<br>

To: Mihai Balea<br>

Cc: erlang-questions Questions<br>

Subject: Re: [erlang-questions] Kill process if message mailbox reaches a certain size (was discarding signals)<br>

<br>

On Wed, Jun 15, 2011 at 8:32 PM, Mihai Balea &lt;<a href="mailto:mihai@hates.ms">mihai@hates.ms</a>&gt; wrote:<br>

<br>

> Are you referring to this thread?<br>

><br>

> <a href="http://erlang.2086793.n4.nabble.com/Why-Beam-smp-crashes-when-memory-is-over-tt2118397.html#none" target="_blank">http://erlang.2086793.n4.nabble.com/Why-Beam-smp-crashes-when-memory-is-over-tt2118397.html#none</a><br>

<br>

exactly<br>

_______________________________________________<br>

erlang-questions mailing list<br>

<a href="mailto:erlang-questions@erlang.org">erlang-questions@erlang.org</a><br>

<a href="http://erlang.org/mailman/listinfo/erlang-questions" target="_blank">http://erlang.org/mailman/listinfo/erlang-questions</a><br>


</div></div></blockquote></div><br>