<div dir="ltr">The interesting thing is that I didn't see any processes that have long message queues. During regular operation of the nodes, I rarely see more than 5 messages in the queue of the most loaded processes. During the period of high CPU usage this didn't change. This made me think that this is not a problem with regular Erlang code load, but with something else.<div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On 23 June 2016 at 15:33, Jesper Louis Andersen <span dir="ltr"><<a href="mailto:jesper.louis.andersen@gmail.com" target="_blank">jesper.louis.andersen@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><span class=""><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Jun 23, 2016 at 12:35 PM, Eli Iser <span dir="ltr"><<a href="mailto:eli.iser@gmail.com" target="_blank">eli.iser@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">* run_queue - affected nodes had a run queue of several dozen (fewer than 100), while unaffected nodes had 0 (always). Since run_queues is undocumented (at least I didn't see it in the documentation), I didn't run it at the time of the problem.</blockquote></div><br></div></span><div class="gmail_extra">A queue's load can be seen as a (real) number K.<br><br></div><div class="gmail_extra">If K < 1 it means your system can dequeue messages faster than they arrive. This leads to a queue size of 0 over time.<br><br></div><div class="gmail_extra">If K = 1 it means that your system dequeues at the same rate as the arrival rate. 
This leads to a standing queue.<br><br></div><div class="gmail_extra">If K > 1 the queue will slowly fill up because the arrival rate is larger than the processing/dequeue rate.<br><br></div><div class="gmail_extra">In your case, you are either in the K = 1 or the K > 1 situation for a while. This usually leads to more load on the system because there is more work to do. Note, however, that a 100% CPU load isn't necessarily a problem, unless response latencies are also affected. If you start a periodic background job which is CPU-bound, this will take up all the free resources, but it will hopefully be scheduled out of the core whenever other work arrives to make way for faster processing.<br><br></div><div class="gmail_extra">In other words, you may want to figure out what happens inside the processes with the larger message queues, and what events could lead to the longer message queues. A common case is that a specific user or subsystem triggers the situation through normal use, but that use hits an edge case.<span class="HOEnZb"><font color="#888888"><br><br></font></span></div><span class="HOEnZb"><font color="#888888"><div class="gmail_extra"><br clear="all"><br>-- <br><div data-smartmail="gmail_signature">J.</div>

</div></font></span></div>

</blockquote></div><br></div>