[erlang-questions] +swt very_low doesn't seem to avoid schedulers getting

Rickard Green rickard@REDACTED
Thu Oct 18 20:13:00 CEST 2012


On 10/16/2012 06:20 AM, Scott Lystig Fritchie wrote:
> Rickard Green <rickard@REDACTED> wrote:
>
> rg> This is very much expected. Since you have work that can load 2
> rg> schedulers full time and you shut down all but one, the run queue
> rg> will grow. When you later release all schedulers, there will be
> rg> lots of work to pick from.
>
> Hi, Rickard.  Sorry I didn't reply earlier ... Basho kept me busy with
> an all-hands meeting and conference out in San Francisco.
>
> Perhaps I wasn't all that clear about the problem that I saw and that
> several other customers have witnessed.
>
> 1. One node in a Riak cluster is consuming significantly less CPU than
>     the other nodes in the cluster.  The imbalance is not due to
>     application layer workload imbalance, as far as we can tell.  (In the
>     case that I personally witnessed, it was a lab environment with an
>     artificial & deterministic load generator talking to all Riak nodes
>     equally (or trying very hard to)).
>
> 2. As soon as we do one of two things, the CPU imbalance disappears:
>      a. Restart the Riak app on the slow node.
>      b. Use the erlang:system_flag(schedulers_online, 1) hack and then
>         set it back to 8 using the same BIF.
>
> In situations described by customers, this seems to happen after a day
> or more of load, where the peak workload is substantially higher than
> off-peak workload.  In the lab environment that I witnessed, the load
> generators were cycling through 100%-off and 100%-on states.
>
> rg> This compaction of load onto fewer schedulers is there in order to
> rg> reduce communication overhead when there isn't enough work to fully
> rg> utilize all schedulers. The performance gain of this compaction
> rg> depends on the hardware.
>
> What you describe seems to be exactly what's happening ... except that
> when the input workload rises again, the idled schedulers aren't waking
> up, ever, unless we force them to wake up with the system_flag() BIF.
>
> rg> We have gotten reports about problems with this functionality, but
> rg> we have not found any bugs in it. We have only found that it
> rg> behaves as expected. That is, if more schedulers aren't woken, this
> rg> is because not enough overload has accumulated. The +swt switch was
> rg> introduced in order to give the user the possibility to define what
> rg> is enough overload for his or her taste.
>
> Hrm, well, we've seen it both with "+swt very_low" and without any +swt
> flag at all.  And it's extremely irritating.  :-)
>

I can of course not guarantee that there isn't a hard-to-find bug here,
but based on previous experience of similar situations, and having
actually tried hard to find a bug here, my guess is still that not much
more work is actually appearing for the schedulers to do, and that this
is why no more schedulers are woken up. The reason for this is more or
less impossible for me to guess, since I don't know the implementation
of your application.

My recommendation is to try out the "+sws proposal +swt very_low" combo 
in R15B02.
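For example (in a Riak release these emulator flags would typically go
in vm.args rather than directly on the erl command line; that placement
is just an assumption about your setup):

    erl +sws proposal +swt very_low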

> What info would you need gathered from the field when this bugaboo
> strikes next time?
>

Currently, one needs to hack the emulator in order to pull out the most 
important information, and, unfortunately, I don't have the time to do this.

However, you can call "statistics(run_queues)" (note that the argument 
should be 'run_queues', and not 'run_queue') repeatedly, say once every 
100 ms (perhaps even more frequently than that), for, say, 10 seconds 
while the system is in this state. That information will at least give 
us a good hunch of what is going on. statistics(run_queues) returns a 
tuple containing the run queue length of each run queue as elements.
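Something along these lines could be used to collect the samples (just
a rough sketch; the module and function names are made up, and the
100 ms / 10 second figures are only the ones suggested above):

    %% Sample statistics(run_queues) every 100 ms for about 10 seconds
    %% and return the collected samples, oldest first.
    -module(rq_sample).
    -export([sample/0]).

    sample() ->
        sample(100, []).                  %% 100 samples * 100 ms ~ 10 s

    sample(0, Acc) ->
        lists:reverse(Acc);
    sample(N, Acc) ->
        Lengths = statistics(run_queues), %% tuple: one length per run queue
        timer:sleep(100),
        sample(N - 1, [{os:timestamp(), Lengths} | Acc]).

Running rq_sample:sample() in a shell attached to the slow node while it
is in this state, and saving the result, would be a good starting point.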

Regards,
Rickard
-- 
Rickard Green, Erlang/OTP, Ericsson AB.


