[erlang-questions] +swt very_low doesn't seem to avoid schedulers getting

Wed Oct 10 21:12:23 CEST 2012

> Hi, all.  According to my private mailing list archive, there hasn't
> been much mention of erl's "+swt" flag since about April 2012.
>
> I just witnessed a case where a node using "+swt very_low" with Riak
> 1.2.1rc2, on Erlang/OTP R15B01, got "stuck": CPU consumption was
> only 200% on an 8 core AWS instance.  The other nodes in that Riak
> cluster were running at over 600% CPU utilization (on average).
>
> When I ran this:
>
>     {io:format("before..."), erlang:system_flag(schedulers_online, 1),
>      timer:sleep(1000), erlang:system_flag(schedulers_online, 8),
>      io:format("after\n")}.
>
> ... then average CPU utilization on that node immediately shot up from
> 200% to about 760%.

This is very much expected. Since you have work that can load 2
schedulers full time and you shut down all but one, the run-queue will
grow. When you later release all schedulers, there will be lots of work
to pick from.
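
To see that build-up directly, one could sample the total run-queue
length while only one scheduler is online. The following is a minimal
sketch, not something from the original report: the module name
runq_probe and the function probe/1 are made up for illustration. It
only uses erlang:statistics(run_queue) and
erlang:system_flag(schedulers_online, N), and it restores whatever
number of schedulers was online before.

```erlang
-module(runq_probe).
-export([probe/1]).

%% Hypothetical helper, for illustration only.
%% Restricts the node to one online scheduler for SleepMs milliseconds,
%% samples the total run-queue length before the restriction and again
%% just before lifting it, then restores the previous setting.
probe(SleepMs) ->
    Before = erlang:statistics(run_queue),
    %% system_flag/2 returns the previous number of online schedulers.
    Old = erlang:system_flag(schedulers_online, 1),
    timer:sleep(SleepMs),
    During = erlang:statistics(run_queue),
    erlang:system_flag(schedulers_online, Old),
    {Before, During}.
```

On a loaded node, runq_probe:probe(1000) would be expected to return a
second element noticeably larger than the first; on an idle node both
will typically be close to zero.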

>
> I'd heard a rumor that "+swt very_low" was supposed to avoid whatever
> weird scheduler problem/bug that caused some schedulers to appear as if
> they weren't active.  But I was on site today and witnessed this
> first-hand and verified that the "+swt very_low" flag was indeed being
> used.
>

The runtime system tries to compact the load onto as few schedulers as
possible without letting run-queues build up. It won't wake up new
schedulers unless some overload has accumulated. This overload shows up
either as a quickly growing run-queue or as a small run-queue over a
longer time. The +swt flag sets the threshold that is used for
determining when enough overload has accumulated to wake up another
scheduler.
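
As a rough illustration of the idea of accumulated overload, here is a
toy model. It is NOT the emulator's actual algorithm: the module name
wakeup_toy, the notion of a "tick", and the numeric thresholds are all
assumptions made up for this sketch. It simply sums the observed
run-queue lengths and reports the tick at which the sum first reaches
the threshold.

```erlang
-module(wakeup_toy).
-export([ticks_until_wakeup/2]).

%% Toy model only; not how the runtime system is implemented.
%% RunQueueLens: run-queue length observed at each (hypothetical) tick.
%% Threshold:    hypothetical amount of accumulated overload required
%%               before another scheduler would be woken.
%% Returns {wakeup, Tick} for the first tick at which the accumulated
%% overload reaches Threshold, or no_wakeup if it never does.
ticks_until_wakeup(RunQueueLens, Threshold) ->
    accumulate(RunQueueLens, Threshold, 0, 1).

accumulate([], _Threshold, _Acc, _Tick) ->
    no_wakeup;
accumulate([Len | Rest], Threshold, Acc, Tick) ->
    NewAcc = Acc + Len,
    case NewAcc >= Threshold of
        true  -> {wakeup, Tick};
        false -> accumulate(Rest, Threshold, NewAcc, Tick + 1)
    end.
```

In this model a quickly growing queue such as [10, 20, 40] reaches a
threshold of 50 at tick 3, while a small steady queue of length 2 needs
25 ticks to get there. Lowering the threshold to 10 makes the first
case trigger at tick 1, which is the kind of effect a lower +swt
setting is meant to have.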

This compaction of load onto fewer schedulers is there in order to
reduce communication overhead when there isn't enough work to fully
utilize all schedulers. The performance gain of this compaction depends
on the hardware.

We have gotten reports about problems with this functionality, but we
have not found any bugs in it. We have only found that it behaves as
expected. That is, if more schedulers aren't woken, it is because not
enough overload has accumulated. The +swt switch was introduced in
order to let the user define what is enough overload for his or her
taste.

The currently used wakeup strategy is very quick to forget about 
previously accumulated overload that has disappeared. Maybe even too 
quick for my taste when "+swt very_low" is used. I've therefore 
implemented an alternative strategy that most likely will be the default 
in R16. As of R15B02 you can try this strategy out by passing "+sws 
proposal" as a command line argument. In combination with "+swt 
very_low", the runtime system should be even more eager to wake up 
schedulers than when only using "+swt very_low".

> I'm not certain what exact Linux distribution and kernel was used.  I'll
> ask the customer to send me that info so I can forward it to the list.
>
> Has anyone else seen this behavior?  Unlike Knut Nesheim's report on
> this list back in February 2012, Riak does not use the halfword
> emulator.  We are using some NIFs, but this customer isn't using the
> most evil one, the Riak eleveldb NIF library.  Instead, they're using
> the Bitcask backend (which has a NIF component but isn't as evil as
> eleveldb's NIF) and the merge_index backend (which is pure Erlang).
>
> -Scott

Regards,
Rickard
-- 
Rickard Green, Erlang/OTP, Ericsson AB.