[erlang-questions] +swt very_low doesn't seem to avoid schedulers getting "stuck"

Fri Oct 5 06:01:54 CEST 2012

Hi, all.  According to my private mailing list archive, there hasn't
been much mention of erl's "+swt" flag since about April 2012.

I just witnessed a case where a Riak 1.2.1rc2 node, running on
Erlang/OTP R15B01 with "+swt very_low", got "stuck": CPU consumption was
only 200% on an 8-core AWS instance.  The other nodes in that Riak
cluster were running at over 600% CPU utilization (on average).

When I ran this:

    {io:format("before..."), erlang:system_flag(schedulers_online, 1), 
     timer:sleep(1000), erlang:system_flag(schedulers_online, 8),
     io:format("after\n")}.

... then average CPU utilization on that node immediately shot up from
200% to about 760%.
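
For anyone who wants to try the same workaround without hard-coding the
scheduler count, here is a minimal sketch.  The module name sched_bounce
and the function bounce/1 are hypothetical names chosen for illustration;
they are not part of Riak or OTP.  The sketch relies only on
erlang:system_flag(schedulers_online, N) returning the previous setting,
and it is a workaround under those assumptions, not a fix for whatever
the underlying problem is.

```erlang
%% Hypothetical helper module (name is an assumption, not part of OTP/Riak).
-module(sched_bounce).
-export([bounce/0, bounce/1]).

%% Bounce with a one-second pause, as in the shell one-liner above.
bounce() ->
    bounce(1000).

%% Drop to a single online scheduler, wait PauseMs milliseconds, then
%% restore whatever count was in effect before.  Returns that count.
bounce(PauseMs) when is_integer(PauseMs), PauseMs >= 0 ->
    %% system_flag/2 returns the old value of the flag.
    Previous = erlang:system_flag(schedulers_online, 1),
    try
        timer:sleep(PauseMs)
    after
        %% Restore even if the sleep is interrupted by an exception.
        erlang:system_flag(schedulers_online, Previous)
    end,
    Previous.
```

Calling sched_bounce:bounce() from the shell should have the same effect
as the tuple expression above, except that it restores the previous
schedulers_online value instead of assuming 8.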

I'd heard a rumor that "+swt very_low" was supposed to avoid whatever
weird scheduler problem/bug causes some schedulers to appear as if
they weren't active.  But I was on site today, witnessed this
first-hand, and verified that the "+swt very_low" flag was indeed being
used.

I'm not certain exactly which Linux distribution and kernel were used.
I'll ask the customer to send me that info so I can forward it to the
list.

Has anyone else seen this behavior?  Unlike Knut Nesheim's report on
this list back in February 2012, Riak does not use the halfword
emulator.  We are using some NIFs, but this customer isn't using the
most evil one, the Riak eleveldb NIF library.  Instead, they're using
the Bitcask backend (which has a NIF component but isn't as evil as
eleveldb's NIF) and the merge_index backend (which is pure Erlang).

-Scott