[erlang-questions] +swt very_low doesn't seem to avoid schedulers getting

Rickard Green rickard@REDACTED
Sun Nov 4 02:05:32 CET 2012


This does not, however, show that anything is wrong. The statistics only show that a couple of hundred processes were selected for execution on the same schedulers during some timeframe. If there is no buildup in run-queue length, everything is as it should be.
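
A rough sketch of how that buildup check could be scripted. The module and function names are hypothetical, and it assumes statistics(run_queues) returns a tuple with one length per run queue, as described in the quoted text below; a list is also accepted in case a release returns one.

```erlang
-module(rq_sample).
-export([sample/2, totals/1]).

%% Take N samples of the per-run-queue lengths, pausing IntervalMs
%% milliseconds after each one. Returns one list of lengths per sample.
sample(N, IntervalMs) when is_integer(N), N > 0 ->
    [begin
         Lens = to_list(erlang:statistics(run_queues)),
         timer:sleep(IntervalMs),
         Lens
     end || _ <- lists:seq(1, N)].

%% Sum of all run queue lengths for each sample, in sampling order.
totals(Samples) ->
    [lists:sum(Lens) || Lens <- Samples].

%% Assumption: the result is a tuple; fall back to a list if not.
to_list(T) when is_tuple(T) -> tuple_to_list(T);
to_list(L) when is_list(L) -> L.
```

Calling rq_sample:totals(rq_sample:sample(100, 100)) covers roughly 10 seconds at 100 ms intervals. Totals that keep climbing would suggest a buildup; totals that stay near zero would suggest there is none.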

Regards,
Rickard

On Nov 3, 2012, at 11:36 PM, Scott Lystig Fritchie wrote:

> A few weeks ago, Rickard Green <rickard@REDACTED> wrote:
> 
>> However, you can call "statistics(run_queues)" (note that the argument 
>> should be 'run_queues', and not 'run_queue') repeatedly, say once every 
>> 100 ms (perhaps even more frequently than this) for say 10 seconds when 
>> the system is in this state. That information will at least give us a 
>> good hunch of what is going on. statistics(run_queues) returns a tuple 
>> containing the run queue length of each run queue as elements.
> 
> Hiya.  We have some new data from three customer machines running
> Riak 1.2.1 with R15B01 that all hit what appears to be this same
> "schedulers getting stuck" problem.
> 
> The machines were fixed before I was aware of them, so I didn't
> get a chance to rummage around.  We do not have the output that
> you suggested, statistics(run_queues).  However, we do have
> samples of the traces that are generated by:
> 
>    erlang:trace(all, true, [running,scheduler_id])
> 
> When stuck, the output looks like this, with tuples of
> {scheduler #, # of samples}
> 
>    (riak@REDACTED)1> schedstat:run().
>    <0.16760.459>
>    === in scheduler count===
>    [{1,264},
>     {2,257},
>     {3,0},
>     {4,0},
>     ... and repeating zero samples all the way
>     to scheduler 64.
> 
> When unstuck, the output looks like this:
> 
>    (riak@REDACTED)1> schedstat:run().
>    <0.3422.460>
>    === in scheduler count===
>    [{1,65},
>     {2,5},
>     {3,0},
>     {4,0},
>     {5,14},
>     {6,73},
>     {7,0},
>     {8,0},
>     {9,0},
>     {10,0},
>     {11,0},
>     {12,0},
>     {13,159},
>     {14,182},
>     {15,6},
>     {16,0},
>     ... and repeating zero samples all the way
>     to scheduler 64.
> 
> I do not know if the +swt flag was used on these machines,
> sorry.
> 
> Raw output, courtesy of Kelly McLaughlin, is available at
> https://gist.github.com/4009035.  The generating script is
> by Jon Meredith at https://gist.github.com/a460a9dbb11698cf01a6.
> 
> The make-it-unstuck method is this:
> 
>    %% Get current number of online schedulers
>    Schedulers = erlang:system_info(schedulers_online).
> 
>    %% Reduce number online to 1
>    erlang:system_flag(schedulers_online, 1).
> 
>    %% Restore to original number of online schedulers
>    erlang:system_flag(schedulers_online, Schedulers).
> 
> It isn't clear yet if the next release of Riak will use R15B02
> or remain with R15B01.  We were bitten by performance regressions
> (not caught during our pre-release testing) when releasing the
> packages that moved from R14B04 to R15B01.  There's the devil
> we're getting to know versus the devil that takes a heck of a lot
> more time to get to know....
> 
> -Scott
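
The make-it-unstuck sequence quoted above could be wrapped in a small helper, sketched here under assumptions: the module and function names are hypothetical, and the sketch relies on erlang:system_flag(schedulers_online, N) returning the previous setting, so no separate system_info call is needed.

```erlang
-module(sched_bounce).
-export([bounce/0]).

%% Drop to one online scheduler, then restore the previous number.
%% Returns the number of schedulers that were online before the call.
bounce() ->
    %% system_flag/2 returns the old value of the flag.
    Online = erlang:system_flag(schedulers_online, 1),
    erlang:system_flag(schedulers_online, Online),
    Online.
```

This does the same thing as the three shell commands; it only saves retyping them when the problem recurs.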



