[erlang-questions] +swt very_low doesn't seem to avoid schedulers getting

Scott Lystig Fritchie fritchie@REDACTED
Sat Nov 3 23:36:19 CET 2012


A few weeks ago, Rickard Green <rickard@REDACTED> wrote:

> However, you can call "statistics(run_queues)" (note that the argument 
> should be 'run_queues', and not 'run_queue') repeatedly, say once every 
> 100 ms (perhaps even more frequent than this) for say 10 seconds when 
> the system is in this state. That information will at least give us a
> good hunch of what is going on. statistics(run_queues) returns a tuple 
> containing the run queue length of each run queue as elements.
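For anyone who wants to collect the samples Rickard describes, here is
a minimal sketch of such a sampling loop. The module name runq_sample
and the function sample/3 are made up for illustration; they are not
part of OTP or Riak. The probe is passed in as a fun so the loop itself
does not depend on any particular statistics/1 item.

```erlang
-module(runq_sample).
-export([sample/3]).

%% Call Probe() Count times, sleeping IntervalMs milliseconds after
%% each call, and return the results in the order they were taken.
%% (Hypothetical helper, not an OTP function.)
sample(Probe, Count, IntervalMs)
  when is_function(Probe, 0), is_integer(Count), Count >= 0 ->
    sample(Probe, Count, IntervalMs, []).

sample(_Probe, 0, _IntervalMs, Acc) ->
    lists:reverse(Acc);
sample(Probe, N, IntervalMs, Acc) ->
    Result = Probe(),
    timer:sleep(IntervalMs),
    sample(Probe, N - 1, IntervalMs, [Result | Acc]).
```

Sampling once every 100 ms for roughly 10 seconds, as suggested in the
quoted text, would then be:

    runq_sample:sample(fun() -> erlang:statistics(run_queues) end, 100, 100).

Each element of the returned list is one run-queue tuple as described
above.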

Hiya.  We have some new data from three customer machines running
Riak 1.2.1 with R15B01 that all hit what appears to be this same
"schedulers getting stuck" problem.

The machines were fixed before I was aware of them, so I didn't
get a chance to rummage around.  We do not have the output that
you suggested, statistics(run_queues).  However, we do have
samples of the traces that are generated by:

    erlang:trace(all, true, [running,scheduler_id])

When stuck, the output looks like this, with tuples of
{scheduler #, # of samples}:

    (riak@REDACTED)1> schedstat:run().
    <0.16760.459>
    === in scheduler count===
    [{1,264},
     {2,257},
     {3,0},
     {4,0},
     ... and repeating zero samples all the way
     to scheduler 64.

When unstuck, the output looks like this:

    (riak@REDACTED)1> schedstat:run().
    <0.3422.460>
    === in scheduler count===
    [{1,65},
     {2,5},
     {3,0},
     {4,0},
     {5,14},
     {6,73},
     {7,0},
     {8,0},
     {9,0},
     {10,0},
     {11,0},
     {12,0},
     {13,159},
     {14,182},
     {15,6},
     {16,0},
     ... and repeating zero samples all the way
     to scheduler 64.
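
Both listings above have the shape [{SchedulerId, SampleCount}], with a
zero entry for every scheduler that was never seen. The following is
only a sketch of how such a list can be built from a flat list of
scheduler ids; it is not Jon's script, and the module name sched_tally
and the function tally/2 are invented for illustration. How the ids are
extracted from the trace messages is deliberately left out.

```erlang
-module(sched_tally).
-export([tally/2]).

%% Given a list of scheduler ids (one per sample) and the number of
%% schedulers, return [{SchedulerId, Count}] for ids 1..NumSchedulers.
%% Schedulers with no samples get a count of 0; ids outside the range
%% are ignored. (Hypothetical helper, not part of OTP or Riak.)
tally(Ids, NumSchedulers) ->
    Counts = lists:foldl(
               fun(Id, Acc) -> orddict:update_counter(Id, 1, Acc) end,
               orddict:new(),
               Ids),
    [{Id, count_of(Id, Counts)} || Id <- lists:seq(1, NumSchedulers)].

count_of(Id, Counts) ->
    case orddict:find(Id, Counts) of
        {ok, N} -> N;
        error -> 0
    end.
```

A stuck node would show up as nonzero counts for only the first couple
of ids, as in the first listing.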

I do not know if the +swt flag was used on these machines,
sorry.

Raw output, courtesy of Kelly McLaughlin, is available at
https://gist.github.com/4009035.  The generating script is
by Jon Meredith at https://gist.github.com/a460a9dbb11698cf01a6.

The make-it-unstuck method is this:

    %% Get current number of online schedulers
    Schedulers = erlang:system_info(schedulers_online).
    
    %% Reduce number online to 1
    erlang:system_flag(schedulers_online, 1).
    
    %% Restore to original number of online schedulers
    erlang:system_flag(schedulers_online, Schedulers).
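
The same three steps can be wrapped in one function so that they can be
run as a single call, for example over a remote shell. This is a sketch
only: the module name sched_bounce and the function bounce/0 are
hypothetical, and it assumes nothing else changes schedulers_online
while it runs.

```erlang
-module(sched_bounce).
-export([bounce/0]).

%% Drop the number of online schedulers to 1, then restore the
%% original value. Returns the original (and restored) count.
%% (Hypothetical helper wrapping the steps shown above.)
bounce() ->
    Original = erlang:system_info(schedulers_online),
    %% system_flag/2 returns the previous setting; it is ignored here.
    _ = erlang:system_flag(schedulers_online, 1),
    _ = erlang:system_flag(schedulers_online, Original),
    Original.
```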

It isn't clear yet if the next release of Riak will use R15B02
or remain with R15B01.  We were bitten by performance regressions
(not caught during our pre-release testing) when releasing the
packages that moved from R14B04 to R15B01.  There's the devil
we're getting to know versus the devil that takes a heck of a lot
more time to get to know....

-Scott


