[erlang-questions] Cost of doing +sbwt?

Paul Davis
Tue Sep 1 21:14:48 CEST 2015

On Tue, Sep 1, 2015 at 8:37 AM, Jesper Louis Andersen wrote:
> On Tue, Sep 1, 2015 at 9:16 AM, Paul Davis wrote:
>> Our first data point was to try and look at strace. What we noticed
>> was that scheduler threads seemed to spend an inordinate amount of
>> time in futex system calls. An strace run on a scheduler thread showed
>> more than 50% of time in futex sys calls.
>
> I may have other comments later, but this should tip you off as to what
> happens. A futex() syscall is made whenever a lock is contended. The
> uncontended case can be handled with no kernel invocation. If you spend your
> time here, you are contended on some resource somewhere inside the system.
>
> Like Hynek, if you are running on the bare metal and not in some puny
> hypervisor, then setting something like `+sbt db` is often worth it. It
> binds schedulers to physical cores so they don't jump around and destroy
> your TLBs and caches into oblivion.
>
> There are two paths I'd pursue here: lockcnt instrumentation in a
> staging environment and looking at where that contention is. Try to
> reproduce it. Or pray to god you are running on FreeBSD/Illumos in
> production, in which case you can find the lock contention with a 5 line
> DTrace script on the production cluster :)
>
> Also, look at the current scheduler utilization!
> erlang:statistics(scheduler_wall_time) (read the man page, you need a
> system_flag too). You want to look at how much time the schedulers are
> spending doing useful work and how much time they are just spinning waiting
> for more work to come in. Though to me, the high CPU count you are seeing
> more strongly suggests a contended lock.
> --
> J.
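
For anyone following along, here is a minimal sketch of the
scheduler_wall_time measurement described above. The module and function
names (sched_util, sample/1, utilization/2) are made up for illustration;
only erlang:system_flag/2 and erlang:statistics/1 are real calls. It takes
two snapshots and reports, per scheduler, the fraction of wall time that
was counted as active in between.

```erlang
-module(sched_util).
-export([sample/1, utilization/2]).

%% Sample per-scheduler utilization over IntervalMs milliseconds.
%% (Hypothetical helper; not part of OTP.)
sample(IntervalMs) ->
    %% Accounting is off by default and has to be switched on first.
    erlang:system_flag(scheduler_wall_time, true),
    T0 = lists:sort(erlang:statistics(scheduler_wall_time)),
    timer:sleep(IntervalMs),
    T1 = lists:sort(erlang:statistics(scheduler_wall_time)),
    erlang:system_flag(scheduler_wall_time, false),
    utilization(T0, T1).

%% Each snapshot entry is {SchedulerId, ActiveTime, TotalTime}.
%% Utilization is the change in active time over the change in total time.
utilization(T0, T1) ->
    [{Id, safe_div(A1 - A0, W1 - W0)}
     || {{Id, A0, W0}, {Id, A1, W1}} <- lists:zip(T0, T1)].

%% Guard against a zero-length interval.
safe_div(_, 0) -> 0.0;
safe_div(N, D) -> N / D.
```

Calling sched_util:sample(5000) from a remote shell should return a list of
{SchedulerId, Fraction} pairs that can be compared before and after a
settings change.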

Some more data on this. We don't set +sbt on either R14B01 or 17.5. As
I understand it, that means the schedulers are unbound both before and
after the upgrade.

As to hyperthreading and reducing scheduler count: after we reverted a
cluster back to R14B01 and kept seeing the elevated system CPU, one of
my theories was that the cluster had been experiencing scheduler
collapse, which artificially limited the number of schedulers and in
turn reduced lock contention. To try and approximate that, I spent some
time playing with different numbers of schedulers online. Setting the
online schedulers to half or a third made a definite impact on the sys
CPU metrics, but it wasn't an exact match to the pre-upgrade metrics.
Running +sbwt none gives behavior much closer to pre-upgrade while
keeping all 24 schedulers online.
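
For reference, a rough sketch of how the number of online schedulers can
be changed at runtime. The module and function names (sched_online,
target/2, set_fraction/1) are hypothetical; the real calls are
erlang:system_info(schedulers) and
erlang:system_flag(schedulers_online, N). The busy wait threshold, by
contrast, is an emulator start flag, e.g. `erl +sbwt none` or a
`+sbwt none` line in vm.args.

```erlang
-module(sched_online).
-export([target/2, set_fraction/1]).

%% Number of schedulers to keep online for a given total and divisor,
%% never going below one scheduler. (Hypothetical helper.)
target(Total, Divisor) when is_integer(Total), Total > 0,
                            is_integer(Divisor), Divisor > 0 ->
    max(1, Total div Divisor).

%% Set schedulers online to Total div Divisor on the running node.
%% system_flag/2 returns the previous value, so the result is
%% {OldOnline, NewOnline}.
set_fraction(Divisor) ->
    Total = erlang:system_info(schedulers),
    New = target(Total, Divisor),
    Old = erlang:system_flag(schedulers_online, New),
    {Old, New}.
```

With 24 schedulers, sched_online:set_fraction(2) would leave 12 online and
sched_online:set_fraction(3) would leave 8.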

Regarding Jesper's comment on futex, that was our thought as well. The
fact that we're in the kernel for futex calls should theoretically mean
we're contending quite hard for locks somewhere. That was what led to
trying to disable the thread pool to see if that was a lock having
issues, and eventually what led me to randomly trying +sbwt none.

Here's a gist with some data from lcnt and strace that I collected.
The lcnt run is before I set +sbwt none, and there's an strace from
before and after. I'm going to try and get some data from the
scheduler_wall_time stuff now and will update with anything I find on

Also, does anyone have a quick pointer to where the busy wait loop is?
After I look at the scheduler time I was going to go find that code
and see if I couldn't come up with a better idea of what exactly might
be changing with that setting.
