[erlang-questions] Cost of doing +sbwt?

Tue Sep 1 21:16:13 CEST 2015

Doh, a link for that gist would be handy:

https://gist.github.com/davisp/3ab37e9c69522fe1badd

On Tue, Sep 1, 2015 at 2:14 PM, Paul Davis <paul.joseph.davis@REDACTED> wrote:
> On Tue, Sep 1, 2015 at 8:37 AM, Jesper Louis Andersen
> <jesper.louis.andersen@REDACTED> wrote:
>>
>> On Tue, Sep 1, 2015 at 9:16 AM, Paul Davis <paul.joseph.davis@REDACTED>
>> wrote:
>>>
>>> Our first data point was to try and look at strace. What we noticed
>>> was that scheduler threads seemed to spend an inordinate amount of
>>> time in futex system calls. An strace run on a scheduler thread showed
>>> more than 50% of time in futex sys calls.
>>
>>
>> I may have other comments later, but this should tip you off as to what
>> happens. A futex() syscall is made whenever a lock is contended. The
>> uncontended case can be handled with no kernel invocation. If you spend your
>> time here, you are contended on some resource somewhere inside the system.
>>
>> Like Hynek, if you are running on the bare metal and not in some puny
>> hypervisor, then setting something like `+sbt db` is often worth it. It
>> binds schedulers to physical cores so they don't jump around and destroy
>> your TLBs and caches into oblivion.
>>
>> I see two paths to continue on: use lcnt (lock-counting) instrumentation in
>> a staging environment, look at where that contention is, and try to
>> reproduce it. Or pray to god you are running on FreeBSD/Illumos in
>> production in which case you can find the lock contention with a 5 line
>> DTrace script on the production cluster :)
>>
>> Also, look at the current scheduler utilization!
>> erlang:statistics(scheduler_wall_time) (read the man page, you need a
>> system_flag too). You want to look at how much time the schedulers are
>> spending doing useful work and how much time they are just spinning waiting
>> for more work to come in. Though to me, the high CPU count you are seeing
>> points more toward a contended lock.
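
A minimal sketch of the scheduler_wall_time measurement suggested above, following the pattern in the erlang:statistics/1 documentation. The module name sched_util and the function sample/1 are hypothetical names chosen for illustration; only the erlang:system_flag/2 and erlang:statistics/1 calls come from OTP.

```erlang
-module(sched_util).
-export([sample/1]).

%% Sample per-scheduler utilization over IntervalMs milliseconds.
%% Returns [{SchedulerId, Utilization}], where Utilization is the share of
%% wall-clock time the scheduler was counted as active (0.0 to 1.0).
sample(IntervalMs) when is_integer(IntervalMs), IntervalMs > 0 ->
    %% Accounting is off by default; remember the previous setting.
    Old = erlang:system_flag(scheduler_wall_time, true),
    T0 = lists:sort(erlang:statistics(scheduler_wall_time)),
    timer:sleep(IntervalMs),
    T1 = lists:sort(erlang:statistics(scheduler_wall_time)),
    erlang:system_flag(scheduler_wall_time, Old),
    %% Each entry is {Id, ActiveTime, TotalTime}; compare the two snapshots.
    [{Id, (A1 - A0) / max(1, W1 - W0)}
     || {{Id, A0, W0}, {Id, A1, W1}} <- lists:zip(T0, T1)].
```

Calling sched_util:sample(5000) on a loaded node and comparing the result with the CPU usage the OS reports is one way to separate useful work from time spent elsewhere.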
>>
>>
>> --
>> J.
>
> Some more data on this. We don't set +sbt on either R14B01 or 17.5. As
> I understand it that means that schedulers are unbound before and
> after the upgrade.
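
The assumption that schedulers are unbound can be checked from a running node rather than inferred from the command line. This is a small sketch; the module name sched_bind and the function report/0 are hypothetical, while the three erlang:system_info/1 items are documented OTP calls.

```erlang
-module(sched_bind).
-export([report/0]).

%% Collect what the VM knows about scheduler binding.
%% bind_type is the requested binding policy (unbound when +sbt is not set),
%% bindings is a tuple with one element per scheduler (a logical processor
%% id or the atom unbound), and cpu_topology is the topology the VM detected
%% (undefined if detection failed).
report() ->
    [{bind_type, erlang:system_info(scheduler_bind_type)},
     {bindings, erlang:system_info(scheduler_bindings)},
     {cpu_topology, erlang:system_info(cpu_topology)}].
```

Running this on both the R14B01 and the 17.5 nodes would confirm whether the binding configuration really is the same on each side of the upgrade.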
>
> As to hyperthreading and reducing scheduler count: after we reverted a
> cluster back to R14B01 and kept seeing the elevated system CPU, one of
> my theories was that the cluster had been experiencing scheduler
> collapse, which artificially limited the number of schedulers and so
> reduced lock contention. To try and approximate that I spent some time
> playing with different numbers of schedulers online. Setting the
> online schedulers to half or a third made a definite impact on the sys
> CPU metrics but it wasn't an exact match to the pre-upgrade metrics.
> Running +sbwt none gives behavior much closer to pre-upgrade while
> maintaining all 24 schedulers.
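
The online scheduler count can be changed at runtime, which makes this kind of experiment possible without a restart. A rough sketch follows; the module sched_online and the function set_fraction/1 are hypothetical helpers, built on the documented erlang:system_info(schedulers) and erlang:system_flag(schedulers_online, N) calls.

```erlang
-module(sched_online).
-export([set_fraction/1]).

%% Put roughly Fraction (0 < Fraction =< 1) of the configured schedulers
%% online. Returns {Previous, New} so the caller can restore the old value.
set_fraction(Fraction) when Fraction > 0, Fraction =< 1 ->
    Total = erlang:system_info(schedulers),
    %% At least one scheduler must stay online.
    N = max(1, round(Total * Fraction)),
    Previous = erlang:system_flag(schedulers_online, N),
    {Previous, N}.
```

For example, sched_online:set_fraction(0.5) on a 24-scheduler node would leave 12 online; passing the returned Previous value back to erlang:system_flag(schedulers_online, Previous) undoes the change.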
>
> Regarding Jesper's comment on futex, that was our thought as well. The
> fact that we're in the kernel for futex calls should theoretically mean
> we're contending quite hard for locks somewhere. That was what led to
> trying to disable the thread pool to see if that was a lock having
> issues and eventually what led me to randomly trying +sbwt none.
>
> Here's a gist with some data from lcnt and strace that I collected.
> The lcnt run is before I set +sbwt none, and there's an strace from
> before and after. I'm going to try and get some data from the
> scheduler_wall_time stuff now and will update with anything I find on
> that.
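
For anyone wanting to reproduce an lcnt run like the one in the gist, a session might look roughly like the sketch below. It assumes the node runs an emulator built with lock counting enabled, since the lcnt calls do not work on a regular build. The module name lock_probe and the function run/1 are hypothetical; the lcnt functions are from the tools application.

```erlang
-module(lock_probe).
-export([run/1]).

%% Collect lock statistics for DurationMs milliseconds and print the most
%% contended locks. Requires a lock-counting emulator (assumption).
run(DurationMs) when is_integer(DurationMs), DurationMs > 0 ->
    %% Returns {ok, Pid} or {error, {already_started, Pid}}; both are fine.
    _ = lcnt:start(),
    %% Reset counters so only the measurement window is included.
    lcnt:clear(),
    timer:sleep(DurationMs),
    %% Pull the counters from the runtime into the lcnt server.
    lcnt:collect(),
    %% Print the ten locks with the most collisions.
    lcnt:conflicts([{max_locks, 10}]).
```

Running the same window with and without +sbwt none would show whether the set of contended locks actually changes, or only the time spent spinning on them.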
>
> Also, does anyone have a quick pointer to where the busy wait loop is?
> After I look at the scheduler time I was going to go find that code
> and see if I couldn't come up with a better idea of what exactly might
> be changing with that setting.