[erlang-questions] Cost of doing +sbwt?

Tue Sep 1 09:16:19 CEST 2015

Hey everyone,

Lately at IBM/Cloudant we've been doing cluster upgrades from R14B01
to 17.5 (don't ask) and we've been noticing a fairly decent increase
in total CPU usage. The recent upgrades have actually included a two
step upgrade between three versions which I'll call A, B, and C.
Version A is our most recent production release on R14B01, version B
is a compatibility release with minor changes that still runs on
R14B01, and version C is exactly version B code except for the upgrade
to the 17.5 emulator. Each upgrade in the series involves a rolling
reboot of the entire cluster which involves anywhere from three to
twenty or so nodes at this point.

What we've observed is that each step in the upgrade process tends to
increase the CPU usage across the cluster noticeably. There's some
obvious CPU issues with a rolling reboot in our clusters that we
expect but the end result has been a significant jump in system CPU
usage between A and C. Generally speaking we're seeing A running
around 6-12% system CPU and version C running around 50% system CPU
usage.

When I first saw this my first thought was that it was just the busy
wait scheduling that has been discussed occasionally on this list.
Testing had shown a slight decrease in performance between R14B01 and
17.5 but it was roughly within the margin of error as well as not
being the most awesome representation of real user load. Given that, we
pushed through some less busy clusters and everything appeared to
match my expectations, namely increased CPU usage without any
observable effects to the outside world.

Unfortunately we had two clusters that had previously been running
fairly hot and after the upgrade response latencies jumped
significantly. Internal metrics showed latencies increasing by an
order of magnitude as well as having significantly higher variance.

On one cluster we ended up rolling back from C to A to try and undo
the latency issues which ended up having little effect on total
latencies. We did manage to change which parts of the database were
having latency issues but overall it was an unfun operational
experience.

So that's the background, here are some observations that I've been
collecting to try and narrow down the issue.

Our first data point was to try and look at strace. What we noticed
was that scheduler threads seemed to spend an inordinate amount of
time in futex system calls. An strace run on a scheduler thread showed
more than 50% of time in futex sys calls.
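
For anyone wanting to reproduce that kind of number: the post does not
say how the percentage was derived, so this is only a minimal sketch
that assumes the summary table printed by "strace -c -p <tid>" on a
scheduler thread. The function name and the sample table are
hypothetical and the numbers in the sample are made up for
illustration, they are not measurements from our clusters.

```python
def futex_share(strace_summary: str) -> float:
    """Return the '% time' value of the futex row in `strace -c` output.

    Returns 0.0 when the summary has no futex row.
    """
    for line in strace_summary.splitlines():
        fields = line.split()
        # Data rows start with the '% time' column and end with the
        # syscall name; the 'errors' column may be blank, so only the
        # first and last fields are relied upon.
        if len(fields) >= 4 and fields[-1] == "futex":
            return float(fields[0])
    return 0.0


# Illustrative input only: invented numbers in the shape of `strace -c`.
SAMPLE = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 55.00    0.550000          55     10000      2000 futex
 30.00    0.300000          30     10000           epoll_wait
 15.00    0.150000          15     10000           writev
------ ----------- ----------- --------- --------- ----------------
100.00    1.000000                 30000      2000 total
"""

if __name__ == "__main__":
    print(futex_share(SAMPLE))
```

Note that "strace -c" reports time spent in the kernel per system call
by default, so the figure is a share of traced syscall time rather than
of wall-clock time.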

On one of our dedicated clusters we ended up disabling the async
thread pool (brilliant idea via Adam Kocoloski). While this had a
super awesome impact on observed latencies for the cluster, CPU usage
remained roughly the same with ~50% of CPU usage attributed to system
CPU. Of note this cluster is a 9 node cluster with dual hex cores with
hyperthreading for a total of 24 scheduler threads as well as a four
SSD RAID 10 array for disk. My general thought here was that while it
didn't necessarily fix the underlying issue, SSDs without an async
thread pool made things fast enough to get back to roughly normal on
this cluster.

The second major data point was a similarly spec'ed cluster except
that instead of SSDs it had HDDs in the same RAID configuration.
Attempting to help fix latencies we made the same change to disable
the async thread pool. The change appeared to make the cluster faster
but there wasn't a significant change in latencies for the cluster.
There was an observable increase in throughput for the entire cluster
but as in all distributed systems it was a bit hard to deconvolve the
various signals.

Given these two data points I decided to try looking at the VM flags
that might be major behavior changes between R14B01 and 17.5. The
flags I decided to try setting were "+sbwt none +secio false +sws
legacy" (from memory). I applied those to one node on the MT cluster
and it resulted in an immediate change from roughly fifty to six-ish
percent on CPU usage.

Once I saw the drastic change in CPU I reduced the change to just
"+sbwt none" and saw no change in CPU usage which suggested that the
sbwt change was entirely responsible for the change. After letting the
cluster run I also tried using "+sbwt very_short" which resulted in a
noticeable increase in system CPU but nowhere near the ~50% system CPU
default.

The MT cluster I did the initial +sbwt change on was the multi-tenant
cluster so it's hard to draw too many conclusions given the roughly
random work load. What I did do was check that very_short increased the
system CPU usage noticeably and that default put it back to 50%.
During these experiments our various latency metrics seemed to respond
but distributed systems, so it's hard to draw direct conclusions.
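
To make the comparison between +sbwt settings repeatable, here is a
minimal sketch of sampling user versus system CPU share on Linux. It
assumes the aggregate "cpu" line of /proc/stat as the data source, which
is not necessarily how the figures in this post were collected. All
function names are hypothetical.

```python
import time

# Field order of the aggregate "cpu" line in Linux /proc/stat. The
# trailing guest columns are ignored because guest time is already
# accounted for in user/nice.
FIELDS = ("user", "nice", "system", "idle",
          "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # zip stops at the shorter sequence, so older kernels that report
    # fewer columns are handled as well.
    return dict(zip(FIELDS, (int(p) for p in parts[1:])))


def cpu_shares(before, after):
    """Return user and system CPU percentages between two samples."""
    b, a = parse_cpu_line(before), parse_cpu_line(after)
    delta = {key: a[key] - b.get(key, 0) for key in a}
    total = sum(delta.values())
    if total <= 0:
        return {"user": 0.0, "system": 0.0}
    user = delta.get("user", 0) + delta.get("nice", 0)
    # "system" here is kernel time only; irq and softirq are left out.
    system = delta.get("system", 0)
    return {"user": 100.0 * user / total,
            "system": 100.0 * system / total}


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (the aggregate 'cpu' line)."""
    with open(path) as stat_file:
        return stat_file.readline()


def sample(interval=5.0):
    """Measure CPU shares over `interval` seconds on the local host."""
    before = read_cpu_line()
    time.sleep(interval)
    return cpu_shares(before, read_cpu_line())
```

Calling sample() on a node before and after restarting it with a
different +sbwt value gives two comparable readings. This measures the
whole host, so it only approximates beam.smp on a box that runs other
things too.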

However, given my observations I was pretty certain that the +sbwt
setting was directly affecting system CPU usage. The general theory
I've been working on is that something with Erlang's elevated system
CPU usage is causing my observed latencies to increase. Given that, I
set +sbwt none on one of our problematic clusters. The initial
response to this change was a significant increase in throughput of
background tasks on the cluster. Once those cleared out the cluster
latencies have settled back into the previous R14B01 levels.

My general question at this point is whether scheduler busy wait has an
intrinsic cost that's not reflected in the documentation. What I've
seen is that we end up with a large amount of system CPU usage
that dampens user CPU usage of beam.smp which in our case affects
system latencies. We have a fairly intense port/file driver usage
which may affect the issue. Disabling busy wait fixes things and that
confuses me.

Any thoughts or similar observations would be awesomely appreciated. I
think I can see how +sbwt values affect a VM but I'm fairly confused
why reverting back to R14 didn't rectify things. I'm really hoping
someone remembers a scheduler bug around this area for that part of
the story.

I'll buy many beers at the next conference for anyone that makes me
think I'm not insane in all this.