[erlang-questions] Cost of doing +sbwt?

Paul Davis paul.joseph.davis@REDACTED
Fri Sep 4 00:39:22 CEST 2015


Sooo....

Distributed systems are hard. Sometimes it's useful to take a step back
and look at everything going on around you so that you aren't missing
any important details. Like, say, the recent addition of auditd to your
infrastructure. I'm sure the collective groan I'm imagining indicates
everyone knows where this one is going. Although there is a pretty
picture linked below that's actually a bit interesting.

As it turns out, security people showed up and said that we need to do
something to comply with some nonsensical acronym, and the judgement
was that auditd was the easiest way to accomplish that. Well howdy do
then.

One of our operator/releng engineers found that perf was suggesting
auditd was possibly a culprit for increased system CPU so we did the
obvious test and turned it off. The sudden drop in system CPU usage
made it instantly obvious we'd found the issue. He happened to do that
on a VM configured with +sbwt very_long while testing another theory
so I ended up reproducing his experiment as well as extending the
analysis a bit after he went home today. The image below captures
everything I did over the course of an hour that shows quite clearly
that auditd is terrible and I hate it.

http://i.imgur.com/v5Ggwc4.png
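In case anyone wants to run the same on/off experiment, the toggle itself is just auditctl. This is only a sketch: it assumes root on a RHEL-style box with the sysstat package installed, and the backup path is made up.

```shell
# Snapshot the current audit rules so they can be restored later
# (/root/audit.rules.backup is a hypothetical path)
auditctl -l > /root/audit.rules.backup

# Disable auditing entirely; -e 0 leaves the daemon running but inert,
# so this isolates auditing cost from the daemon's own overhead
auditctl -e 0

# Watch sys CPU for a minute (12 samples, 5 seconds apart)
mpstat 5 12

# Re-enable auditing and compare
auditctl -e 1
mpstat 5 12
```

The nice part of `-e 0` versus stopping the service is that you can flip it back and forth quickly while watching the graphs.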

A couple of things about the picture. In my original post I mentioned
that we had three versions for our upgrade. In the graph we've
installed the three separate versions on different nodes (i.e., db1 is
version A, db2 is version B, db3 is version C, where A and B are on
R14B01, and C is the same code as B but on 17.5). One of our engineers
had that clever idea and it's certainly made observing relative CPU
changes rather easy. The entire timeline of these graphs is also
under a sustained load on the cluster. It's nothing special,
just enough work to get the CPUs doing things so we can change
configurations and observe the relative differences. Also, all three
machines have identical hardware, kernels, and so on. Also
important is that the 17.5 node is using the default value for +sbwt,
so there's nothing special about that for this data.

A few things I found interesting. First, while auditd affected both
R14B01 and 17.5, the effect on 17.5 was more pronounced: R14B01's sys
CPU roughly doubled, while on 17.5 it roughly tripled. Second, it
wasn't the syscall rules we had for auditd, but the file modification
rules that were the cause of the significant CPU usage, and the file
rules had a much more pronounced effect on 17.5. You can see where the
system CPU starts rising noticeably around 18:55 on db3.
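For anyone unfamiliar with the distinction, the two rule types look like this in auditctl syntax (the paths and keys here are illustrative examples, not our actual ruleset):

```shell
# A syscall rule: attach to the syscall exit filter and log every
# open()/openat() by any process, tagged with a search key
auditctl -a always,exit -S open -S openat -k file-open

# A file-watch rule: log writes and attribute changes on a single path.
# Each matching modification generates an audit event, which is where
# the per-operation cost comes from.
auditctl -w /etc/passwd -p wa -k passwd-changes
```

File watches fire on every matching write, so on a write-heavy database node they can end up far more expensive than a narrowly scoped syscall rule.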

So I guess I now have to disappear into the bureaucracy to figure out
how to maintain compliance while not leaving performance dead at the
side of the road.

Lastly a few other observations so far after testing more with auditd disabled:

- On our baseline config (mostly +A 32 and a few limits raised), 17.5
  uses ~4% more system CPU than R14B01
- +scl false reduces system CPU usage a couple of percent
- +scl false +sbt db decreases it slightly from the baseline
- +sbt db alone increases system CPU noticeably above the baseline
- +scl false +sfwi 500 doesn't really change CPU usage from just
  +scl false (though you'd expect that when it's under constant load)
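For anyone wanting to try the same matrix, the flags just compose on the erl command line. A sketch of the variants above (the +A 32 baseline is from our config; a real node would carry its other args too):

```shell
# Baseline: 32 async threads
erl +A 32

# Variant: disable scheduler compaction of load
erl +A 32 +scl false

# Variant: compaction off, plus binding schedulers to cores
# with the default bind type
erl +A 32 +scl false +sbt db

# Variant: compaction off, plus a 500 ms scheduler forced
# wakeup interval
erl +A 32 +scl false +sfwi 500
```

These are all emulator flags, so they go in vm.args (or the equivalent) for a release rather than on an ad-hoc erl invocation.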

So basically the moral of the story is that Erlang is fine. We'll
obviously redo the various VM changes on an unmixed cluster to look at
performance, but it's clear that auditd is the cause of our system
CPU issues.

Cheers,
Paul Davis


On Wed, Sep 2, 2015 at 7:23 AM, Jesper Louis Andersen
<jesper.louis.andersen@REDACTED> wrote:
> If memory serves, R14 can't experience scheduler collapse since it doesn't
> do rebalancing of work the same way as R15 and onwards. So I think this is a
> red herring.
>
> Have you established a baseline for the locking in R14? You are contending
> on the runqueue lock quite a lot, which could account for all the spinning
> you are seeing, but it is hard to say if this is a high or low number
> without some baseline you can use to relate. Also, many of the futex() calls
> are probably for this contention as well. There is a chance your scheduler
> utilization isn't that high, but you are getting into the spinning all the
> time. If utilization is fairly low, then the 50% CPU isn't of concern: just
> load the system more :)
>
> Chances are you are hunting the wrong mark as well: You have 2 or more
> pathologies, and they overlap in what you are seeing. Hence you get
> distracted by the noise generated by the other problems. It may be you have
> a CPU problem and on top of that, you have a latency/blocking problem in an
> I/O layer as well. One could account for the latency spikes, whereas the
> other would explain the high CPU. But if you don't know you are sitting with
> two problems in the first place, then their cooperation in the system
> confuses you.
>
> If you have elevated CPU, then a snapshot of the current thread stacks at
> 97 Hz[1] should tell you where things are taking time. This in
> itself could give you hints as to where you are spending all of your time in
> the system, and also what you are spinning on, if anything.
>
> [1] Old trick: never sample at 100 Hz or another rate that can get
> into phase with other periodic jobs. Pick a prime near your target.
>
>
> On Tue, Sep 1, 2015 at 9:42 PM, Lukas Larsson <lukas@REDACTED>
> wrote:
>>
>> On Tue, Sep 1, 2015 at 9:14 PM, Paul Davis <paul.joseph.davis@REDACTED>
>> wrote:
>>>
>>> Also, does anyone have a quick pointer to where the busy wait loop is?
>>> After I look at the scheduler time I was going to go find that code
>>> and see if I couldn't come up with a better idea of what exactly might
>>> be changing with that setting.
>>
>>
>> This should be the code that does the waiting:
>> https://github.com/erlang/otp/blob/master/erts/lib_src/pthread/ethr_event.c#L65-L161
>>
>> The mutex implementation that calls it is in here:
>> https://github.com/erlang/otp/blob/master/erts/lib_src/common/ethr_mutex.c
>>
>> The different spin options are set here:
>> https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_process.c#L5325-L5364
>>
>> There are also a couple of other places where it spins in erl_process.c;
>> just search for "spin" and you'll find them :)
>
>
>
>
> --
> J.


