[erlang-questions] Dirty CPU schedulers stuck at zero utilization
Thu Jan 24 22:21:12 CET 2019
After more testing with 21.2.3, it appears that the behavior we are seeing
with respect to dirty schedulers going to sleep occurs only when the dirty
scheduler runs on a CPU hyperthread. On our test instance, I disabled
hyperthreading and used +SP 50:50, and the problem has gone away. This
gives our NIF workload and our non-NIF workload an even split of CPU
resources, which works well for our purposes.
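For anyone reproducing this, here is a minimal sketch of the flag involved and how to check what the VM actually settled on (the 50:50 split is specific to our instance; the shell calls are illustrative):

```erlang
%% Start the VM with half the logical processors for schedulers:
%%   erl +SP 50:50
%% Then sanity-check the scheduler counts from the Erlang shell:
1> erlang:system_info(logical_processors_available).
2> erlang:system_info(schedulers_online).
3> erlang:system_info(dirty_cpu_schedulers_online).
```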
It's unclear to us why hyperthreading causes this behavior, but we do
wonder if it may have something to do with AWS virtualization. In any case,
thanks to everyone for the assistance!
On Wed, Jan 23, 2019 at 1:14 PM Jesse Stimpson <
> To clarify about our workload: the NIF execution itself takes around 1 msec,
> but the data on which it operates represents 10 msec of audio. Apologies if my
> last message was unclear.
> Out of convenience, we're using the open source WebRTC project to take
> advantage of its built-in PLC, FEC, Opus, etc. The project is written in
> C++, so we have integrated with it via a NIF. Unfortunately, rewriting as a
> yielding NIF, or rewriting in Erlang, is not as straightforward as we
> would like, although I admit it would alleviate our scheduling issues.
> We'll continue our testing with 21.2.3 and report back if there appear to
> be any other leads.
> On Tue, Jan 22, 2019 at 2:56 PM Max Lapshin <max.lapshin@REDACTED> wrote:
>> why do you do it via nif?
>> On Wed, Jan 16, 2019 at 6:00 PM Jesse Stimpson <
>> jstimpson@REDACTED> wrote:
>>> It's possible that during our tests the utilization spike was masked by
>>> the collapse issue fixed in the recent PRs. Is there any other analysis I
>>> can provide on the utilization spike/sleep behavior we're seeing, or any
>>> other debugging or code reading you recommend? As far as I can tell,
>>> there's nothing about our workload that would cause periodic behavior like
>>> this. The application is slinging RTP audio via UDP to remote endpoints at
>>> a 20 msec ptime. Each function call for the NIF in question adds 10 msec of
>>> audio to the WebRTC buffer.
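To put rough numbers on that workload (a back-of-envelope sketch using the figures above; the variable names are just for illustration):

```erlang
%% ~1 msec of NIF CPU time per call, each call covering 10 msec of audio,
%% sent as RTP packets at a 20 msec ptime (two calls' worth of audio per packet).
NifCpuMs = 1,
AudioMsPerCall = 10,
DutyCycle = NifCpuMs / AudioMsPerCall,      %% ~0.1: each stream needs ~10% of a core
StreamsPerScheduler = trunc(1 / DutyCycle). %% so roughly 10 streams keep one dirty scheduler busy
```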
>>> As a point of corroboration, this user on Stack Overflow appears to be
>>> having the same or a similar issue:
>>> As always, the level of support from the Erlang community is second to
>>> none. Thanks to all for your time!
>>> On Wed, Jan 16, 2019 at 6:35 AM Rickard Green <rickard@REDACTED> wrote:
>>>> On 2019-01-15 23:11, Jesse Stimpson wrote:
>>>> > Behavior of the schedulers appears to have the same issue with PR-2093.
>>>> > But I did notice something new in the msacc output. There is a very
>>>> > brief period, at approx the same time as the normal schedulers usage
>>>> > spikes, where all the dirty cpu schedulers have a significant sleep
>>>> > time. I've included timestamped excerpts below, starting with the
>>>> > increase in dirty cpu sleep, and ending with a "steady state"
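For anyone wanting to reproduce excerpts like these, microstate accounting output of this kind can be gathered with something like the following (msacc ships in runtime_tools; this is a generic sketch, not necessarily the exact collection used in the thread):

```erlang
%% Gather microstate accounting for one second, then print the summary
%% (per-thread sleep/emulator/etc. shares, including dirty CPU schedulers):
msacc:start(1000),
msacc:print(),
%% or keep the raw per-thread data for timestamped logging:
Stats = msacc:stats().
```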
>>>> We just released OTP-21.2.3 containing PR-2093.
>>>> I don't think PR-2093 caused the spikes. This change does not affect how
>>>> work is moved between normal and dirty schedulers; it only prevents the
>>>> "loss" of dirty schedulers.
>>>> If a process is scheduled on a dirty scheduler, it won't make progress
>>>> until it has executed on a dirty scheduler, and vice versa (for normal
>>>> schedulers). This is the same both before and after PR-2093. Since
>>>> schedulers aren't "lost" after PR-2093, progress of such processes will
>>>> happen earlier, which of course changes the behavior, but that is due to
>>>> the workload.
>>> Jesse Stimpson
>>> Site Reliability Engineering
>>> m: (919) 995-0424
>>> RepublicWireless.com <https://republicwireless.com/>