[erlang-questions] Dirty CPU schedulers stuck at zero utilization

Jesse Stimpson jstimpson@REDACTED
Wed Jan 23 19:14:26 CET 2019


To clarify our workload: the NIF execution itself takes around 1 msec,
while the data on which it operates represents 10 msec of audio. Apologies
if my last message was unclear.
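
For reference, a minimal sketch of one way to measure a per-call figure
like this with timer:tc/3; webrtc_nif:add_frame/2 is a hypothetical name
for our wrapper around the C++ buffer call:

    %% Buffer and Frame are assumed to be bound already; timer:tc/3
    %% returns the elapsed time in microseconds along with the result.
    {Micros, ok} = timer:tc(webrtc_nif, add_frame, [Buffer, Frame]),
    io:format("NIF call took ~b usec~n", [Micros]).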

Out of convenience, we're using the open-source WebRTC project to take
advantage of its built-in PLC, FEC, Opus support, etc. The project is
written in C++, so we have integrated with it via a NIF. Unfortunately,
rewriting it as a yielding NIF, or rewriting it in Erlang, is not as
straightforward as we would like, although I admit that doing so would
alleviate our scheduling issues.
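
For the curious, here is a rough sketch of the Erlang-side mitigation we
have considered instead, assuming the same hypothetical
webrtc_nif:add_frame/2 wrapper: split the payload into 10 msec frames and
make one short NIF call per frame, so no single invocation monopolizes a
scheduler.

    -define(FRAME_BYTES, 960). %% assumed: 10 msec of 48 kHz mono 16-bit PCM

    %% Feed the WebRTC buffer one frame per NIF call; a trailing
    %% partial frame is ignored in this sketch.
    add_audio(Buffer, <<Frame:?FRAME_BYTES/binary, Rest/binary>>) ->
        ok = webrtc_nif:add_frame(Buffer, Frame),
        add_audio(Buffer, Rest);
    add_audio(_Buffer, _Partial) ->
        ok.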

We'll continue our testing with 21.2.3 and report back if there appear to
be any other leads.

Thanks,
Jesse

On Tue, Jan 22, 2019 at 2:56 PM Max Lapshin <max.lapshin@REDACTED> wrote:

> Why do you do it via a NIF?
>
> On Wed, Jan 16, 2019 at 6:00 PM Jesse Stimpson <
> jstimpson@REDACTED> wrote:
>
>> It's possible that during our tests the utilization spike was masked by
>> the collapse issue fixed in the recent PRs. Is there any other analysis I
>> can provide on the utilization spike/sleep behavior we're seeing, or any
>> other debugging or code reading you would recommend? As far as I can
>> tell, there's nothing about our workload that would cause periodic
>> behavior like this. The application is slinging RTP audio via UDP to
>> remote endpoints at a 20 msec ptime. Each call to the NIF in question
>> adds 10 msec of audio to the WebRTC buffer.
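>>
>> For concreteness, a sketch of the kind of sampling loop that produces
>> the timestamped msacc excerpts mentioned below (msacc ships in
>> runtime_tools; msacc:start/1 resets the counters, collects for the
>> given number of milliseconds, then stops):
>>
>>     %% Take N one-second microstate-accounting samples, each tagged
>>     %% with a wall-clock timestamp so sleep spikes can be lined up
>>     %% with other logs.
>>     sample_msacc(0) -> ok;
>>     sample_msacc(N) ->
>>         io:format("~p~n", [calendar:local_time()]),
>>         msacc:start(1000),
>>         msacc:print(),
>>         sample_msacc(N - 1).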
>>
>> As a point of corroboration, this user on Stack Overflow appears to be
>> having the same or a similar issue:
>> https://stackoverflow.com/questions/49563067/erlang-schedulers-just-sleep-why
>>
>> As always, the level of support from the Erlang community is second to
>> none. Thanks to all for your time!
>>
>> Jesse
>>
>>
>>
>> On Wed, Jan 16, 2019 at 6:35 AM Rickard Green <rickard@REDACTED> wrote:
>>
>>> On 2019-01-15 23:11, Jesse Stimpson wrote:
>>> > Behavior of the schedulers appears to have the same issue with the
>>> > 2093 patch.
>>> >
>>> > But I did notice something new in the msacc output. There is a very
>>> > brief period, at approximately the same time as the normal schedulers'
>>> > usage spikes, where all the dirty CPU schedulers have a significant
>>> > sleep time. I've included timestamped excerpts below, starting with
>>> > the increase in dirty CPU sleep and ending with a "steady state"
>>> > utilization.
>>> >
>>>
>>> We just released OTP-21.2.3 containing PR-2093.
>>>
>>> I don't think PR-2093 caused the spikes. The change does not affect how
>>> work is moved between normal and dirty schedulers; it only prevents the
>>> "loss" of dirty schedulers.
>>>
>>> If a process is scheduled on a dirty scheduler, it won't make progress
>>> until it has executed on a dirty scheduler, and vice versa for normal
>>> schedulers. This is the same both before and after PR-2093. Since dirty
>>> schedulers aren't "lost" after PR-2093, progress of such processes
>>> happens earlier, which of course changes the behavior, but that is due
>>> to the workload.
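>>>
>>> For completeness, a quick way to watch per-scheduler utilization,
>>> including the dirty schedulers, from the shell; this is a sketch based
>>> on the documented scheduler_wall_time recipe:
>>>
>>>     %% Enable wall-time accounting, take two samples one second apart,
>>>     %% and compute the active fraction per scheduler id. Dirty CPU
>>>     %% schedulers are numbered after the normal schedulers.
>>>     erlang:system_flag(scheduler_wall_time, true),
>>>     S0 = lists:sort(erlang:statistics(scheduler_wall_time_all)),
>>>     timer:sleep(1000),
>>>     S1 = lists:sort(erlang:statistics(scheduler_wall_time_all)),
>>>     [{Id, (A1 - A0) / (T1 - T0)}
>>>      || {{Id, A0, T0}, {Id, A1, T1}} <- lists:zip(S0, S1)].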
>>>
>>> Regards,
>>> Rickard
>>>
>>
>>
>> --
>>
>> Jesse Stimpson
>>
>> Site Reliability Engineering
>>
>> m: (919) 995-0424
>> RepublicWireless.com <https://republicwireless.com/>
>>
>

-- 

Jesse Stimpson

Site Reliability Engineering

m: (919) 995-0424
RepublicWireless.com <https://republicwireless.com/>