[erlang-questions] Schedulers getting "stuck", part II

Sat Apr 27 11:16:58 CEST 2013

Our code makes a lot of small NIF calls. Each call basically does some
memcpy and ends with a write to a non-blocking file descriptor. It can
get called thousands of times per second.
We moved a few servers to R16. The CPU usage is noticeably lower than
on R14, worryingly so. At first we had a lot of problems with
schedulers that simply stopped working for up to a minute. Even the
console was unresponsive. I saw that Basho added erlang:bump_reductions
calls around their NIF calls, so we added that too. It did improve the
situation, but short stalls still seem to happen.

Sergej

On Sat, Apr 27, 2013 at 5:09 AM, Michael Truog <mjtruog@REDACTED> wrote:
>
> On 04/26/2013 07:20 PM, Scott Lystig Fritchie wrote:
> > Howdy.  This is a followup to the discussion that took place on this
> > list in October 2012, see:
> >
> >     http://erlang.org/pipermail/erlang-questions/2012-October/069503.html
> >         (first message only, I dunno why)
> >     http://erlang.org/pipermail/erlang-questions/2012-October/069585.html
> >         (the rest of the thread)
> >
> > I've been trying to figure out how to introduce the stuff that I've
> > written at:
> >
> >     https://github.com/slfritchie/nifwait/tree/md5#readme
> >
> > ... but I still can't decide.  So I'll try for something short and
> > un-Scott-like.  For the long story, please read the README in the URL
> > above.
> >
> > As for the short story, I believe a couple of things:
> >
> > * R15B0x's schedulers are broken: Basho has seen "stuck" schedulers in
> >   one of our apps with no custom NIF code.  And it's possible to get
> >   them stuck using only the 'crypto' module's MD5 functions.
> >
> > * R16B's schedulers appear to be even more broken: I have a
> >   mostly-deterministic case that demonstrates schedulers that go to
> >   sleep and do not wake for minutes (or hours) when there is plenty of
> >   work to do.  This also is using only the 'crypto' module and does not
> >   require custom NIF code.
> >
> > Discuss.  :-)
> >
> > -Scott
> > _______________________________________________
> > erlang-questions mailing list
> > erlang-questions@REDACTED
> > http://erlang.org/mailman/listinfo/erlang-questions
> >
> I would expect the min, max, mean, and stddev of the time spent computing the MD5 within the NIF to be useful numbers for understanding crypto's impact on the scheduler.  Based on previous complaints about long-running NIF code, this problem should be more likely when the MD5 computation has higher latency (so on slower computers).  I have not looked at the scheduling criteria in the recent source code, but I would assume that this behaviour is caused by some abnormality in the latency of the MD5 NIF function (where either the mean or the stddev is high enough to trigger it).  With the min, max, mean, and stddev (found with a custom NIF) you could probably model the behaviour with a function that does a quick sleep (which would provide the OTP team with test cases for development).  I don't have data or code to back this up, but I think the approach would be helpful, and the data would help determine the proper use of NIFs.
>
> I believe that an older version of the crypto driver from R12 or R13 had code for using the async thread pool (it may not have been turned on; I remember it being conditional), such that this behaviour should not occur.  So, testing Erlang md5 usage in R12 or R13 should confirm that this is a problem specific to the scheduling of NIF code.
>
> - Michael
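
The measurement Michael proposes above (min, max, mean, and stddev of the
time spent per call) could be collected with something like the sketch
below. It is only an illustration, not code from this thread: lat_stats,
lat_add, time_calls, and fake_digest are hypothetical names, fake_digest
is a stand-in workload rather than the crypto module's MD5, and the
statistics are accumulated with Welford's online method so that no
per-call samples need to be stored.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime() */
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Running latency statistics, updated one sample at a time (Welford). */
typedef struct {
    uint64_t n;      /* number of samples seen so far */
    double min, max; /* extremes, valid once n > 0 */
    double mean;     /* running mean */
    double m2;       /* sum of squared deviations from the mean */
} lat_stats;

static void lat_init(lat_stats *s) {
    s->n = 0;
    s->min = s->max = s->mean = s->m2 = 0.0;
}

static void lat_add(lat_stats *s, double x) {
    double delta = x - s->mean;
    s->n++;
    s->mean += delta / (double)s->n;
    s->m2 += delta * (x - s->mean);
    if (s->n == 1 || x < s->min) s->min = x;
    if (s->n == 1 || x > s->max) s->max = x;
}

/* Sample variance; the stddev is sqrt() of this (link with -lm). */
static double lat_variance(const lat_stats *s) {
    return s->n > 1 ? s->m2 / (double)(s->n - 1) : 0.0;
}

/* Monotonic clock reading in microseconds. */
static double now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e6 + (double)ts.tv_nsec / 1e3;
}

/* Hypothetical stand-in for the work done inside the NIF (NOT MD5). */
static uint32_t fake_digest(const unsigned char *buf, size_t len) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= buf[i];
        h *= 16777619u;
    }
    return h;
}

/* Time `calls` invocations of the workload, recording each latency in s.
 * The XOR of the results is returned so the work cannot be optimised away. */
static uint32_t time_calls(lat_stats *s, const unsigned char *buf,
                           size_t len, int calls) {
    uint32_t sink = 0;
    for (int i = 0; i < calls; i++) {
        double t0 = now_us();
        sink ^= fake_digest(buf, len);
        lat_add(s, now_us() - t0);
    }
    return sink;
}
```

To use it, call lat_init() once, run time_calls() over a representative
buffer size, and then read min, max, mean, and lat_variance() from the
struct. In a real NIF the same bookkeeping would wrap the actual digest
call, and the resulting numbers could then parameterise a sleep-based
test function of the kind suggested above.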