[erlang-questions] Schedulers getting "stuck", part II

Michael Truog mjtruog@REDACTED
Sat Apr 27 05:09:33 CEST 2013

On 04/26/2013 07:20 PM, Scott Lystig Fritchie wrote:
> Howdy.  This is a followup to the discussion that took place on this
> list in October 2012, see:
>     http://erlang.org/pipermail/erlang-questions/2012-October/069503.html
>         (first message only, I dunno why)
>     http://erlang.org/pipermail/erlang-questions/2012-October/069585.html
>         (the rest of the thread)
> I've been trying to figure out how to introduce the stuff that I've
> written at:
>     https://github.com/slfritchie/nifwait/tree/md5#readme
> ... but I still can't decide.  So I'll try for something short and
> un-Scott-like.  For the long story, please read the README in the URL
> above.
> As for the short story, I believe a couple of things:
> * R15B0x's schedulers are broken: Basho has seen "stuck" schedulers in one of
>   our apps with no custom NIF code.  And it's possible to get them stuck
>   using only the 'crypto' module's MD5 functions.
> * R16B's schedulers appear to be even more broken: I have a
>   mostly-deterministic case that demonstrates schedulers that go to
>   sleep and do not wake for minutes (or hours) when there is plenty of
>   work to do.  This also is using only the 'crypto' module and does not
>   require custom NIF code.
> Discuss.  :-)
> -Scott
I would expect the most useful numbers for understanding crypto's impact on the scheduler to be the min, max, mean, and stddev of the time spent computing the MD5 within the NIF.  Based on previous complaints about long-running NIF code, this problem should be more likely when computing the MD5 has higher latency (so on slower computers).

I have not looked at the scheduling criteria in the recent source code, but I would assume that this behaviour is caused by some abnormality in the latency of the MD5 NIF function (where either the mean or the stddev is high enough to trigger it).  With the min, max, mean, and stddev (found with a custom NIF) you could probably model the behaviour with a function that does a quick sleep, which would provide the OTP team with test cases for development.  I don't have data or code to back this up, but I think the approach would be helpful, and the data would help determine the proper use of NIFs.

I believe that an older version of the crypto driver from R12 or R13 had code for using the async thread pool (it may not have been turned on; I remember it being conditional), such that this behaviour should not occur.  So, testing Erlang MD5 usage in R12 or R13 should confirm that this is a problem specific to the scheduling of NIF code.
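
As a rough illustration of the numbers meant above, here is a minimal sketch that collects min, max, mean, and stddev from the Erlang side.  It assumes that timing each call with timer:tc/1 is an acceptable first approximation; the samples include call overhead, so a custom NIF that times itself would be more precise.  The module name nif_latency and the functions measure/2 and summary/1 are hypothetical names chosen for this example.

```erlang
-module(nif_latency).
-export([measure/2, summary/1]).

%% Time N calls of Fun/0 with timer:tc/1 and summarise the samples.
%% Times are in microseconds and include Erlang call overhead, so they
%% only approximate the time spent inside the NIF itself.
measure(Fun, N) when is_function(Fun, 0), is_integer(N), N > 0 ->
    Samples = [element(1, timer:tc(Fun)) || _ <- lists:seq(1, N)],
    summary(Samples).

%% Return {Min, Max, Mean, StdDev} for a non-empty list of numbers.
%% StdDev is the population standard deviation.
summary([_ | _] = Samples) ->
    N = length(Samples),
    Mean = lists:sum(Samples) / N,
    SumSq = lists:sum([(X - Mean) * (X - Mean) || X <- Samples]),
    {lists:min(Samples), lists:max(Samples), Mean, math:sqrt(SumSq / N)}.
```

On R16B, something like nif_latency:measure(fun() -> crypto:hash(md5, Bin) end, 1000), with Bin a binary of the size used in the failing test case, should give the four numbers for MD5; on older releases crypto:md5(Bin) would take the place of crypto:hash/2.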

- Michael
