[erlang-questions] How to avoid long_schedule issue ?

Thu Dec 29 00:19:45 CET 2016

Hi,

First of all, lists:seq/2 is not a BIF, it is a pure Erlang function. And
the 20-40 ms "long schedule" events are absolutely normal, you should use a
much larger threshold (I would recommend at least 500 ms) to filter for the
real outliers only.

Now let's see what more can I tell you about long schedules!

When you monitor long schedules, the schedulers will simply note the wall
clock time when they schedule in a process and compare it with the wall
clock time when they schedule it out. If more time has passed than the
threshold you set, you get a message. This unfortunately can be quite a bit
misleading, as it may include times you wouldn't expect. Just like:

   - Other OS processes running on the CPU. The OS can preempt the
   scheduler thread and give the CPU to some other process for 10-20 ms or
   more. And the scheduler will not know about this interruption. I'm pretty
   sure this is the reason for your long schedules, Alex.
   And this is fine as long as the host does not actually become CPU
   limited. If there are many OS processes fighting for the CPU, you will see
   horrible long schedules all over the place.
   - The OS performing some time consuming task for you. Like a page fault
   that requires reading from the disk swap. Whoops, your scheduler is
   suspended for tens of milliseconds without noticing it!
   - The OS performing some time consuming interrupt handling while your
   scheduler has the CPU.
   This is my personal favourite, because this means some totally unrelated
   OS code (like crappy NIC drivers) runs in your process' context, and can
   log scary looking messages that all seem to come from the pure beam
   process. Not to mention long schedules, of course.
   - The time it takes to grab some internal locks in the VM. If the lock
   is held by a long scheduling process, every other process waiting for the
   same lock will also long schedule.
   In our production system for example ~50% of long schedules come from a
   single monitoring process that periodically collects process info from
   other processes. Of course if a process long schedules with e.g. 1200 ms,
   the monitoring process will have to wait up to 1200 ms as well to grab the
   lock on it required for fetching process info.

It is also good to know that not all BIF-s are preemptable, and those that
are, will calculate their reduction cost in very ad-hoc looking ways. For
example, it looks like that the lists:reverse/2 BIF can process 40 list
elements per reduction, while lists:keyfind/3 can search 10 list elements
per reduction. Do you think that reverting 40 list elements and looking
through 10 list elements would take exactly the same wall clock time?
Probably not. And they probably won't take exactly as much time either as
an average reduction when executing your application's Erlang code. These
reduction cost estimates work fine in most cases, but can be inaccurate
when you give huge inputs to these functions. If they happen to be too low
estimates on your system, you may still see long schedules when all your
BIF-s and NIF-s are nice and preemptable.

Now if you still have interesting long schedules that you want to debug,
you need to keep in mind that the schedule in and schedule out functions
are not necessarily the point where the time is wasted. For example if you
have a gen_server that - when handling one particular message - calls an
unfriendly BIF/NIF which doesn't update the reduction count, you will
typically see that both the schedule in and schedule out points are
gen_server:loop/6. Nothing will point to the BIF/NIF in the event, so good
luck finding the offender! You have to consider all execution pathes that
may lead from the schedule in point to the schedule out point. The offender
can be any of the functions used on any of these pathes.

Finally, a bit about finding the ideal long schedule threshold. 10 ms is
typically too low: it basically means every time the OS schedules out your
VM thread you will get a long schedule. But you need to consider the
latency requirements of your application: if you're doing high frequency
trading or ad bidding or whatever, maybe a 10 ms pause would be too much
for you. In this case you can use such a low threshold, but be sure to turn
off swapping and pin your schedulers to cores that are exclusively used by
the VM, and where you have disabled interrupts, tick handling etc. In
general, for a system where you need to keep latency under T, it makes
sense to monitor long schedules with a threshold of ~0.5 T - 0.8 T or so.

Both heart and the net ticktime of the distribution protocol give you such
latency requirements: heart needs a heart beat message every 60 seconds and
the distribution protocol sends one hear beat message every 15 seconds. So
long schedules in the 15,000 ms range start to interfere with the
distribution protocol, and above 60,000 ms can kill your node. (These
limits may sound crazy, but I regularly see ~20,000 ms long schedules in
our systems. Unfortunately.)

Hope this helps!

Daniel

On Wed, 28 Dec 2016 at 16:23 Max Lapshin <max.lapshin@REDACTED> wrote:

> I'm also very interested in how to properly interpret these warnings =(
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20161228/93d834c2/attachment.htm>