[erlang-questions] Schedulers go to sleep even with +sfwi 500 option on R16B03

Wed Sep 17 21:17:30 CEST 2014

Hello list,

We hit a strange performance degradation on a long-running node and found
that only 3 of the 12 schedulers on a 12-core machine had been active for a
very long time before the failure. When the load suddenly increased, these 3
schedulers tried to cope with it and quickly hit 100% utilisation. The other
schedulers woke up only after 10-15 minutes and also hit 100% utilisation,
but by then the mailboxes were so large (2-3 million messages) that they
could not catch up. Then the OOM killer came and killed the node. The load
itself was not more than our application can handle; rather, the load had
been very low for a long time before the sudden increase, which is why I
guess the schedulers went to sleep.

We tried the approach described in
 https://gist.github.com/chewbranca/07d9a6eed3da7b490b47 with the +sfwi 500
option, and it worked rather well for a long time. But now, even with this
option, we are facing the same situation again: after some period only some
of the schedulers (4-8 out of 12) are doing any work (at 25-40%
utilisation).

I checked the run queue and found that the graph derived from
erlang:statistics(run_queue) quite often shows peaks of 500-2000. So I
would expect that when the schedulers wake up every 500 ms they have a
chance to steal some work.
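
A rough sketch of how such peaks could be sampled directly on the node
rather than read off a graph. The module name rq_sample and the function
peak/2 are hypothetical; only erlang:statistics(run_queue) is a real call:

```erlang
-module(rq_sample).
-export([peak/2]).

%% Sample the total run-queue length Count times, IntervalMs apart.
%% Returns {Peak, Samples}, with Samples in the order they were taken.
%% (Module and function names are made up for this illustration.)
peak(Count, IntervalMs) when is_integer(Count), Count > 0 ->
    Samples = [sample(IntervalMs) || _ <- lists:seq(1, Count)],
    {lists:max(Samples), Samples}.

%% Take one reading, then wait before the next one.
sample(IntervalMs) ->
    Len = erlang:statistics(run_queue),
    timer:sleep(IntervalMs),
    Len.
```

For example, rq_sample:peak(20, 100) takes 20 readings over roughly two
seconds.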

Here is an example. First I check:
14> erlang:system_info(schedulers).
12
15> erlang:system_info(schedulers_online).
12

So all schedulers are online, but only 8 of them are actually working:

16> erlang:statistics(scheduler_wall_time).
[{2,145209897883663,426710944077287},
 {5,138589165714560,426710943955704},
 {3,139109592659767,426710943835106},
 {12,40582505715431,426710943217819},
 {11,42243183369677,426710943767617},
 {10,51368547792756,426710943763646},
 {6,136285798256036,426710943776964},
 {7,105950434195715,426710942867275},
 {9,87470817950104,426710943696217},
 {8,100726919140566,426710943762083},
 {1,146252599566570,426710943763556},
 {4,138618718320296,426710943749318}]

Here we can see that schedulers 9-12 are less busy than the others. As far
as I can see from our metrics, they remained in this state for 20-30 hours,
which is crazy.
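
For reference, each tuple returned by
erlang:statistics(scheduler_wall_time) is {SchedulerId, ActiveTime,
TotalTime}, so ActiveTime / TotalTime gives a utilisation ratio. For the
snapshot above that is roughly 34% for scheduler 2 and roughly 9.5% for
scheduler 12. These figures are cumulative since the scheduler_wall_time
flag was enabled, so the difference between two snapshots is a better
indicator of the current state. A minimal sketch, where the module name
sched_util and its function names are made up for illustration:

```erlang
-module(sched_util).
-export([utilisation/1, utilisation/2, sample/1]).

%% Cumulative utilisation per scheduler from a single snapshot,
%% sorted by scheduler id. Entries with a zero total are skipped.
utilisation(Snapshot) ->
    [{Id, Active / Total}
     || {Id, Active, Total} <- lists:sort(Snapshot), Total > 0].

%% Utilisation per scheduler between two snapshots of the same node.
%% Assumes both snapshots contain the same scheduler ids.
utilisation(Old, New) ->
    [{Id, safe_div(A1 - A0, T1 - T0)}
     || {{Id, A0, T0}, {Id, A1, T1}}
            <- lists:zip(lists:sort(Old), lists:sort(New))].

safe_div(_Active, 0) -> 0.0;
safe_div(Active, Total) -> Active / Total.

%% Enable the statistics, take two snapshots IntervalMs apart and
%% return the utilisation over that interval.
sample(IntervalMs) ->
    erlang:system_flag(scheduler_wall_time, true),
    Old = erlang:statistics(scheduler_wall_time),
    timer:sleep(IntervalMs),
    New = erlang:statistics(scheduler_wall_time),
    utilisation(Old, New).
```

For example, sched_util:sample(1000) returns the per-scheduler utilisation
over roughly one second, which should make sleeping schedulers stand out
more clearly than the cumulative numbers.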

I tried this solution:
erlang:system_flag(schedulers_online, 1). timer:sleep(10).
erlang:system_flag(schedulers_online, 12).

It helped and activated all schedulers, but only for several hours. Then
the situation repeated itself.
Is there any way to force the schedulers not to go to sleep?
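
The manual workaround above could be wrapped in a small helper. This is
only a sketch of that same workaround, not a way to stop schedulers from
going to sleep. The module name sched_kick is made up, and the code relies
on erlang:system_flag(schedulers_online, N) returning the previous value:

```erlang
-module(sched_kick).
-export([kick/0, kick/1]).

%% Same pause as in the shell session above (10 ms).
kick() ->
    kick(10).

%% Drop to one online scheduler, wait PauseMs, then restore the
%% previous number of online schedulers. Returns that previous number.
kick(PauseMs) ->
    Previous = erlang:system_flag(schedulers_online, 1),
    try
        timer:sleep(PauseMs)
    after
        %% Restore even if the sleep is interrupted by an exception.
        erlang:system_flag(schedulers_online, Previous)
    end,
    Previous.
```

Restoring the value returned by the first call, rather than a hard-coded
12, keeps the helper usable on machines with a different core count.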