<div dir="ltr"><span style="font-family:arial,sans-serif;font-size:13px">Hello list,</span><br style="font-family:arial,sans-serif;font-size:13px"><br style="font-family:arial,sans-serif;font-size:13px"><span style="font-family:arial,sans-serif;font-size:13px">We faced strange performance degradation on the long living node, and found out that only 3 schedulers out of 12 on a 12 - core machine had been active for a very long time before the failure. When the load suddenly increased these 3 schedulers tried to cope with it and hit 100% utilisation very fast, and even when after 10-15 minutes other schedulers woke up they also hit 100 utilisation but because of heavy mail boxes (2-3 millions of msgs) it was not in their power to cope with it. So then OOM came and killed the node. It was not the load which our application could not cope with, but there was very little load for a long time before sudden increase, that's why I guess there schedulers went to sleep.</span><br style="font-family:arial,sans-serif;font-size:13px"><br style="font-family:arial,sans-serif;font-size:13px"><span style="font-family:arial,sans-serif;font-size:13px">We have tried approach mentioned in the</span><br style="font-family:arial,sans-serif;font-size:13px"><span style="font-family:arial,sans-serif;font-size:13px"> </span><a href="https://gist.github.com/chewbranca/07d9a6eed3da7b490b47" target="_blank" style="font-family:arial,sans-serif;font-size:13px">https://gist.github.com/chewbranca/07d9a6eed3da7b490b47</a><span style="font-family:arial,sans-serif;font-size:13px"> with +sfwi 500 options and it had been working rather well for a long time, but now even with this option we faced with this situation again, when after some period only some of schedulers (4-8) out of 12 begin to work (25-40% of utilisation).</span><br style="font-family:arial,sans-serif;font-size:13px"><br style="font-family:arial,sans-serif;font-size:13px"><span style="font-family:arial,sans-serif;font-size:13px">I've checked run_queue and found out that rather often there are peaks 500-2000 on the graph derieved from erlang:statistics(run_queue). 
To give a concrete example, I check:

14> erlang:system_info(schedulers).
12
15> erlang:system_info(schedulers_online).
12

So all schedulers are online, but only 8 of them are actually doing any work:

16> erlang:statistics(scheduler_wall_time).
[{2,145209897883663,426710944077287},
 {5,138589165714560,426710943955704},
 {3,139109592659767,426710943835106},
 {12,40582505715431,426710943217819},
 {11,42243183369677,426710943767617},
 {10,51368547792756,426710943763646},
 {6,136285798256036,426710943776964},
 {7,105950434195715,426710942867275},
 {9,87470817950104,426710943696217},
 {8,100726919140566,426710943762083},
 {1,146252599566570,426710943763556},
 {4,138618718320296,426710943749318}]
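In case someone wants to reproduce how we read these numbers: the counters above are absolute, so they only make sense as deltas between two samples. We compute per-scheduler utilisation roughly like this (a sketch along the lines of the example in the erlang:statistics/1 documentation; scheduler_wall_time must already be enabled, which it is on our node since the call returns a list and not undefined):

%% erlang:system_flag(scheduler_wall_time, true).  % enable first if needed
Sample0 = lists:sort(erlang:statistics(scheduler_wall_time)),
timer:sleep(10000),
Sample1 = lists:sort(erlang:statistics(scheduler_wall_time)),
%% ActiveTime delta / TotalTime delta, per scheduler id
Util = lists:map(fun({{I, A0, T0}, {I, A1, T1}}) ->
                         {I, (A1 - A0) / (T1 - T0)}
                 end,
                 lists:zip(Sample0, Sample1)).

Over a 10-second window this gives utilisation close to 0 for schedulers 9-12 while 1-6 stay near 1.0.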
style="font-family:arial,sans-serif;font-size:13px"><span style="font-family:arial,sans-serif;font-size:13px"> {9,87470817950104,</span><span style="font-family:arial,sans-serif;font-size:13px">426710943696217},</span><br style="font-family:arial,sans-serif;font-size:13px"><span style="font-family:arial,sans-serif;font-size:13px"> {8,100726919140566,</span><span style="font-family:arial,sans-serif;font-size:13px">426710943762083},</span><br style="font-family:arial,sans-serif;font-size:13px"><span style="font-family:arial,sans-serif;font-size:13px"> {1,146252599566570,</span><span style="font-family:arial,sans-serif;font-size:13px">426710943763556},</span><br style="font-family:arial,sans-serif;font-size:13px"><span style="font-family:arial,sans-serif;font-size:13px"> {4,138618718320296,</span><span style="font-family:arial,sans-serif;font-size:13px">426710943749318}]</span><br style="font-family:arial,sans-serif;font-size:13px"><br style="font-family:arial,sans-serif;font-size:13px"><span style="font-family:arial,sans-serif;font-size:13px">Here we can see that 9-12 schedulers are less busy then others. As I can see from out metrics they remained in this state for 20-30 hours, which is crazy.</span><br style="font-family:arial,sans-serif;font-size:13px"><br style="font-family:arial,sans-serif;font-size:13px"><span style="font-family:arial,sans-serif;font-size:13px">I tried this solution:</span><br style="font-family:arial,sans-serif;font-size:13px"><span style="font-family:arial,sans-serif;font-size:13px">erlang:system_flag(schedulers_</span><span style="font-family:arial,sans-serif;font-size:13px">online, 1). timer:sleep(10). erlang:system_flag(schedulers_</span><span style="font-family:arial,sans-serif;font-size:13px">online, 12).</span><br style="font-family:arial,sans-serif;font-size:13px"><br style="font-family:arial,sans-serif;font-size:13px"><span style="font-family:arial,sans-serif;font-size:13px">It helped and activated all schedulers, but only for several hours. Then situation repeated.</span><br style="font-family:arial,sans-serif;font-size:13px"><span style="font-family:arial,sans-serif;font-size:13px">I wonder if there exist any solution to force schedulers not to go to sleep?</span><br></div>