Well, some more findings on this. My workload is typically high and I have seen run queue sizes around 600.

The schedulers always drop off in descending order: first scheduler 4, then scheduler 3. I have not seen it go further than that. The schedulers themselves appear to be OK, but the run queue associated with the dropped scheduler is empty, at least that is what I could find by monitoring statistics(run_queues). The good news is that I can get a scheduler ticking again without a restart: a spawn with the scheduler option set to the particular scheduler id gets things back to normal.
  spawn_opt(fun() -> ok end, [{scheduler, 4}]).

Spawning with scheduler 4 brings back both 3 and 4.

Extending this, I have been able to make sure that schedulers don't drop off at all just by keeping one long-running active process tied to the last scheduler id; a sketch of that workaround follows below.
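For reference, this is roughly what that looks like as a module. It is only a minimal sketch: the module and function names and the 1-second wake-up interval are my own, the {scheduler, Id} spawn option is the same one used above, and if a sleeping process does not count as "active" enough to hold the scheduler, the loop may need to do a little real work instead.

  -module(sched_keepalive).
  -export([start/0]).

  %% Pin one long-running process to the last scheduler id so that
  %% scheduler regularly has something on its run queue.
  start() ->
      Last = erlang:system_info(schedulers_online),
      spawn_opt(fun loop/0, [{scheduler, Last}]).

  loop() ->
      %% Wake up periodically instead of spinning, to avoid burning CPU.
      timer:sleep(1000),
      loop().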
Now, maybe this is more of a feature than a bug. If so, I'm not sure it's helping, because once the run queue goes dry I have not seen it come back online automatically; this causes a lag in the system, as processes are being cleared off by 2 threads instead of 4. In addition, I see this behaviour is also present in R14B03.
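Since the run queue does not come back online by itself, I have also been playing with a small watchdog along these lines. Again just a sketch under some assumptions: it uses the scheduler_wall_time statistics added in R15B01 (so it won't help on R14B03), the 5-second sample interval and 1% utilization threshold are arbitrary, and on a lightly loaded node it will also poke schedulers that are merely idle, which should be harmless.

  -module(sched_watchdog).
  -export([start/0]).

  start() ->
      erlang:system_flag(scheduler_wall_time, true),
      Prev = lists:sort(erlang:statistics(scheduler_wall_time)),
      spawn(fun() -> loop(Prev) end).

  loop(Prev) ->
      timer:sleep(5000),
      Now = lists:sort(erlang:statistics(scheduler_wall_time)),
      %% Per-scheduler utilization since the last sample.
      Util = [{Id, (A1 - A0) / max(T1 - T0, 1)}
              || {{Id, A0, T0}, {Id, A1, T1}} <- lists:zip(Prev, Now)],
      %% Poke any scheduler that looks dead with the spawn trick above.
      [spawn_opt(fun() -> ok end, [{scheduler, Id}])
       || {Id, U} <- Util, U < 0.01],
      loop(Now).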
--
Jebu

On Tue, Jun 12, 2012 at 10:51 AM, Jebu Ittiachen <jebu.ittiachen@gmail.com> wrote:
> Hi,
>  I seem to have hit upon a weird bug in the Erlang scheduler. I'm running
> R15B01 on 64-bit Linux, Erlang compiled with HiPE disabled. Erlang starts up
> with 4 scheduler threads and everything is OK for a while. After a period of
> time the CPU usage on the machine drops and things start going slow. top -H
> shows 2 of the 4 threads running at around 15% and the other 2 at 95%;
> typically all 4 threads show more or less the same CPU utilization. strace
> on the process shows the two sluggish threads alternating between calls to
> futex_wait and sched_yield while the other two are doing a lot of other
> stuff.
>
>  Here is a sample of strace -f -p <pid> | grep <thread id>:
>
> 20292 sched_yield( <unfinished ...>
> 20292 <... sched_yield resumed> ) = 0
> 20292 sched_yield( <unfinished ...>
> 20292 <... sched_yield resumed> ) = 0
> 20292 sched_yield() = 0
> 20292 sched_yield() = 0
> 20292 sched_yield() = 0
> 20292 sched_yield() = 0
> 20292 sched_yield() = 0
> 20292 sched_yield() = 0
> 20292 sched_yield() = 0
> 20292 sched_yield() = 0
> 20292 sched_yield() = 0
> 20292 sched_yield() = 0
> 20292 futex(0x1bf2220, FUTEX_WAIT_PRIVATE, 4294967295, NULL <unfinished ...>
> 20292 <... futex resumed> ) = 0
> 20292 sched_yield() = 0
> 20292 sched_yield() = 0
> 20292 sched_yield() = 0
> 20292 sched_yield() = 0
> 20292 sched_yield() = 0
> 20292 sched_yield() = 0
> 20292 sched_yield() = 0
> 20292 sched_yield() = 0
> 20292 sched_yield() = 0
> 20292 sched_yield() = 0
>
>  My only option out of this now is to restart the node, after which it again
> runs happily for a while before scheduler threads start dropping off. I'd be
> happy to provide any more dumps/info that may be needed to get to the bottom
> of this.
>
> Thanks
> --
> Jebu Ittiachen
> jebu.ittiachen@gmail.com