[erlang-bugs] Scheduler thread spins in futex_wait and sched_yield

Fri Jun 15 10:10:58 CEST 2012

Well, some more findings on this.
My workload is typically high, and I have seen run queue sizes of around 600.

The schedulers always drop off in descending order: first scheduler 4,
then scheduler 3. I have not seen any drop off after that. The scheduler
threads themselves appear to be OK, but the run queue associated with each
affected scheduler is empty; at least that is what I could find by
monitoring statistics(run_queues). The good thing is that I can get the
scheduler ticking again without a restart: a spawn with the scheduler
option set to that particular scheduler id gets things back to normal.
spawn_opt(fun() -> ok end, [{scheduler, 4}]).
Spawning with scheduler 4 brings back both 3 and 4.
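
The check and the workaround above can be sketched as a small module. This
is only an illustration: the module and function names (sched_kick,
run_queues/0, kick/1, kick_all/0) are made up for this example, and both
statistics(run_queues) and the {scheduler, Id} spawn option are undocumented,
so their behaviour may differ between releases.

```erlang
-module(sched_kick).
-export([run_queues/0, kick/1, kick_all/0]).

%% Pair each run queue length with its position (1, 2, ...).
%% statistics(run_queues) is undocumented; it returned a tuple on the
%% releases discussed here, so a list is accepted as well just in case.
run_queues() ->
    Lens = case erlang:statistics(run_queues) of
               T when is_tuple(T) -> tuple_to_list(T);
               L when is_list(L) -> L
           end,
    lists:zip(lists:seq(1, length(Lens)), Lens).

%% Spawn a no-op process bound to the given scheduler id, as in the
%% one-liner above. Returns the pid of the short-lived process.
kick(SchedulerId) ->
    spawn_opt(fun() -> ok end, [{scheduler, SchedulerId}]).

%% Kick every scheduler in turn (hypothetical convenience helper).
kick_all() ->
    [kick(Id) || Id <- lists:seq(1, erlang:system_info(schedulers))].
```

Calling sched_kick:run_queues() before and after sched_kick:kick(4) should
show whether the queue for scheduler 4 is being used again.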

Extending this, I have been able to make sure that schedulers don't drop
off just by having one long-running active process tied to the last
scheduler id.
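
A minimal sketch of such a keep-alive process follows. The names
(sched_keepalive, start/0, start/2, stop/1) and the 100 ms wake-up interval
are assumptions for illustration, not what I actually run, and whether a
periodic timer wake-up counts as "active" enough to keep the scheduler
online is untested here.

```erlang
-module(sched_keepalive).
-export([start/0, start/2, stop/1]).

%% Bind a keep-alive process to the last scheduler id,
%% waking every 100 ms (assumed interval).
start() ->
    start(erlang:system_info(schedulers), 100).

%% Bind a keep-alive process to SchedulerId, waking every IntervalMs.
%% The {scheduler, Id} option is undocumented.
start(SchedulerId, IntervalMs) ->
    spawn_opt(fun() -> loop(IntervalMs) end, [{scheduler, SchedulerId}]).

%% Ask the keep-alive process to exit.
stop(Pid) ->
    Pid ! stop,
    ok.

%% Wake up periodically so the bound scheduler always has some work.
loop(IntervalMs) ->
    receive
        stop -> ok
    after IntervalMs ->
        loop(IntervalMs)
    end.
```

Pid = sched_keepalive:start() starts it, and sched_keepalive:stop(Pid)
shuts it down again.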

Now, maybe this is more of a feature than a bug. If so, I'm not sure it's
helping: once the run queue goes dry I have not seen it come back online
automatically. This causes a lag in the system, as processes are being
cleared by 2 threads instead of 4. I also see this behaviour in R14B03.

--
Jebu

On Tue, Jun 12, 2012 at 10:51 AM, Jebu Ittiachen
<jebu.ittiachen@REDACTED> wrote:

> Hi,
>   I seem to have hit upon a weird bug in the Erlang scheduler. I'm running
> R15B01 on Linux 64bit, Erlang compiled with HiPE disabled. Erlang starts up
> with 4 scheduler threads and everything is ok for a while. After a period
> of time the CPU usage drops on the machine and things start going slow. top
> -H shows 2 threads of the 4 running at around 15% and the other 2 at 95%.
> Typically all 4 threads are more or less in the same CPU utilization
> figures. strace on the process shows the two sluggish threads alternating
> between calls to futex_wait and sched_yield while the other two are doing a
> lot of other stuff.
>
>   Here is a sample of strace -f -p <pid> |grep <thread id>
>
> 20292 sched_yield( <unfinished ...>
> 20292 <... sched_yield resumed> )       = 0
> 20292 sched_yield( <unfinished ...>
> 20292 <... sched_yield resumed> )       = 0
> 20292 sched_yield()                     = 0
> 20292 sched_yield()                     = 0
> 20292 sched_yield()                     = 0
> 20292 sched_yield()                     = 0
> 20292 sched_yield()                     = 0
> 20292 sched_yield()                     = 0
> 20292 sched_yield()                     = 0
> 20292 sched_yield()                     = 0
> 20292 sched_yield()                     = 0
> 20292 sched_yield()                     = 0
> 20292 futex(0x1bf2220, FUTEX_WAIT_PRIVATE, 4294967295, NULL <unfinished ...>
> 20292 <... futex resumed> )             = 0
> 20292 sched_yield()                     = 0
> 20292 sched_yield()                     = 0
> 20292 sched_yield()                     = 0
> 20292 sched_yield()                     = 0
> 20292 sched_yield()                     = 0
> 20292 sched_yield()                     = 0
> 20292 sched_yield()                     = 0
> 20292 sched_yield()                     = 0
> 20292 sched_yield()                     = 0
> 20292 sched_yield()                     = 0
>
>   My only option out of this now is to restart the node, after which it
> again runs happily for a while before scheduler threads start dropping
> off. I'd be happy to provide any more dumps/info that may be needed to get
> to the bottom of this.
>
> Thanks
> --
> Jebu Ittiachen
> jebu.ittiachen@REDACTED
>