[erlang-questions] Re: how do I do the equivalent of ets:tab2list(timer_tab) for BIF timers

Matthias Lang matthias@REDACTED
Tue Dec 7 17:18:24 CET 2010


Matt> >In short: A bug in the linux kernel on Au1000 MIPS CPUs causes
Matt> >       Erlang's timers and timeouts to go haywire, sometimes.
Matt> >       Most likely, nobody but me is affected. Erlang is not the problem.

Per> Hm, we actually recently had a report that the VM (R13B03) aborted on
Per> startup with "Unexpected behaviour from operating system high resolution
Per> timer" - this is from erl_time_sup.c, and (on Linux) the only reason I
Per> can see is that clock_gettime(CLOCK_MONOTONIC) went backwards (which
Per> should *never* happen, of course).

I agree that it should never happen.

In my case, the clock_gettime() timestamp temporarily jumped forwards
by hours and then jumped backwards by about the same amount. That
won't trigger the message you mention: the "Unexpected..." message
only appears if the time jumps back to before the value it had when
Erlang started.

I considered adding code to erl_time_sup.c to detect _any_ case of
time going backwards so that it would be easier to detect similar
problems. I decided not to because (a) I didn't want to complicate
Erlang with code to check for buggy OS behaviour and (b)
CLOCK_MONOTONIC is a Linux-ism and I'm not sure whether the same
"always goes forwards" guarantee applies to other sorts of
high-resolution timers. It probably does.
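
To make the difference between the two checks concrete, here is a
minimal sketch. It is not the code in erl_time_sup.c: the struct and
function names are made up for illustration, and the "before start"
check is modelled only on the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical checker state; not the data structure erl_time_sup.c uses. */
struct mono_check {
    int64_t start_ns;   /* first reading, taken at startup */
    int64_t last_ns;    /* most recent reading seen so far */
};

static void mono_check_init(struct mono_check *c, int64_t now_ns)
{
    c->start_ns = now_ns;
    c->last_ns = now_ns;
}

/* Check as described in this message: only a reading that is
 * earlier than the startup reading counts as a failure. */
static bool before_start(const struct mono_check *c, int64_t now_ns)
{
    return now_ns < c->start_ns;
}

/* Stricter check: any reading earlier than the previous one counts.
 * Records the new reading as a side effect. */
static bool stepped_back(struct mono_check *c, int64_t now_ns)
{
    bool back = now_ns < c->last_ns;
    c->last_ns = now_ns;
    return back;
}
```

Feed this a reading that jumps forwards by hours followed by one that
returns to roughly the right time: before_start() stays false for
both, while stepped_back() reports the second one.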

Matt> >The root cause was a concurrency/locking problem: a missing lock in an
Matt> >interrupt routine which meant that a call to clock_gettime() at the
Matt> >"wrong" moment could read junk.

Per> So was this specific to the above CPU? I believe the system where the
Per> abort happened was x86_64 (running CentOS 5.3).

The fix is specific to the MIPS Au1000 CPU. See the patch below.

It's possible that other CPUs have the same problem, but not very
likely. The timer code is very CPU-specific, but then again, it might
have been copied from some other architecture and then adapted. I
didn't see the bug anywhere else when I took a quick look at other
MIPS variants yesterday. I didn't look at x86 at all.

I have a 20-line C test program which always fails immediately on
the broken MIPS kernel. It never fails on my x86_64 machine (kernel
2.6.32-5). If you want the test program, mail me.
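
For readers who just want the general idea, a check of this kind can
be sketched as below. This is not the test program mentioned above,
only an illustration of the approach; the function names and the
iteration count are assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Returns 1 if reading b is strictly earlier than reading a. */
static int ts_earlier(const struct timespec *a, const struct timespec *b)
{
    if (b->tv_sec != a->tv_sec)
        return b->tv_sec < a->tv_sec;
    return b->tv_nsec < a->tv_nsec;
}

/* Polls CLOCK_MONOTONIC `iterations` times in a tight loop.
 * Returns the number of readings that were earlier than the one
 * before them, or -1 if clock_gettime() itself fails. */
static long count_backward_steps(long iterations)
{
    struct timespec prev, cur;
    long backward = 0;

    if (clock_gettime(CLOCK_MONOTONIC, &prev) != 0)
        return -1;

    for (long i = 0; i < iterations; i++) {
        if (clock_gettime(CLOCK_MONOTONIC, &cur) != 0)
            return -1;
        if (ts_earlier(&prev, &cur))
            backward++;
        prev = cur;
    }
    return backward;
}
```

Calling count_backward_steps() from main() with a few million
iterations and printing the result is enough to use it. On a kernel
that behaves, the result should be 0. On older glibc versions you may
need to link with -lrt.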

Matthias

======================================================================
diff --git a/arch/mips/au1000/common/time.c b/arch/mips/au1000/common/time.c
index 1a11aff..e5ec03a 100644
--- a/arch/mips/au1000/common/time.c
+++ b/arch/mips/au1000/common/time.c
@@ -148,11 +148,14 @@ irqreturn_t counter0_irq(int irq, void *dev_id)
        static int jiffie_drift = 0;
        irq_enter();
 
+       write_seqlock(&xtime_lock);
+
        ack_rise_edge_irq(irq); // Do this early so we do not miss next one
 
        if (au_readl(SYS_COUNTER_CNTRL) & SYS_CNTRL_M20) {
                /* should never happen! */
                printk(KERN_WARNING "counter 0 w status error\n");
+               write_sequnlock(&xtime_lock);
                irq_exit();
                return IRQ_NONE;
        }
@@ -221,6 +224,7 @@ irqreturn_t counter0_irq(int irq, void *dev_id)
        }
 #endif
 
+       write_sequnlock(&xtime_lock);
        irq_exit();
        return IRQ_HANDLED;
 }
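
For anyone unfamiliar with seqlocks, the idea behind the added
write_seqlock()/write_sequnlock() calls can be sketched in user space
as below. This is not kernel code: it is a simplified single-writer
analogue using C11 atomics, and the struct and function names are
invented for illustration. The writer makes a sequence counter odd
while it updates the time, and a reader retries if it saw an odd
counter or if the counter changed during its read. Without the writer
side, a reader has no way to notice that it read a half-updated value.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a time value guarded by a sequence counter. */
struct seq_time {
    atomic_uint seq;          /* odd while a write is in progress */
    _Atomic int64_t sec;
    _Atomic int64_t nsec;
};

/* Writer side (single writer assumed): bump the counter to odd,
 * update the fields, then bump it to the next even value. */
static void seq_time_write(struct seq_time *t, int64_t sec, int64_t nsec)
{
    unsigned s = atomic_load_explicit(&t->seq, memory_order_relaxed);

    atomic_store_explicit(&t->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&t->sec, sec, memory_order_relaxed);
    atomic_store_explicit(&t->nsec, nsec, memory_order_relaxed);
    atomic_store_explicit(&t->seq, s + 2, memory_order_release);
}

/* Reader side: retry until the counter is even and unchanged
 * across the read, so both fields come from the same update. */
static void seq_time_read(struct seq_time *t, int64_t *sec, int64_t *nsec)
{
    unsigned before, after;

    do {
        before = atomic_load_explicit(&t->seq, memory_order_acquire);
        *sec = atomic_load_explicit(&t->sec, memory_order_relaxed);
        *nsec = atomic_load_explicit(&t->nsec, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        after = atomic_load_explicit(&t->seq, memory_order_relaxed);
    } while ((before & 1u) || before != after);
}
```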

