[erlang-bugs] infinite loop when beam.smp compiled with -O2 on debian lenny

Mikael Pettersson mikpe@REDACTED
Mon May 3 23:54:31 CEST 2010


Chetan Ahuja writes:
 > Hi,
 > 
 >   We hit a bug while running rabbitmq where the beam.smp process was stuck
 > in a tight loop in the erts_poll_info method.
 > The process was eating up 100% of exactly one core (on a multi-core box) and
 > rabbitmq was dysfunctional.  Unfortunately, I could not create a small test
 > case to reproduce this condition, but it would happen quite frequently
 > while rabbitmq was in operation.
 > 
 > The C code for the function didn't provide any hints about what could be
 > spinning in it (this was my first time looking at this codebase, though).
 > Finally, looking through the disassembly in gdb at the point where our
 > process was spinning, I saw the following lines in the
 > erts_poll_info_kp method:
 > 
 > 
 > 0x00000000004f0fe9 <erts_poll_info_kp+185>:     nopl   0x0(%rax)
 > 0x00000000004f0ff0 <erts_poll_info_kp+192>:     jmp    0x4f0fe9 <erts_poll_info_kp+185>
 > 
 > (Similar assembly code  can be seen  when  the KERNEL_POLL  option is
 > disabled.)
 > 
 >  Clearly the above will trivially spin forever anytime we get into that
 > codepath.  The above
 > looks suspiciously like some code got optimized out by the compiler leaving
 > the crazy
 > loop code.
 > 
 > So I compiled with -O1 and then with no optimization at all.  With -O1, I
 > saw a weird jmp instruction jumping to its own address:
 > 
 > 0x0000000000517102 <erts_poll_info_kp+60>:      jmp    0x517102 <erts_poll_info_kp+60>
 > 
 > With no optimization, none of those trivial spins existed, but I didn't
 > analyze the unoptimized code enough to say whether it can be proven to
 > have an infinite loop (i.e., whether the optimizing compiler is simply
 > doing its job vs. this being a compiler bug).
 > 
 > Anyway, this problem has existed at least since the erlang-base_12.b.3-dfsg
 > debian package version and has been verified to exist in the github
 > version as of today.
 > 
 > 
 >  Here's the gcc and debian version info:
 >  $ gcc --version
 > gcc-4.3.real (Debian 4.3.2-1.1) 4.3.2
 > Copyright (C) 2008 Free Software Foundation, Inc.

I looked at the procedure in question (not so easy to locate due to
some "creative" C preprocessor abuse), and noticed an obvious bug:
there's a loop over a linked list that forgets to actually advance
the node pointer to the next element. When optimizing, gcc notices that
the loop can never terminate once entered, so the calculations in its
body are dead and get omitted, which results in the type of object code
shown above. Thus, it's an Erlang VM bug, not a gcc miscompilation.
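
To make the pattern concrete, here is a minimal standalone sketch of the
loop with the pointer advance in place. The struct and function names
(update_block, count_pending) are hypothetical stand-ins chosen for
illustration, not the actual definitions in erl_poll.c; only the shape of
the loop mirrors the real code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for ErtsPollSetUpdateRequestsBlock. */
struct update_block {
    int len;                    /* pending requests held in this block */
    struct update_block *next;  /* next block, or NULL at end of list */
};

/*
 * Walk the list, returning the sum of len over all blocks and adding
 * sizeof(struct update_block) to *size once per block visited.
 */
static int count_pending(const struct update_block *blk, size_t *size)
{
    int pending = 0;

    assert(size != NULL);
    while (blk) {
        *size += sizeof(struct update_block);
        pending += blk->len;
        blk = blk->next;  /* without this step, a non-empty list never ends */
    }
    return pending;
}
```

If the last statement in the loop body is dropped, blk never changes, so
the loop condition stays true forever for any non-empty list, which is
the situation described above.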

Try the patch below and let us know if it solves your problem.

/Mikael

--- otp_src_R13B03/erts/emulator/sys/common/erl_poll.c.~1~	2009-03-12 13:16:29.000000000 +0100
+++ otp_src_R13B03/erts/emulator/sys/common/erl_poll.c	2010-05-03 23:41:32.000000000 +0200
@@ -2404,6 +2404,7 @@ ERTS_POLL_EXPORT(erts_poll_info)(ErtsPol
 	while (urqbp) {
 	    size += sizeof(ErtsPollSetUpdateRequestsBlock);
 	    pending_updates += urqbp->len;
+	    urqbp = urqbp->next;
 	}
     }
 #endif

