[erlang-bugs] SEGV in process_main() line 3163 [r15B03]

Mikael Pettersson mikpelinux@REDACTED
Thu Dec 18 10:58:40 CET 2014


Anthony Ramine writes:
 > On 15 Dec 2014, at 13:16, Mikael Pettersson <mikpelinux@REDACTED> wrote:
 > 
 > > [2nd attempt to send this, my apologies if you see this twice]
 > > 
 > > We've had two segfaults now in r15's process_main(), line 3163, which is
 > > the register flushing loop just before the current process is swapped out:
 > > 
 > > ==snip==
 > >     argp = c_p->arg_reg;
 > >     for (i = c_p->arity - 1; i > 0; i--) {
 > > =>       argp[i] = reg[i];
 > >     }
 > >     c_p->arg_reg[0] = r(0);
 > >     SWAPOUT;
 > > ==snip==
 > > 
 > > The core file is unfortunately truncated: I can see the registers at the
 > > point of the SEGV, but not inspect any memory.  The registers and
 > > disassembly are:
 > > 
 > > ==snip==
 > > Program terminated with signal 11, Segmentation fault.
 > > #0  process_main () at beam/beam_emu.c:3163
 > > 3163    beam/beam_emu.c: No such file or directory.
 > > (gdb) info reg
 > > rax            0x7e7d77fff3f8   139077349274616
 > > rbx            0x7f243b82feb8   139793593990840
 > > rcx            0x0      0
 > > rdx            0x53ba78 5487224
 > > rsi            0x7e7d75622030   139077305376816
 > > rdi            0x0      0
 > > rbp            0x1414400        0x1414400
 > > rsp            0x7f2467432cf0   0x7f2467432cf0
 > > r8             0x0      0
 > > r9             0x0      0
 > > r10            0x0      0
 > > r11            0x246    582
 > > r12            0x7f2471b407c8   139794503174088
 > > r13            0x7e7f4309cae0   139085050661600
 > > r14            0x7e7f42e57168   139085048279400
 > > r15            0xc63f   50751
 > > rip            0x5425e4 0x5425e4 <process_main+29892>
 > > eflags         0x10202  [ IF RF ]
 > > cs             0x33     51
 > > ss             0x2b     43
 > > ds             0x0      0
 > > es             0x0      0
 > > fs             0x0      0
 > > gs             0x0      0
 > > (gdb) disassemble 0x5425a6,0x542610
 > > Dump of assembler code from 0x5425a6 to 0x542610:
 > >   0x00000000005425a6 <process_main+29830>:     mov    0x90(%rbp),%rdx
 > >   0x00000000005425ad <process_main+29837>:     mov    %rax,0x98(%rbp)
 > >   0x00000000005425b4 <process_main+29844>:     mov    %edx,0xa0(%rbp)
 > >   0x00000000005425ba <process_main+29850>:     mov    0xd0(%rbp),%rcx
 > >   0x00000000005425c1 <process_main+29857>:     lea    -0x1(%rdx),%eax
 > >   0x00000000005425c4 <process_main+29860>:     mov    0x98(%rbp),%rsi
 > >   0x00000000005425cb <process_main+29867>:     test   %eax,%eax
 > >   0x00000000005425cd <process_main+29869>:     mov    %rcx,0x48(%rsp)
 > >   0x00000000005425d2 <process_main+29874>:     jle    0x5425fd <process_main+29917>
 > >   0x00000000005425d4 <process_main+29876>:     cltq
 > >   0x00000000005425d6 <process_main+29878>:     sub    $0x2,%edx
 > >   0x00000000005425d9 <process_main+29881>:     shl    $0x3,%rax
 > >   0x00000000005425dd <process_main+29885>:     add    %rax,%r12
 > >   0x00000000005425e0 <process_main+29888>:     lea    (%rsi,%rax,1),%rax
 > > => 0x00000000005425e4 <process_main+29892>:     mov    (%r12),%rcx
 > >   0x00000000005425e8 <process_main+29896>:     sub    $0x1,%edx
 > >   0x00000000005425eb <process_main+29899>:     sub    $0x8,%r12
 > >   0x00000000005425ef <process_main+29903>:     mov    %rcx,(%rax)
 > >   0x00000000005425f2 <process_main+29906>:     lea    0x1(%rdx),%ecx
 > >   0x00000000005425f5 <process_main+29909>:     sub    $0x8,%rax
 > >   0x00000000005425f9 <process_main+29913>:     test   %ecx,%ecx
 > >   0x00000000005425fb <process_main+29915>:     jg     0x5425e4 <process_main+29892>
 > >   0x00000000005425fd <process_main+29917>:     mov    %r15,(%rsi)
 > >   0x0000000000542600 <process_main+29920>:     mov    %r14,0x0(%rbp)
 > >   0x0000000000542604 <process_main+29924>:     mov    $0x8,%esi
 > >   0x0000000000542609 <process_main+29929>:     mov    %r13,0x8(%rbp)
 > >   0x000000000054260d <process_main+29933>:     mov    %rbx,0xe0(%rbp)
 > > End of assembler dump.
 > > ==snip==
 > > 
 > > I interpret this as follows:
 > > 1. c_p == %rbp == 0x1414400
 > > 2. &argp[i] == %rax == 0x7e7d77fff3f8
 > >   from this I deduce that c_p->arg_reg != c_p->def_arg_reg, so it points
 > >   to a dynamically allocated area separate from *c_p
 > > 3. i == c_p->arity - 1 == %rdx == 0x53ba78
 > >   this is clearly bonkers, and what's causing references into unmapped
 > >   memory
 > > 4. &reg[i] == %r12 == 0x7f2471b407c8
 > >   this is consistent with indexing a frame-local array at 0x53ba78
 > > 
 > > Basically, my conclusion is that c_p->arity has been clobbered, causing
 > > out-of-range accesses in this loop.
 > > 
 > > We've had this exact crash twice now, in August and last Thursday (Dec 11).
 > > 
 > > I realize the lack of a complete core dump makes this impossible to debug.
 > > What I'm hoping for is that someone might recollect some post-R15 change
 > > or fix that might have something to do with unexpected clobbers of process
 > > structs.
 > > 
 > > /Mikael
 > 
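As a cross-check of points 2 and 3 in the interpretation above, the loop index
at the fault can be recomputed from the dumped registers alone.  The following
is only a minimal sketch: it assumes %rsi still holds argp (it is loaded from
0x98(%rbp) before the loop and not written inside it) and that each element is
8 bytes wide (the shl $0x3 scaling).  The names word_index, fault_index and
FAULT_* are made up for this illustration and do not come from beam_emu.c.

```c
#include <assert.h>
#include <stdint.h>

/* Register values copied from the "info reg" output in the report. */
#define FAULT_RAX UINT64_C(0x7e7d77fff3f8) /* &argp[i] at the fault */
#define FAULT_RSI UINT64_C(0x7e7d75622030) /* argp; assumed equal to c_p->arg_reg */
#define FAULT_RDX UINT64_C(0x53ba78)       /* counter register used by the compiled loop */

/* Index of the 8-byte word at addr within an array starting at base. */
uint64_t word_index(uint64_t addr, uint64_t base)
{
    assert(addr >= base);
    assert((addr - base) % 8 == 0);
    return (addr - base) / 8;
}

/* Loop index i at the faulting instruction, recovered from %rax and %rsi. */
uint64_t fault_index(void)
{
    return word_index(FAULT_RAX, FAULT_RSI);
}
```

With these values fault_index() evaluates to 0x53ba79, which is %rdx + 1.  That
agrees with the generated code: it does "sub $0x2,%edx" before the loop and adds
the 1 back ("lea 0x1(%rdx),%ecx") for the termination test, so %rdx trails i by
one at the faulting instruction.  Either way the index is in the millions, far
beyond any plausible arity.
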
 > How do you know it's not a NIF doing strange things or whatnot?

I can't know for sure, but I find it unlikely that one of the few NIFs we use
(we use 3 I think) would clobber c_p->arity and nothing else.  Given the other
concurrency-related port bug in r15 I find something like that much more likely.

 > Did you manage to reproduce it afterwards?

The December incident is a reproducer of sorts, since it means the exact same
crash has now occurred twice.

 > Did you try with a debug build?

Sorry, no, we only use release builds on our live systems.  This doesn't happen
often enough to motivate rebooting them with debug builds right now.

We're just holding our breath for now and hoping to upgrade to r16 in Q1.

/Mikael