[erlang-bugs] SEGV in process_main() line 3163 [r15B03]
Mikael Pettersson
mikpelinux@REDACTED
Thu Dec 18 10:58:40 CET 2014
Anthony Ramine writes:
> On 15 Dec 2014, at 13:16, Mikael Pettersson <mikpelinux@REDACTED> wrote:
>
> > [2nd attempt to send this, my apologies if you see this twice]
> >
> > We've had two segfaults now in r15's process_main(), line 3163, which is
> > the register flushing loop just before the current process is swapped out:
> >
> > ==snip==
> > argp = c_p->arg_reg;
> > for (i = c_p->arity - 1; i > 0; i--) {
> > => argp[i] = reg[i];
> > }
> > c_p->arg_reg[0] = r(0);
> > SWAPOUT;
> > ==snip==
> >
> > The core file is unfortunately truncated: I can see the registers at the
> > point of the SEGV, but not inspect any memory. The registers and
> > disassembly are:
> >
> > ==snip==
> > Program terminated with signal 11, Segmentation fault.
> > #0 process_main () at beam/beam_emu.c:3163
> > 3163 beam/beam_emu.c: No such file or directory.
> > (gdb) info reg
> > rax 0x7e7d77fff3f8 139077349274616
> > rbx 0x7f243b82feb8 139793593990840
> > rcx 0x0 0
> > rdx 0x53ba78 5487224
> > rsi 0x7e7d75622030 139077305376816
> > rdi 0x0 0
> > rbp 0x1414400 0x1414400
> > rsp 0x7f2467432cf0 0x7f2467432cf0
> > r8 0x0 0
> > r9 0x0 0
> > r10 0x0 0
> > r11 0x246 582
> > r12 0x7f2471b407c8 139794503174088
> > r13 0x7e7f4309cae0 139085050661600
> > r14 0x7e7f42e57168 139085048279400
> > r15 0xc63f 50751
> > rip 0x5425e4 0x5425e4 <process_main+29892>
> > eflags 0x10202 [ IF RF ]
> > cs 0x33 51
> > ss 0x2b 43
> > ds 0x0 0
> > es 0x0 0
> > fs 0x0 0
> > gs 0x0 0
> > (gdb) disassemble 0x5425a6,0x542610
> > Dump of assembler code from 0x5425a6 to 0x542610:
> > 0x00000000005425a6 <process_main+29830>: mov 0x90(%rbp),%rdx
> > 0x00000000005425ad <process_main+29837>: mov %rax,0x98(%rbp)
> > 0x00000000005425b4 <process_main+29844>: mov %edx,0xa0(%rbp)
> > 0x00000000005425ba <process_main+29850>: mov 0xd0(%rbp),%rcx
> > 0x00000000005425c1 <process_main+29857>: lea -0x1(%rdx),%eax
> > 0x00000000005425c4 <process_main+29860>: mov 0x98(%rbp),%rsi
> > 0x00000000005425cb <process_main+29867>: test %eax,%eax
> > 0x00000000005425cd <process_main+29869>: mov %rcx,0x48(%rsp)
> > 0x00000000005425d2 <process_main+29874>: jle 0x5425fd <process_main+29917>
> > 0x00000000005425d4 <process_main+29876>: cltq
> > 0x00000000005425d6 <process_main+29878>: sub $0x2,%edx
> > 0x00000000005425d9 <process_main+29881>: shl $0x3,%rax
> > 0x00000000005425dd <process_main+29885>: add %rax,%r12
> > 0x00000000005425e0 <process_main+29888>: lea (%rsi,%rax,1),%rax
> > => 0x00000000005425e4 <process_main+29892>: mov (%r12),%rcx
> > 0x00000000005425e8 <process_main+29896>: sub $0x1,%edx
> > 0x00000000005425eb <process_main+29899>: sub $0x8,%r12
> > 0x00000000005425ef <process_main+29903>: mov %rcx,(%rax)
> > 0x00000000005425f2 <process_main+29906>: lea 0x1(%rdx),%ecx
> > 0x00000000005425f5 <process_main+29909>: sub $0x8,%rax
> > 0x00000000005425f9 <process_main+29913>: test %ecx,%ecx
> > 0x00000000005425fb <process_main+29915>: jg 0x5425e4 <process_main+29892>
> > 0x00000000005425fd <process_main+29917>: mov %r15,(%rsi)
> > 0x0000000000542600 <process_main+29920>: mov %r14,0x0(%rbp)
> > 0x0000000000542604 <process_main+29924>: mov $0x8,%esi
> > 0x0000000000542609 <process_main+29929>: mov %r13,0x8(%rbp)
> > 0x000000000054260d <process_main+29933>: mov %rbx,0xe0(%rbp)
> > End of assembler dump.
> > ==snip==
> >
> > I interpret this as follows:
> > 1. c_p == %rbp == 0x1414400
> > 2. &argp[i] == %rax == 0x7e7d77fff3f8
> > from this I deduce that c_p->arg_reg != c_p->def_arg_reg, so it points
> > to a dynamically allocated area separate from *c_p
> > 3. i == c_p->arity - 1 == %rdx == 0x53ba78
> > this is clearly bonkers, and what's causing references into unmapped
> > memory
> > 4. &reg[i] == %r12 == 0x7f2471b407c8
> > this is consistent with indexing a frame-local array at 0x53ba78
> >
> > Basically, my conclusion is that c_p->arity has been clobbered, causing
> > out-of-range accesses in this loop.
> >
> > We've had this exact crash twice now, in August and last Thursday (Dec 11).
> >
> > I realize the lack of a complete core dump makes this impossible to debug.
> > What I'm hoping for is that someone might recollect some post-R15 change
> > or fix that might have something to do with unexpected clobbers of process
> > structs.
> >
> > /Mikael
>
> How do you know it's not a NIF doing strange things or whatnot?
I can't know for sure, but I find it unlikely that one of the few NIFs we use
(we use 3 I think) would clobber c_p->arity and nothing else. Given the other
concurrency-related port bug in r15 I find something like that much more likely.
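
For illustration only, here is a minimal standalone sketch of the failure mode
discussed above: the flush loop trusts c_p->arity, so a clobbered value turns
directly into a wild index. This is not the ERTS source. struct fake_process,
flush_registers() and reg_byte_offset() are hypothetical names, Eterm is
modelled as a plain 64-bit word, and the MAX_REG value of 1024 is an
assumption that should be checked against the headers of the release in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed upper bound on the number of X registers (hypothetical value
 * for this sketch; verify against the actual emulator headers). */
#define MAX_REG 1024

/* One 64-bit term word, as on the x86-64 system in the report. */
typedef uint64_t Eterm;

/* Hypothetical stand-in for the two process fields the loop uses. */
struct fake_process {
    Eterm   *arg_reg;  /* saved argument registers */
    unsigned arity;    /* number of live argument registers */
};

/* Byte distance from ®[0] to ®[i]. */
static size_t reg_byte_offset(size_t i)
{
    return i * sizeof(Eterm);
}

/*
 * Same shape as the quoted flush loop, plus a sanity check on arity.
 * Returns 0 on success, or -1 without touching memory if arity cannot
 * be a valid register count.
 */
static int flush_registers(struct fake_process *p, const Eterm *reg, Eterm r0)
{
    int i;

    assert(p != NULL && p->arg_reg != NULL && reg != NULL);

    if (p->arity > MAX_REG) {
        return -1;              /* clobbered arity: refuse to copy */
    }
    for (i = (int)p->arity - 1; i > 0; i--) {
        p->arg_reg[i] = reg[i];
    }
    p->arg_reg[0] = r0;         /* x(0) is kept separately, as r(0) */
    return 0;
}
```

With the index 0x53ba78 from the register dump, reg_byte_offset() gives
0x29dd3c0 bytes, i.e. roughly 42 MiB past the start of the register array,
far outside anything the loop could legitimately touch. A check of this kind
would only turn the SEGV into an earlier, more informative failure; it would
not explain what overwrote the field.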
> Did you manage to reproduce it afterwards?
The December incident is a reproducer of sorts, since it means the exact same
crash has now occurred twice.
> Did you try with a debug build?
Sorry, no, we only use release builds on our live systems. This doesn't happen
often enough to motivate rebooting them with debug builds right now.
We're just holding our breaths for now and hope to upgrade to r16 in Q1.
/Mikael