[erlang-bugs] r15b03-1 SEGV in erts_port_task_schedule()
Mikael Pettersson
mikpelinux@REDACTED
Fri Aug 8 13:14:23 CEST 2014
Rickard Green writes:
> On Tue, Jul 29, 2014 at 4:30 PM, Mikael Pettersson <mikpelinux@REDACTED> wrote:
> > Mikael Pettersson writes:
> > > This is a followup to my previous report in
> > > <http://erlang.org/pipermail/erlang-bugs/2014-June/004451.html>,
> > > but it's for a different function in erl_port_task.c.
> > >
> > > We've gotten a new SEGV with r15b03-1. This time we managed to
> > > capture a truncated core dump (just threads list and registers,
> > > no thread stacks or heap memory):
> > >
> > > Program terminated with signal 11, Segmentation fault.
> > > #0 enqueue_task (ptp=<optimized out>,
> > > ptqp=<error reading variable: Cannot access memory at address 0x7f8f02a95d08>)
> > > at beam/erl_port_task.c:327
> > > 327 ptp->prev = ptqp->last;
> > > (gdb) bt
> > > #0 enqueue_task (ptp=<optimized out>,
> > > ptqp=<error reading variable: Cannot access memory at address 0x7f8f02a95d08>)
> > > at beam/erl_port_task.c:327
> > > #1 erts_port_task_schedule (id=<optimized out>,
> > > id@entry=<error reading variable: Cannot access memory at address 0x7f8efdeb8318>,
> > > pthp=<error reading variable: Cannot access memory at address 0x7f8efdeb82c0>,
> > > type=<error reading variable: Cannot access memory at address 0x7f8efdeb82cc>,
> > > event=<error reading variable: Cannot access memory at address 0x7f8efdeb82d0>,
> > > event_data=<error reading variable: Cannot access memory at address 0x7f8efdeb82d8>)
> > > at beam/erl_port_task.c:615
> > > (gdb)
> > >
> > > The code that faulted is
> > >
> > > 0x00000000004b8203 <+419>: mov 0x10(%r15),%rax
> > > 0x00000000004b8207 <+423>: mov 0x10(%rsp),%rbx
> > > 0x00000000004b820c <+428>: movq $0x0,0x8(%rbx)
> > > => 0x00000000004b8214 <+436>: mov 0x8(%rax),%rcx
> > > 0x00000000004b8218 <+440>: mov %rax,0x10(%rbx)
> > > 0x00000000004b821c <+444>: mov %rcx,(%rbx)
> > >
> > > which is enqueue_task() [line 327] as inlined in erts_port_task_schedule()
> > > [line 615]. At this point, %rax is zero according to gdb's registers dump.
> > >
> > > The relevant part of erts_port_task_schedule() is:
> > >
> > > ==snip==
> > > if (!pp->sched.taskq)
> > > pp->sched.taskq = port_taskq_init(port_taskq_alloc(), pp);
> > >
> > > ASSERT(ptp);
> > >
> > > ptp->type = type;
> > > ptp->event = event;
> > > ptp->event_data = event_data;
> > >
> > > set_handle(ptp, pthp);
> > >
> > > switch (type) {
> > > case ERTS_PORT_TASK_FREE:
> > > erl_exit(ERTS_ABORT_EXIT,
> > > "erts_port_task_schedule(): Cannot schedule free task\n");
> > > break;
> > > case ERTS_PORT_TASK_INPUT:
> > > case ERTS_PORT_TASK_OUTPUT:
> > > case ERTS_PORT_TASK_EVENT:
> > > erts_smp_atomic_inc_relb(&erts_port_task_outstanding_io_tasks);
> > > /* Fall through... */
> > > default:
> > > enqueue_task(pp->sched.taskq, ptp);
> > > break;
> > > }
> > > ==snip==
> > >
> > > The SEGV implies that pp->sched.taskq is NULL at the call to enqueue_task().
> > >
> > > The erts_smp_atomic_inc_relb() and set_handle() calls do not affect *pp,
> > > and I don't see any aliasing between *ptp and *pp, so the assignments to
> > > *ptp do not affect *pp either.
> > >
> > > So for pp->sched.taskq to be NULL at the bottom it would have to be NULL
> > > after the call to port_taskq_init(), which implies that port_taskq_alloc()
> > > returned NULL.
> > >
> > > port_taskq_alloc() is generated via ERTS_SCHED_PREF_QUICK_ALLOC_IMPL;
> > > if one expands that it becomes:
> > >
> > > void erts_alloc_n_enomem(ErtsAlcType_t,Uint)
> > > __attribute__((noreturn));
> > >
> > > static __inline__
> > > void *erts_alloc(ErtsAlcType_t type, Uint size)
> > > {
> > > void *res;
> > > res = (*erts_allctrs[(((type) >> (0)) & (15))].alloc)(
> > > (((type) >> (7)) & (255)),
> > > erts_allctrs[(((type) >> (0)) & (15))].extra,
> > > size);
> > > if (!res)
> > > erts_alloc_n_enomem((((type) >> (7)) & (255)), size);
> > > return res;
> > > }
> > >
> > > static __inline__ ErtsPortTaskQueue * port_taskq_alloc(void)
> > > {
> > > ErtsPortTaskQueue *res = port_taskq_pre_alloc();
> > > if (!res)
> > > res = erts_alloc((4564), sizeof(ErtsPortTaskQueue));
> > > return res;
> > > }
> > >
> > > But given this code, I don't see how erts_alloc() or port_taskq_alloc()
> > > could ever return NULL.
> > >
> > > Which leads me to suspect that there's a concurrency bug that's
> > > causing *pp to be clobbered behind our backs.
> > >
> > > Ideas?
> >
>
> Thanks for the excellent bug-report! I've found a concurrency bug (as
> you suspected) that is likely to have caused the crash you got.
>
> The fix can be found in the rickard/port-emigrate-bug/OTP-12084 branch
> in my github repo
> <https://github.com/rickard-green/otp/tree/rickard/port-emigrate-bug/OTP-12084>.
> The fix is based on the OTP_R15B03-1 tag. I've only briefly tested the
> fix, but will test it more thoroughly. If further changes are needed
> I'll post here again.

Thanks Rickard! The fix looks sane enough; is it safe (but possibly
incomplete) to use right now, or do you want us to wait until you've
done more testing?

BTW, I have a debug patch in my own r15 branch which complains if it
detects a mismatch when the runq lock is re-taken, and it triggered
once this week when I ran mnesia's test suite.
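
To illustrate the kind of check described above, here is a minimal sketch.
It is not the actual debug patch. It assumes the "mismatch" is between the
run queue a port was assigned to before the runq lock was dropped and the
one it is assigned to after the lock is re-taken. All names (DemoRunQueue,
DemoPort, demo_runq_still_matches, demo_run) are hypothetical stand-ins and
not ERTS identifiers, and the migration is simulated by hand in a single
thread instead of using real locks.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins; these are NOT the real ERTS types. */
typedef struct { int ix; } DemoRunQueue;
typedef struct { DemoRunQueue *runq; } DemoPort;

/*
 * Call after the run queue lock has been re-taken. `seen` is the run queue
 * that was observed before the lock was dropped. Returns 1 if the port is
 * still assigned to that run queue; otherwise complains on stderr and
 * returns 0.
 */
static int demo_runq_still_matches(const DemoPort *pp, const DemoRunQueue *seen)
{
    if (pp->runq == seen)
        return 1;
    fprintf(stderr,
            "runq mismatch after relock: saw runq %d, port now on runq %d\n",
            seen->ix, pp->runq->ix);
    return 0;
}

/* Single-threaded walk-through; the "migration" is simulated by hand. */
static int demo_run(void)
{
    DemoRunQueue rq0 = { 0 }, rq1 = { 1 };
    DemoPort port = { &rq0 };
    const DemoRunQueue *seen = port.runq;   /* read while holding the lock */

    /* ...lock dropped and re-taken, nothing moved... */
    assert(demo_runq_still_matches(&port, seen) == 1);

    /* ...lock dropped, port moved to another run queue, lock re-taken... */
    port.runq = &rq1;
    assert(demo_runq_still_matches(&port, seen) == 0);
    return 0;
}
```

In a real emulator the comparison would sit right after the lock is
re-acquired, and a mismatch would mean the code must not keep using the
run queue pointer it read earlier.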
/Mikael