[erlang-bugs] r15b03-1 SEGV in erts_port_task_schedule()

Mon Jul 28 16:27:23 CEST 2014

This is a follow-up to my previous report in
<http://erlang.org/pipermail/erlang-bugs/2014-June/004451.html>,
but this time the crash is in a different function in erl_port_task.c.

We've gotten a new SEGV with r15b03-1.  This time we managed to
capture a truncated core dump (just the thread list and registers,
no thread stacks or heap memory):

Program terminated with signal 11, Segmentation fault.
#0  enqueue_task (ptp=<optimized out>, 
    ptqp=<error reading variable: Cannot access memory at address 0x7f8f02a95d08>)
    at beam/erl_port_task.c:327
327         ptp->prev = ptqp->last;
(gdb) bt
#0  enqueue_task (ptp=<optimized out>, 
    ptqp=<error reading variable: Cannot access memory at address 0x7f8f02a95d08>)
    at beam/erl_port_task.c:327
#1  erts_port_task_schedule (id=<optimized out>, 
    id@entry=<error reading variable: Cannot access memory at address 0x7f8efdeb8318>, 
    pthp=<error reading variable: Cannot access memory at address 0x7f8efdeb82c0>, 
    type=<error reading variable: Cannot access memory at address 0x7f8efdeb82cc>, 
    event=<error reading variable: Cannot access memory at address 0x7f8efdeb82d0>, 
    event_data=<error reading variable: Cannot access memory at address 0x7f8efdeb82d8>)
    at beam/erl_port_task.c:615
(gdb) 

The code that faulted is

   0x00000000004b8203 <+419>:   mov    0x10(%r15),%rax
   0x00000000004b8207 <+423>:   mov    0x10(%rsp),%rbx
   0x00000000004b820c <+428>:   movq   $0x0,0x8(%rbx)
=> 0x00000000004b8214 <+436>:   mov    0x8(%rax),%rcx
   0x00000000004b8218 <+440>:   mov    %rax,0x10(%rbx)
   0x00000000004b821c <+444>:   mov    %rcx,(%rbx)

which is enqueue_task() [line 327] as inlined in erts_port_task_schedule()
[line 615].  At this point %rax, which holds ptqp, is zero according to
gdb's register dump, so the load of ptqp->last from 0x8(%rax) faults.
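
To make the register-to-field mapping explicit, here is a minimal sketch
of the layout that the disassembly suggests.  The struct definitions are
an assumption reconstructed from the offsets in the listing (0x0, 0x8 and
0x10 on a 64-bit target), not copied from the r15b03-1 source, and the
Sketch* names and fault_address_for() are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layouts, inferred only from the offsets in the disassembly;
 * the real definitions live in the erl_port_task sources. */
typedef struct SketchTask_ SketchTask;
typedef struct SketchTaskQueue_ SketchTaskQueue;

struct SketchTask_ {
    SketchTask *prev;          /* 0x00: written by  mov  %rcx,(%rbx)      */
    SketchTask *next;          /* 0x08: written by  movq $0x0,0x8(%rbx)   */
    SketchTaskQueue *queue;    /* 0x10: written by  mov  %rax,0x10(%rbx)  */
};

struct SketchTaskQueue_ {
    SketchTask *first;         /* 0x00                                    */
    SketchTask *last;          /* 0x08: read by     mov  0x8(%rax),%rcx   */
};

/* Address the faulting load would touch for a given queue pointer value. */
static uintptr_t fault_address_for(uintptr_t ptqp)
{
    return ptqp + offsetof(SketchTaskQueue, last);
}
```

Under these assumptions, fault_address_for(0) is one pointer width past
address zero (0x8 on a 64-bit target), which is what a NULL ptqp would
produce at line 327.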

The relevant part of erts_port_task_schedule() is:

==snip==
    if (!pp->sched.taskq)
	pp->sched.taskq = port_taskq_init(port_taskq_alloc(), pp);

    ASSERT(ptp);

    ptp->type = type;
    ptp->event = event;
    ptp->event_data = event_data;

    set_handle(ptp, pthp);

    switch (type) {
    case ERTS_PORT_TASK_FREE:
	erl_exit(ERTS_ABORT_EXIT,
		 "erts_port_task_schedule(): Cannot schedule free task\n");
	break;
    case ERTS_PORT_TASK_INPUT:
    case ERTS_PORT_TASK_OUTPUT:
    case ERTS_PORT_TASK_EVENT:
	erts_smp_atomic_inc_relb(&erts_port_task_outstanding_io_tasks);
	/* Fall through... */
    default:
	enqueue_task(pp->sched.taskq, ptp);
	break;
    }
==snip==

The SEGV implies that pp->sched.taskq is NULL at the call to enqueue_task().

The erts_smp_atomic_inc_relb() and set_handle() calls do not affect *pp,
and I don't see any aliasing between *ptp and *pp, so the assignments to
*ptp do not affect *pp either.

So for pp->sched.taskq to be NULL when enqueue_task() is called, it would
have to be NULL after the call to port_taskq_init(), which implies that
port_taskq_alloc() returned NULL.

port_taskq_alloc() is generated via ERTS_SCHED_PREF_QUICK_ALLOC_IMPL;
if one expands that it becomes:

void erts_alloc_n_enomem(ErtsAlcType_t,Uint)
     __attribute__((noreturn));

static __inline__
void *erts_alloc(ErtsAlcType_t type, Uint size)
{
    void *res;
    res = (*erts_allctrs[(((type) >> (0)) & (15))].alloc)(
	(((type) >> (7)) & (255)),
	erts_allctrs[(((type) >> (0)) & (15))].extra,
	size);
    if (!res)
	erts_alloc_n_enomem((((type) >> (7)) & (255)), size);
    return res;
}

static __inline__ ErtsPortTaskQueue * port_taskq_alloc(void)
{
    ErtsPortTaskQueue *res = port_taskq_pre_alloc();
    if (!res)
	res = erts_alloc((4564), sizeof(ErtsPortTaskQueue));
    return res;
}

But given this code, I don't see how erts_alloc() or port_taskq_alloc()
could ever return NULL.
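
The argument can be checked on a stripped-down model of the same
pattern.  This is only a sketch: malloc() stands in for the allocator
table lookup, the sketch_* names are hypothetical, and the pre-allocation
pool is simply assumed to be exhausted so that the fallback path runs.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for erts_alloc_n_enomem(): never returns. */
static void sketch_enomem(size_t size) __attribute__((noreturn));

static void sketch_enomem(size_t size)
{
    fprintf(stderr, "out of memory allocating %zu bytes\n", size);
    abort();
}

/* Hypothetical stand-in for erts_alloc().  Control reaches the
 * return statement only when res is non-NULL. */
static void *sketch_alloc(size_t size)
{
    void *res = malloc(size);
    if (!res)
        sketch_enomem(size);
    return res;
}

/* Hypothetical stand-in for port_taskq_pre_alloc(): pretend the
 * pre-allocated pool is empty so the fallback is exercised. */
static void *sketch_pre_alloc(void)
{
    return NULL;
}

/* Same shape as port_taskq_alloc(): try the pool, then fall back. */
static void *sketch_taskq_alloc(size_t size)
{
    void *res = sketch_pre_alloc();
    if (!res)
        res = sketch_alloc(size);
    return res;
}
```

Whatever the pool returns, the only ways out of sketch_taskq_alloc() are
a non-NULL pointer or an abort, which matches the reading of the expanded
macro above.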

That leads me to suspect a concurrency bug that is clobbering *pp
behind our backs.
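
To illustrate the kind of window I have in mind, here is a deterministic,
single-threaded sketch.  A callback stands in for whatever another thread
might do between the lazy initialisation and the enqueue.  All names are
hypothetical, and this is not a claim about where the real race is.

```c
#include <stddef.h>

typedef struct { int dummy; } SketchQueue;
typedef struct { SketchQueue *taskq; } SketchPort;

/* Hook standing in for "whatever another thread does" in the window. */
typedef void (*SketchInterleave)(SketchPort *pp);

static void no_interference(SketchPort *pp) { (void) pp; }
static void clobber_taskq(SketchPort *pp)   { pp->taskq = NULL; }

/* Returns the queue pointer that enqueue_task() would receive. */
static SketchQueue *sketch_schedule(SketchPort *pp, SketchQueue *fresh,
                                    SketchInterleave other_thread)
{
    if (!pp->taskq)
        pp->taskq = fresh;   /* lazy initialisation, as in the snippet */

    other_thread(pp);        /* window between the check and the use */

    return pp->taskq;        /* NULL here corresponds to the SEGV */
}
```

With no_interference the function returns the freshly installed queue;
with clobber_taskq it returns NULL even though the allocation succeeded,
which is the state the core dump shows.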

Ideas?

/Mikael