[erlang-bugs] r15b03-1 SEGV in erts_port_task_schedule()

Mikael Pettersson mikpelinux@REDACTED
Fri Aug 8 13:14:23 CEST 2014


Rickard Green writes:
 > On Tue, Jul 29, 2014 at 4:30 PM, Mikael Pettersson <mikpelinux@REDACTED> wrote:
 > > Mikael Pettersson writes:
 > >  > This is a followup to my previous report in
 > >  > <http://erlang.org/pipermail/erlang-bugs/2014-June/004451.html>,
 > >  > but it's for a different function in erl_port_task.c.
 > >  >
 > >  > We've gotten a new SEGV with r15b03-1.  This time we managed to
 > >  > capture a truncated core dump (just threads list and registers,
 > >  > no thread stacks or heap memory):
 > >  >
 > >  > Program terminated with signal 11, Segmentation fault.
 > >  > #0  enqueue_task (ptp=<optimized out>,
 > >  >     ptqp=<error reading variable: Cannot access memory at address 0x7f8f02a95d08>)
 > >  >     at beam/erl_port_task.c:327
 > >  > 327         ptp->prev = ptqp->last;
 > >  > (gdb) bt
 > >  > #0  enqueue_task (ptp=<optimized out>,
 > >  >     ptqp=<error reading variable: Cannot access memory at address 0x7f8f02a95d08>)
 > >  >     at beam/erl_port_task.c:327
 > >  > #1  erts_port_task_schedule (id=<optimized out>,
 > >  >     id@REDACTED=<error reading variable: Cannot access memory at address 0x7f8efdeb8318>,
 > >  >     pthp=<error reading variable: Cannot access memory at address 0x7f8efdeb82c0>,
 > >  >     type=<error reading variable: Cannot access memory at address 0x7f8efdeb82cc>,
 > >  >     event=<error reading variable: Cannot access memory at address 0x7f8efdeb82d0>,
 > >  >     event_data=<error reading variable: Cannot access memory at address 0x7f8efdeb82d8>)
 > >  >     at beam/erl_port_task.c:615
 > >  > (gdb)
 > >  >
 > >  > The code that faulted is
 > >  >
 > >  >    0x00000000004b8203 <+419>:   mov    0x10(%r15),%rax
 > >  >    0x00000000004b8207 <+423>:   mov    0x10(%rsp),%rbx
 > >  >    0x00000000004b820c <+428>:   movq   $0x0,0x8(%rbx)
 > >  > => 0x00000000004b8214 <+436>:   mov    0x8(%rax),%rcx
 > >  >    0x00000000004b8218 <+440>:   mov    %rax,0x10(%rbx)
 > >  >    0x00000000004b821c <+444>:   mov    %rcx,(%rbx)
 > >  >
 > >  > which is enqueue_task() [line 327] as inlined in erts_port_task_schedule()
 > >  > [line 615].  At this point, %rax is zero according to gdb's registers dump.
 > >  >
 > >  > The relevant part of erts_port_task_schedule() is:
 > >  >
 > >  > ==snip==
 > >  >     if (!pp->sched.taskq)
 > >  >         pp->sched.taskq = port_taskq_init(port_taskq_alloc(), pp);
 > >  >
 > >  >     ASSERT(ptp);
 > >  >
 > >  >     ptp->type = type;
 > >  >     ptp->event = event;
 > >  >     ptp->event_data = event_data;
 > >  >
 > >  >     set_handle(ptp, pthp);
 > >  >
 > >  >     switch (type) {
 > >  >     case ERTS_PORT_TASK_FREE:
 > >  >         erl_exit(ERTS_ABORT_EXIT,
 > >  >                  "erts_port_task_schedule(): Cannot schedule free task\n");
 > >  >         break;
 > >  >     case ERTS_PORT_TASK_INPUT:
 > >  >     case ERTS_PORT_TASK_OUTPUT:
 > >  >     case ERTS_PORT_TASK_EVENT:
 > >  >         erts_smp_atomic_inc_relb(&erts_port_task_outstanding_io_tasks);
 > >  >         /* Fall through... */
 > >  >     default:
 > >  >         enqueue_task(pp->sched.taskq, ptp);
 > >  >         break;
 > >  >     }
 > >  > ==snip==
 > >  >
 > >  > The SEGV implies that pp->sched.taskq is NULL at the call to enqueue_task().
 > >  >
 > >  > The erts_smp_atomic_inc_relb() and set_handle() calls do not affect *pp,
 > >  > and I don't see any aliasing between *ptp and *pp, so the assignments to
 > >  > *ptp do not affect *pp either.
 > >  >
 > >  > So for pp->sched.taskq to be NULL at the bottom it would have to be NULL
 > >  > after the call to port_taskq_init(), which implies that port_taskq_alloc()
 > >  > returned NULL.
 > >  >
 > >  > port_taskq_alloc() is generated via ERTS_SCHED_PREF_QUICK_ALLOC_IMPL;
 > >  > if one expands that it becomes:
 > >  >
 > >  > void erts_alloc_n_enomem(ErtsAlcType_t,Uint)
 > >  >     __attribute__((noreturn));
 > >  >
 > >  > static __inline__
 > >  > void *erts_alloc(ErtsAlcType_t type, Uint size)
 > >  > {
 > >  >     void *res;
 > >  >     res = (*erts_allctrs[(((type) >> (0)) & (15))].alloc)(
 > >  >         (((type) >> (7)) & (255)),
 > >  >         erts_allctrs[(((type) >> (0)) & (15))].extra,
 > >  >         size);
 > >  >     if (!res)
 > >  >         erts_alloc_n_enomem((((type) >> (7)) & (255)), size);
 > >  >     return res;
 > >  > }
 > >  >
 > >  > static __inline__ ErtsPortTaskQueue * port_taskq_alloc(void)
 > >  > {
 > >  >     ErtsPortTaskQueue *res = port_taskq_pre_alloc();
 > >  >     if (!res)
 > >  >         res = erts_alloc((4564), sizeof(ErtsPortTaskQueue));
 > >  >     return res;
 > >  > }
 > >  >
 > >  > But given this code, I don't see how erts_alloc() or port_taskq_alloc()
 > >  > could ever return NULL.
 > >  >
 > >  > Which leads me to suspect that there's a concurrency bug that's
 > >  > causing *pp to be clobbered behind our backs.
 > >  >
 > >  > Ideas?
 > >
 > 
 > Thanks for the excellent bug-report! I've found a concurrency bug (as
 > you suspected) that is likely to have caused the crash you got.
 > 
 > The fix can be found in the rickard/port-emigrate-bug/OTP-12084 branch
 > in my github repo
 > <https://github.com/rickard-green/otp/tree/rickard/port-emigrate-bug/OTP-12084>.
 > The fix is based on the OTP_R15B03-1 tag. I've only briefly tested the
 > fix, but will test it more thoroughly. If further changes are needed
 > I'll post here again.

Thanks Rickard!  The fix looks sane enough; is it safe (but possibly
incomplete) to use right now, or do you want us to wait until you've
done more testing?

BTW, I have a debug patch in my own r15 branch that complains if it
detects a mismatch when the runq lock is re-taken, and it triggered
once this week when I ran mnesia's test suite.
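
For readers following along, a check of that kind could look roughly
like the sketch below. It is an illustration only, not the debug patch
itself: run_queue, port, relock_and_check() and the migration hook are
hypothetical stand-ins rather than ERTS types or functions, and a plain
pthread mutex is assumed in place of the real run queue lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration; these are NOT the ERTS types. */
typedef struct run_queue {
    pthread_mutex_t lock;
    int id;
} run_queue;

typedef struct port {
    run_queue *runq;    /* run queue the port currently belongs to */
} port;

/* Optional hook called while the lock is dropped. It lets a test simulate
 * another scheduler migrating the port during that window. */
typedef void (*unlocked_hook)(port *);

/* Test-only helper: destination used by simulate_migration(). */
static run_queue *migration_target;

static void simulate_migration(port *pp)
{
    pp->runq = migration_target;
}

/* Drop and re-take the lock of the run queue we believe the port is on.
 * Returns 1 (and complains on stderr) if the port no longer belongs to
 * that queue once the lock is held again, 0 otherwise.
 * The caller must hold held->lock on entry and still holds it on return. */
static int relock_and_check(port *pp, run_queue *held, unlocked_hook hook)
{
    int mismatch;

    assert(pp != NULL && held != NULL);

    pthread_mutex_unlock(&held->lock);
    if (hook)
        hook(pp);               /* window in which the port may move */
    pthread_mutex_lock(&held->lock);

    mismatch = (pp->runq != held);
    if (mismatch)
        fprintf(stderr,
                "runq mismatch: port is on runq %d, lock held for runq %d\n",
                pp->runq->id, held->id);
    return mismatch;
}
```

The point of such a check is only to make the window visible: once the
lock has been dropped, anything read from the port before the unlock
has to be treated as stale until it is re-validated.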

/Mikael


