[erlang-bugs] r15b03-1 SEGV in erts_port_task_schedule()

Fri Aug 8 15:37:12 CEST 2014

On 08/08/2014 01:14 PM, Mikael Pettersson wrote:
> Rickard Green writes:
>   > On Tue, Jul 29, 2014 at 4:30 PM, Mikael Pettersson <mikpelinux@REDACTED> wrote:
>   > > Mikael Pettersson writes:
>   > >  > This is a followup to my previous report in
>   > >  > <http://erlang.org/pipermail/erlang-bugs/2014-June/004451.html>,
>   > >  > but it's for a different function in erl_port_task.c.
>   > >  >
>   > >  > We've gotten a new SEGV with r15b03-1.  This time we managed to
>   > >  > capture a truncated core dump (just threads list and registers,
>   > >  > no thread stacks or heap memory):
>   > >  >
>   > >  > Program terminated with signal 11, Segmentation fault.
>   > >  > #0  enqueue_task (ptp=<optimized out>,
>   > >  >     ptqp=<error reading variable: Cannot access memory at address 0x7f8f02a95d08>)
>   > >  >     at beam/erl_port_task.c:327
>   > >  > 327         ptp->prev = ptqp->last;
>   > >  > (gdb) bt
>   > >  > #0  enqueue_task (ptp=<optimized out>,
>   > >  >     ptqp=<error reading variable: Cannot access memory at address 0x7f8f02a95d08>)
>   > >  >     at beam/erl_port_task.c:327
>   > >  > #1  erts_port_task_schedule (id=<optimized out>,
>   > >  >     id@REDACTED=<error reading variable: Cannot access memory at address 0x7f8efdeb8318>,
>   > >  >     pthp=<error reading variable: Cannot access memory at address 0x7f8efdeb82c0>,
>   > >  >     type=<error reading variable: Cannot access memory at address 0x7f8efdeb82cc>,
>   > >  >     event=<error reading variable: Cannot access memory at address 0x7f8efdeb82d0>,
>   > >  >     event_data=<error reading variable: Cannot access memory at address 0x7f8efdeb82d8>)
>   > >  >     at beam/erl_port_task.c:615
>   > >  > (gdb)
>   > >  >
>   > >  > The code that faulted is
>   > >  >
>   > >  >    0x00000000004b8203 <+419>:   mov    0x10(%r15),%rax
>   > >  >    0x00000000004b8207 <+423>:   mov    0x10(%rsp),%rbx
>   > >  >    0x00000000004b820c <+428>:   movq   $0x0,0x8(%rbx)
>   > >  > => 0x00000000004b8214 <+436>:   mov    0x8(%rax),%rcx
>   > >  >    0x00000000004b8218 <+440>:   mov    %rax,0x10(%rbx)
>   > >  >    0x00000000004b821c <+444>:   mov    %rcx,(%rbx)
>   > >  >
>   > >  > which is enqueue_task() [line 327] as inlined in erts_port_task_schedule()
>   > >  > [line 615].  At this point, %rax is zero according to gdb's registers dump.
>   > >  >
>   > >  > The relevant part of erts_port_task_schedule() is:
>   > >  >
>   > >  > ==snip==
>   > >  >     if (!pp->sched.taskq)
>   > >  >         pp->sched.taskq = port_taskq_init(port_taskq_alloc(), pp);
>   > >  >
>   > >  >     ASSERT(ptp);
>   > >  >
>   > >  >     ptp->type = type;
>   > >  >     ptp->event = event;
>   > >  >     ptp->event_data = event_data;
>   > >  >
>   > >  >     set_handle(ptp, pthp);
>   > >  >
>   > >  >     switch (type) {
>   > >  >     case ERTS_PORT_TASK_FREE:
>   > >  >         erl_exit(ERTS_ABORT_EXIT,
>   > >  >                  "erts_port_task_schedule(): Cannot schedule free task\n");
>   > >  >         break;
>   > >  >     case ERTS_PORT_TASK_INPUT:
>   > >  >     case ERTS_PORT_TASK_OUTPUT:
>   > >  >     case ERTS_PORT_TASK_EVENT:
>   > >  >         erts_smp_atomic_inc_relb(&erts_port_task_outstanding_io_tasks);
>   > >  >         /* Fall through... */
>   > >  >     default:
>   > >  >         enqueue_task(pp->sched.taskq, ptp);
>   > >  >         break;
>   > >  >     }
>   > >  > ==snip==
>   > >  >
>   > >  > The SEGV implies that pp->sched.taskq is NULL at the call to enqueue_task().
>   > >  >
>   > >  > The erts_smp_atomic_inc_relb() and set_handle() calls do not affect *pp,
>   > >  > and I don't see any aliasing between *ptp and *pp, so the assignments to
>   > >  > *ptp do not affect *pp either.
>   > >  >
>   > >  > So for pp->sched.taskq to be NULL at the bottom it would have to be NULL
>   > >  > after the call to port_taskq_init(), which implies that port_taskq_alloc()
>   > >  > returned NULL.
>   > >  >
>   > >  > port_taskq_alloc() is generated via ERTS_SCHED_PREF_QUICK_ALLOC_IMPL;
>   > >  > if one expands that it becomes:
>   > >  >
>   > >  > void erts_alloc_n_enomem(ErtsAlcType_t,Uint)
>   > >  >      __attribute__((noreturn));
>   > >  >
>   > >  > static __inline__
>   > >  > void *erts_alloc(ErtsAlcType_t type, Uint size)
>   > >  > {
>   > >  >     void *res;
>   > >  >     res = (*erts_allctrs[(((type) >> (0)) & (15))].alloc)(
>   > >  >         (((type) >> (7)) & (255)),
>   > >  >         erts_allctrs[(((type) >> (0)) & (15))].extra,
>   > >  >         size);
>   > >  >     if (!res)
>   > >  >         erts_alloc_n_enomem((((type) >> (7)) & (255)), size);
>   > >  >     return res;
>   > >  > }
>   > >  >
>   > >  > static __inline__ ErtsPortTaskQueue * port_taskq_alloc(void)
>   > >  > {
>   > >  >     ErtsPortTaskQueue *res = port_taskq_pre_alloc();
>   > >  >     if (!res)
>   > >  >         res = erts_alloc((4564), sizeof(ErtsPortTaskQueue));
>   > >  >     return res;
>   > >  > }
>   > >  >
>   > >  > But given this code, I don't see how erts_alloc() or port_taskq_alloc()
>   > >  > could ever return NULL.
>   > >  >
>   > >  > Which leads me to suspect that there's a concurrency bug that's
>   > >  > causing *pp to be clobbered behind our backs.
>   > >  >
>   > >  > Ideas?
>   > >
>   >
>   > Thanks for the excellent bug report! I've found a concurrency bug (as
>   > you suspected) that is likely to have caused the crash you got.
>   >
>   > The fix can be found in the rickard/port-emigrate-bug/OTP-12084 branch
>   > in my GitHub repo
>   > <https://github.com/rickard-green/otp/tree/rickard/port-emigrate-bug/OTP-12084>.
>   > The fix is based on the OTP_R15B03-1 tag. I've only briefly tested the
>   > fix, but will test it more thoroughly. If further changes are needed
>   > I'll post here again.
>
> Thanks Rickard!  The fix looks sane enough; is it safe (but possibly
> incomplete) to use right now, or do you want us to wait until you've
> done more testing?
>

It is safe to use.

> BTW, I have a debug patch in my own r15 branch which complains if it
> detects a mismatch when the runq lock is re-taken, and it triggered
> once this week when I ran mnesia's test suite.
>

I'll do the same test. Please let me know if it triggers for you
with the port-emigrate-bug branch.
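
For readers who want to picture the kind of check discussed above, here
is a minimal sketch in C. It assumes the mismatch in question is between
the run queue remembered before the lock was dropped and the run queue
the port belongs to once the lock is re-taken; the thread does not spell
this out. All names (RunQueue, Port, relock_and_check, and so on) are
hypothetical and are not the ERTS ones, and the "lock" is a plain flag so
that the interleaving can be scripted in a single thread.

```c
#include <stddef.h>

/* Hypothetical single-threaded model, not the ERTS data structures. */
typedef struct RunQueue { int locked; } RunQueue;
typedef struct Port { RunQueue *runq; } Port;

static void runq_lock(RunQueue *rq)   { rq->locked = 1; }
static void runq_unlock(RunQueue *rq) { rq->locked = 0; }

/*
 * Drop the lock on the port's current run queue, run some work while
 * unlocked, then re-take the lock on the queue remembered beforehand.
 * Returns 1 if the port still belongs to that queue, 0 if it was moved
 * to another queue in the meantime (the mismatch case).
 */
static int relock_and_check(Port *pp, void (*unlocked_work)(Port *))
{
    RunQueue *before = pp->runq;  /* caller is assumed to hold this lock */

    runq_unlock(before);
    unlocked_work(pp);            /* stands in for another scheduler acting */
    runq_lock(before);

    return pp->runq == before;
}

/* Two scripted interleavings for the unlocked window. */
static RunQueue other_runq;
static void no_migration(Port *pp) { (void)pp; }
static void migrate_port(Port *pp) { pp->runq = &other_runq; }
```

Calling relock_and_check(&port, migrate_port) returns 0, which is the
situation a debug check of this kind would complain about; with
no_migration it returns 1.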

Regards,
Rickard

> /Mikael
>

-- 
Rickard Green, Erlang/OTP, Ericsson AB.
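
The crash analysis quoted above rests on reading field offsets out of
the disassembly: the faulting load is 0x8(%rax) with %rax zero, that is,
ptqp->last with ptqp == NULL. A minimal sketch of that reasoning in C
follows. The struct and function names are hypothetical stand-ins, the
field order is only what the disassembly implies on x86-64, and the NULL
check is added purely for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-ins; only the field order implied by the
 * disassembly is modelled. Offsets in comments assume x86-64. */
typedef struct PortTask PortTask;
typedef struct PortTaskQueue PortTaskQueue;

struct PortTaskQueue {
    PortTask *first;       /* offset 0x00 */
    PortTask *last;        /* offset 0x08: mov 0x8(%rax),%rcx (faulted) */
};

struct PortTask {
    PortTask *prev;        /* offset 0x00: mov %rcx,(%rbx)     */
    PortTask *next;        /* offset 0x08: movq $0x0,0x8(%rbx) */
    PortTaskQueue *queue;  /* offset 0x10: mov %rax,0x10(%rbx) */
};

/*
 * Append ptp to the tail of ptqp. Returns 0 on success, or -1 when the
 * queue pointer is NULL, which is the case that faulted in the report.
 */
static int enqueue_task_checked(PortTaskQueue *ptqp, PortTask *ptp)
{
    if (ptqp == NULL)
        return -1;

    ptp->next = NULL;
    ptp->prev = ptqp->last;   /* the statement at line 327 in the report */
    ptp->queue = ptqp;

    if (ptqp->last != NULL)
        ptqp->last->next = ptp;
    else
        ptqp->first = ptp;
    ptqp->last = ptp;
    return 0;
}
```

With a NULL queue the unchecked version reads from address 8, which
matches the register dump; the checked version returns -1 instead.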