[erlang-bugs] r15b03-1 SEGV in erts_port_task_schedule()
Rickard Green
rickard@REDACTED
Fri Aug 8 15:37:12 CEST 2014
On 08/08/2014 01:14 PM, Mikael Pettersson wrote:
> Rickard Green writes:
> > On Tue, Jul 29, 2014 at 4:30 PM, Mikael Pettersson <mikpelinux@REDACTED> wrote:
> > > Mikael Pettersson writes:
> > > > This is a followup to my previous report in
> > > > <http://erlang.org/pipermail/erlang-bugs/2014-June/004451.html>,
> > > > but it's for a different function in erl_port_task.c.
> > > >
> > > > We've gotten a new SEGV with r15b03-1. This time we managed to
> > > > capture a truncated core dump (just threads list and registers,
> > > > no thread stacks or heap memory):
> > > >
> > > > Program terminated with signal 11, Segmentation fault.
> > > > #0  enqueue_task (ptp=<optimized out>,
> > > >     ptqp=<error reading variable: Cannot access memory at address 0x7f8f02a95d08>)
> > > >     at beam/erl_port_task.c:327
> > > > 327         ptp->prev = ptqp->last;
> > > > (gdb) bt
> > > > #0  enqueue_task (ptp=<optimized out>,
> > > >     ptqp=<error reading variable: Cannot access memory at address 0x7f8f02a95d08>)
> > > >     at beam/erl_port_task.c:327
> > > > #1  erts_port_task_schedule (id=<optimized out>,
> > > >     id@REDACTED=<error reading variable: Cannot access memory at address 0x7f8efdeb8318>,
> > > >     pthp=<error reading variable: Cannot access memory at address 0x7f8efdeb82c0>,
> > > >     type=<error reading variable: Cannot access memory at address 0x7f8efdeb82cc>,
> > > >     event=<error reading variable: Cannot access memory at address 0x7f8efdeb82d0>,
> > > >     event_data=<error reading variable: Cannot access memory at address 0x7f8efdeb82d8>)
> > > >     at beam/erl_port_task.c:615
> > > > (gdb)
> > > >
> > > > The code that faulted is
> > > >
> > > >    0x00000000004b8203 <+419>: mov 0x10(%r15),%rax
> > > >    0x00000000004b8207 <+423>: mov 0x10(%rsp),%rbx
> > > >    0x00000000004b820c <+428>: movq $0x0,0x8(%rbx)
> > > > => 0x00000000004b8214 <+436>: mov 0x8(%rax),%rcx
> > > >    0x00000000004b8218 <+440>: mov %rax,0x10(%rbx)
> > > >    0x00000000004b821c <+444>: mov %rcx,(%rbx)
> > > >
> > > > which is enqueue_task() [line 327] as inlined in erts_port_task_schedule()
> > > > [line 615]. At this point, %rax is zero according to gdb's registers dump.
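
The faulting offset fits a NULL queue pointer: if `last` is the second
pointer-sized field of the queue, reading `ptqp->last` through a NULL
`ptqp` touches address 0x8. A minimal sketch of that reading follows. The
struct layout is an assumption inferred from the offsets in the
disassembly on x86-64, not copied from erl_port_task.c, and the names
`task_t`, `task_queue_t` and `enqueue` are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins. The field order is an assumption inferred
 * from the disassembly offsets (x86-64), not taken from ERTS sources. */
typedef struct task task_t;
typedef struct task_queue task_queue_t;

struct task {
    task_t *prev;        /* 0x00: presumably the target of mov %rcx,(%rbx) */
    task_t *next;        /* 0x08: presumably cleared by movq $0x0,0x8(%rbx) */
    task_queue_t *queue; /* 0x10: presumably set by mov %rax,0x10(%rbx) */
};

struct task_queue {
    task_t *first;       /* 0x00 */
    task_t *last;        /* 0x08: read by the faulting mov 0x8(%rax),%rcx */
};

/* Append t at the tail of q. q is dereferenced unconditionally, so a
 * NULL q faults on the q->last load, as in the report. */
static void enqueue(task_queue_t *q, task_t *t)
{
    t->next = NULL;
    t->prev = q->last;
    t->queue = q;
    if (q->last)
        q->last->next = t;
    else
        q->first = t;
    q->last = t;
}
```

With this layout, `offsetof(struct task_queue, last)` equals
`sizeof(void *)`, which is 0x8 on x86-64 and matches the displacement in
the faulting instruction.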
> > > >
> > > > The relevant part of erts_port_task_schedule() is:
> > > >
> > > > ==snip==
> > > >     if (!pp->sched.taskq)
> > > >         pp->sched.taskq = port_taskq_init(port_taskq_alloc(), pp);
> > > >
> > > >     ASSERT(ptp);
> > > >
> > > >     ptp->type = type;
> > > >     ptp->event = event;
> > > >     ptp->event_data = event_data;
> > > >
> > > >     set_handle(ptp, pthp);
> > > >
> > > >     switch (type) {
> > > >     case ERTS_PORT_TASK_FREE:
> > > >         erl_exit(ERTS_ABORT_EXIT,
> > > >                  "erts_port_task_schedule(): Cannot schedule free task\n");
> > > >         break;
> > > >     case ERTS_PORT_TASK_INPUT:
> > > >     case ERTS_PORT_TASK_OUTPUT:
> > > >     case ERTS_PORT_TASK_EVENT:
> > > >         erts_smp_atomic_inc_relb(&erts_port_task_outstanding_io_tasks);
> > > >         /* Fall through... */
> > > >     default:
> > > >         enqueue_task(pp->sched.taskq, ptp);
> > > >         break;
> > > >     }
> > > > ==snip==
> > > >
> > > > The SEGV implies that pp->sched.taskq is NULL at the call to enqueue_task().
> > > >
> > > > The erts_smp_atomic_inc_relb() and set_handle() calls do not affect *pp,
> > > > and I don't see any aliasing between *ptp and *pp, so the assignments to
> > > > *ptp do not affect *pp either.
> > > >
> > > > So for pp->sched.taskq to be NULL at the bottom it would have to be NULL
> > > > after the call to port_taskq_init(), which implies that port_taskq_alloc()
> > > > returned NULL.
> > > >
> > > > port_taskq_alloc() is generated via ERTS_SCHED_PREF_QUICK_ALLOC_IMPL;
> > > > if one expands that it becomes:
> > > >
> > > > void erts_alloc_n_enomem(ErtsAlcType_t,Uint)
> > > >     __attribute__((noreturn));
> > > >
> > > > static __inline__
> > > > void *erts_alloc(ErtsAlcType_t type, Uint size)
> > > > {
> > > >     void *res;
> > > >     res = (*erts_allctrs[(((type) >> (0)) & (15))].alloc)(
> > > >         (((type) >> (7)) & (255)),
> > > >         erts_allctrs[(((type) >> (0)) & (15))].extra,
> > > >         size);
> > > >     if (!res)
> > > >         erts_alloc_n_enomem((((type) >> (7)) & (255)), size);
> > > >     return res;
> > > > }
> > > >
> > > > static __inline__ ErtsPortTaskQueue * port_taskq_alloc(void)
> > > > {
> > > >     ErtsPortTaskQueue *res = port_taskq_pre_alloc();
> > > >     if (!res)
> > > >         res = erts_alloc((4564), sizeof(ErtsPortTaskQueue));
> > > >     return res;
> > > > }
> > > >
> > > > But given this code, I don't see how erts_alloc() or port_taskq_alloc()
> > > > could ever return NULL.
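
The "cannot return NULL" argument rests on the out-of-memory handler
being declared noreturn: the wrapper either hands back usable memory or
never returns. A minimal standalone sketch of that pattern follows. All
names here (`alloc_enomem`, `checked_alloc`, `pool_alloc`, `queue_alloc`)
are hypothetical, and `malloc` stands in for the ERTS allocator; it shows
the shape of the expanded code above, not the actual implementation.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical out-of-memory handler: reports and aborts, never returns.
 * The attribute is the GCC/Clang spelling also used in the quoted code. */
static void alloc_enomem(size_t size) __attribute__((noreturn));

static void alloc_enomem(size_t size)
{
    fprintf(stderr, "cannot allocate %zu bytes\n", size);
    abort();
}

/* Wrapper in the style of erts_alloc(): for size > 0 it either returns
 * a non-NULL pointer or does not return at all. */
static void *checked_alloc(size_t size)
{
    void *res = malloc(size);
    if (!res)
        alloc_enomem(size);
    return res;
}

/* Hypothetical fast path standing in for a preallocated pool; here it
 * always reports exhaustion so the fallback is exercised. */
static void *pool_alloc(void)
{
    return NULL;
}

/* Same shape as port_taskq_alloc(): try the pool, then fall back to the
 * checked allocator, so callers never observe NULL. */
static void *queue_alloc(size_t size)
{
    void *res = pool_alloc();
    if (!res)
        res = checked_alloc(size);
    return res;
}
```

Under these assumptions a caller of `queue_alloc()` with a nonzero size
can only see a NULL queue pointer if something else overwrites the
pointer afterwards.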
> > > >
> > > > Which leads me to suspect that there's a concurrency bug that's
> > > > causing *pp to be clobbered behind our backs.
> > > >
> > > > Ideas?
> > >
> >
> > Thanks for the excellent bug-report! I've found a concurrency bug (as
> > you suspected) that is likely to have caused the crash you got.
> >
> > The fix can be found in the rickard/port-emigrate-bug/OTP-12084 branch
> > in my github repo
> > <https://github.com/rickard-green/otp/tree/rickard/port-emigrate-bug/OTP-12084>.
> > The fix is based on the OTP_R15B03-1 tag. I've only briefly tested the
> > fix, but will test it more thoroughly. If further changes are needed
> > I'll post here again.
>
> Thanks Rickard! The fix looks sane enough; is it safe (but possibly
> incomplete) to use right now, or do you want us to wait until you've
> done more testing?
>

It is safe to use.

> BTW, I have a debug patch in my own r15 branch which complains if it
> detects a mis-match when the runq lock is re-taken, and it triggered
> once this week when I ran mnesia's test suite.
>

I'll do the same test. Please let me know if it should trigger for you
with the port-emigrate-bug branch.

Regards,
Rickard

> /Mikael
>
--
Rickard Green, Erlang/OTP, Ericsson AB.