[erlang-bugs] r15b03-1 SEGV in erts_port_task_execute()

Mikael Pettersson mikpelinux@REDACTED
Wed Jun 18 12:38:33 CEST 2014


One of our nodes running OTP r15b03-1 segfaulted yesterday evening.
Unfortunately it didn't produce a usable core dump (configuration
problem, sigh), but the kernel logged the address of the instruction
that faulted and the address it tried to access.  The segfault
turned out to be in erts_port_task_execute:

int
erts_port_task_execute(ErtsRunQueue *runq, Port **curr_port_pp)
{
    int port_was_enqueued = 0;
    Port *pp;
    ErtsPortTaskQueue *ptqp;
    ErtsPortTask *ptp;
    int res = 0;
    int reds = ERTS_PORT_REDS_EXECUTE;
    erts_aint_t io_tasks_executed = 0;
    int fpe_was_unmasked;
    Uint64 start_time = 0;

    ERTS_SMP_LC_ASSERT(erts_smp_lc_runq_is_locked(runq));

    ERTS_PT_CHK_PORTQ(runq);

    pp = pop_port(runq);
    if (!pp) {
	res = 0;
	goto done;
    }

    ERTS_PORT_NOT_IN_RUNQ(pp);

    *curr_port_pp = pp;

    ASSERT(pp->sched.taskq);
    ASSERT(pp->sched.taskq->first);
    ptqp = pp->sched.taskq;
    pp->sched.taskq = NULL;

    ASSERT(!pp->sched.exe_taskq);
    pp->sched.exe_taskq = ptqp;

    if (erts_smp_port_trylock(pp) == EBUSY) {
	erts_smp_runq_unlock(runq);
	erts_smp_port_lock(pp);
	erts_smp_runq_lock(runq);
    }
    
    if (erts_sched_stat.enabled) {
	ErtsSchedulerData *esdp = erts_get_scheduler_data();
	Uint old = ERTS_PORT_SCHED_ID(pp, esdp->no);
	int migrated = old && old != esdp->no;

	erts_smp_spin_lock(&erts_sched_stat.lock);
	erts_sched_stat.prio[ERTS_PORT_PRIO_LEVEL].total_executed++;
	erts_sched_stat.prio[ERTS_PORT_PRIO_LEVEL].executed++;
	if (migrated) {
	    erts_sched_stat.prio[ERTS_PORT_PRIO_LEVEL].total_migrated++;
	    erts_sched_stat.prio[ERTS_PORT_PRIO_LEVEL].migrated++;
	}
	erts_smp_spin_unlock(&erts_sched_stat.lock);
    }

    /* trace port scheduling, in */
    if (IS_TRACED_FL(pp, F_TRACE_SCHED_PORTS)) {
	trace_sched_ports(pp, am_in);
    }

    ERTS_SMP_LC_ASSERT(erts_lc_is_port_locked(pp));

    ERTS_PT_CHK_PRES_PORTQ(runq, pp);
    ptp = pop_task(ptqp);

At this point ptqp is NULL, so the first load in pop_task(), which
dereferences ptqp, faults.  This is not a debug build, so the ASSERTs
above (in particular ASSERT(pp->sched.taskq)) were compiled out and
did not catch this condition.
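
To illustrate the failure mode, here is a minimal standalone sketch.  It is
not the ERTS code: Task, TaskQueue, pop_task_unchecked() and
pop_task_checked() are hypothetical stand-ins, and the body of the real
pop_task() is assumed only to begin by reading a field through the queue
pointer.  The ASSERT macro here expands to nothing unless DEBUG is defined,
which mirrors why the checks above were no help in a non-debug build.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for ErtsPortTask and ErtsPortTaskQueue. */
typedef struct Task {
    struct Task *next;
    int id;
} Task;

typedef struct TaskQueue {
    Task *first;
    Task *last;
} TaskQueue;

/* Checks only exist in debug builds; otherwise they compile to nothing. */
#ifdef DEBUG
#  define ASSERT(e) assert(e)
#else
#  define ASSERT(e) ((void) 0)
#endif

/* Unguarded pop: the first load is q->first, so q == NULL faults here
 * whenever ASSERT is compiled out. */
static Task *pop_task_unchecked(TaskQueue *q)
{
    Task *t;

    ASSERT(q);
    t = q->first;
    if (t) {
        q->first = t->next;
        if (!q->first)
            q->last = NULL;
        t->next = NULL;
    }
    return t;
}

/* Guarded variant: treats a missing queue as "no task" in every build. */
static Task *pop_task_checked(TaskQueue *q)
{
    if (!q)
        return NULL;
    return pop_task_unchecked(q);
}
```

A guard like this would only hide the symptom.  The open question is still
how pp->sched.taskq came to be NULL for a port that was in the run queue.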

I don't know if this is repeatable; we've never seen it before.
The machine was doing a lot of port I/O at the time (generating
pdf report files).

This is mostly an FYI at this point.  If someone recognizes the
problem and can point to a fix in a later release, that'd be great.

/Mikael


