[erlang-bugs] erts_port[].drv_ptr == 0, when erts_port[].status not free

Paul Fisher pfisher@REDACTED
Wed Jul 2 02:24:05 CEST 2008


We have a system where we run lots of linked-in driver ports that get
created/used/closed frequently and sometimes very quickly.  Today, when
several open_port/2, port_command/2 and port_close/1 cycles happened in
rapid succession, a SIGSEGV occurred in erl_bif_ddll.c:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1125235040 (LWP 12087)]
0x0000000000449712 in erl_ddll_try_unload_2 (p=0x2aaaab11fc90,
    name_term=659339, options=46912503328425) at beam/erl_bif_ddll.c:592

The emulator was run on a Q6600 (quad-core, 2.4 GHz) and started with
+A 8, and the linked-in driver executes the bulk of its work with
driver_async().  There were continuously 8 driver cycles running for
5-10 seconds before the segfault occurred.

(gdb) where
#0  0x0000000000449712 in erl_ddll_try_unload_2 (p=0x2aaaab11fc90,
    name_term=659339, options=46912503328425) at beam/erl_bif_ddll.c:592
#1  0x000000000052337f in process_main () at beam/beam_emu.c:2073
#2  0x000000000049c213 in sched_thread_func (vesdp=0x2ae18cb74f98)
    at beam/erl_process.c:741
#3  0x00000000005b6818 in thr_wrapper (vtwd=0x7fff1eb77de0)
    at common/ethread.c:474
#4  0x00002ae18c530f1a in start_thread () from /lib/libpthread.so.0
#5  0x00002ae18c8135d2 in clone () from /lib/libc.so.6
#6  0x0000000000000000 in ?? ()

So the code at the point of the SIGSEGV @ erl_bif_ddll.c:592 says:

        for (j = 0; j < erts_max_ports; j++) {
=>          if (!(erts_port[j].status &  FREE_PORT_FLAGS)
                && erts_port[j].drv_ptr->handle == dh) {

It appears that the code assumes that any erts_port array entry examined
during the search has a valid (non-zero) drv_ptr value whenever the entry
is not marked as free.  At the time of the crash, this is clearly not the
case:

(gdb) p j
$8 = 896

(gdb) p erts_port[j]
$7 = {sched = {next = 0x0, prev = 0x0, taskq = 0x0, exe_taskq = 0x0},
  timeout_task = {counter = 0}, refc = {counter = 2}, lock = 0x81b3c8,
  xports = 0x0, id = 14343, connected = 0, caller = 0, data = 0, bp = 0x0,
  nlinks = 0x0, monitors = 0x0, bytes_in = 0, bytes_out = 0, ptimer = 0x0,
  tracer_proc = 18446744073709551611, trace_flags = 0, ioq = {size = 0,
    v_start = 0x0, v_end = 0x0, v_head = 0x0, v_tail = 0x0, v_small = {{
        iov_base = 0x0, iov_len = 0}, {iov_base = 0x0, iov_len = 0}, {
        iov_base = 0x0, iov_len = 0}, {iov_base = 0x0, iov_len = 0}, {
        iov_base = 0x0, iov_len = 0}}, b_start = 0x0, b_end = 0x0,
    b_head = 0x0, b_tail = 0x0, b_small = {0x0, 0x0, 0x0, 0x0, 0x0}},
  dist_entry = 0x0, name = 0x0, drv_ptr = 0x0, drv_data = 0, suspended = 0x0,
  linebuf = 0x0, status = 4096, control_flags = 0, reg = 0x0,
  port_data_lock = 0x0}

(gdb) p erts_port[j].drv_ptr
$6 = (ErlDrvEntry *) 0x0
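
To make the failing assumption concrete, here is a minimal, self-contained
sketch of the same kind of scan with an explicit NULL check added before the
dereference.  The types Port and DrvEntry, the FREE_FLAGS value and the
function count_ports_using() are hypothetical stand-ins for illustration,
not the emulator's real definitions.  A guard like this would only avoid the
crash; it says nothing about whether a non-free entry with drv_ptr == 0 is a
legitimate state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the emulator's structures. */
typedef struct { void *handle; } DrvEntry;
typedef struct { unsigned status; DrvEntry *drv_ptr; } Port;

/* Hypothetical flag value; NOT the real FREE_PORT_FLAGS definition. */
#define FREE_FLAGS 0x1u

/* Count the non-free ports whose driver was loaded from handle dh.
 * Entries whose drv_ptr is NULL are skipped instead of dereferenced. */
static int count_ports_using(const Port *ports, int n, const void *dh)
{
    int j, count = 0;

    assert(ports != NULL || n == 0);
    for (j = 0; j < n; j++) {
        if (!(ports[j].status & FREE_FLAGS)
            && ports[j].drv_ptr != NULL
            && ports[j].drv_ptr->handle == dh) {
            count++;
        }
    }
    return count;
}
```

With an entry shaped like the one in the dump (status = 4096, drv_ptr = 0x0),
this version skips the entry rather than faulting on drv_ptr->handle.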


So the real questions are: 1) whether the assumption built into this
code is correct; and 2) if so, how we got into the position of violating
it.  I'd appreciate some insight into what could be going on here, and
where I should start looking.


-- 
paul



