Emulator segfault after net_kernel attempts to contact a down node

Jesse Stimpson jstimpson@REDACTED
Tue Apr 28 16:07:45 CEST 2020


Hello,

We're running a cluster of Erlang/OTP 21.2.4 nodes in `-sname` distributed
mode with `-proto_dist inet_tls` on Ubuntu 16.04, and we have evidence of a
segfault in ERTS. Here are the log lines printed just before the segfault:

2020-04-28 00:35:47.086 [error] emulator Garbage collecting distribution entry for node 'iswitch@REDACTED' in state: pending connect
2020-04-28 00:35:47.096 [error] <0.67.0> gen_server net_kernel terminated with reason: bad return value: {'EXIT',{badarg,[{erts_internal,abort_connection,['iswitch@REDACTED',{54554,#Ref<0.0.2304.1>}],[]},{net_kernel,pending_nodedown,4,[{file,"net_kernel.erl"},{line,999}]},{net_kernel,conn_own_exit,3,[{file,"net_kernel.erl"},{line,926}]},{net_kernel,do_handle_exit,3,[{file,"net_kernel.erl"},{line,894}]},{net_kernel,handle_exit,3,[{file,"net_kernel.erl"},{line,889}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,637}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,...}]},...]}}

--

And from syslog:

Apr 28 00:37:50 isw-proxy-x-pro-awsoh01 systemd[1]: iroute.service: Main process exited, code=killed, status=11/SEGV

--

We also captured the backtrace from the core dump:

(gdb) bt
#0  rbt_delete (root=root@REDACTED=0x7fdc45ac00e8, del=<optimized out>) at beam/erl_ao_firstfit_alloc.c:710
#1  0x00000000005f339e in aoff_unlink_free_block (allctr=<optimized out>, blk=<optimized out>) at beam/erl_ao_firstfit_alloc.c:548
#2  0x000000000049d8e1 in mbc_free (allctr=0xc6cf40, p=<optimized out>, busy_pcrr_pp=0x7fdc4743eae0) at beam/erl_alloc_util.c:2549
#3  0x000000000049e23f in dealloc_block (allctr=allctr@REDACTED=0xc6cf40, ptr=ptr@REDACTED=0x7fdc45aff0f8, fix=fix@REDACTED=0x0, dec_cc_on_redirect=dec_cc_on_redirect@REDACTED=1) at beam/erl_alloc_util.c:2325
#4  0x00000000004a17f0 in dealloc_block (fix=0x0, dec_cc_on_redirect=1, ptr=0x7fdc45aff0f8, allctr=0xc6cf40) at beam/erl_alloc_util.c:2310
#5  handle_delayed_dealloc (need_more_work=<optimized out>, thr_prgr_p=<optimized out>, need_thr_progress=0x7fdc4743ebd8, ops_limit=20, use_limit=<optimized out>, allctr_locked=0, allctr=0xc6cf40) at beam/erl_alloc_util.c:2178
#6  erts_alcu_check_delayed_dealloc (allctr=0xc6cf40, limit=limit@REDACTED=1, need_thr_progress=need_thr_progress@REDACTED=0x7fdc4743ebd8, thr_prgr_p=thr_prgr_p@REDACTED=0x7fdc4743ebe0, more_work=more_work@REDACTED=0x7fdc4743ebdc) at beam/erl_alloc_util.c:2280
#7  0x0000000000490323 in erts_alloc_scheduler_handle_delayed_dealloc (vesdp=0x7fdcc7dc2f40, need_thr_progress=need_thr_progress@REDACTED=0x7fdc4743ebd8, thr_prgr_p=thr_prgr_p@REDACTED=0x7fdc4743ebe0, more_work=more_work@REDACTED=0x7fdc4743ebdc) at beam/erl_alloc.c:1895
#8  0x00000000004625f2 in handle_delayed_dealloc_thr_prgr (waiting=0, aux_work=1061, awdp=0x7fdcc7dc3058) at beam/erl_process.c:2100
#9  handle_aux_work (awdp=awdp@REDACTED=0x7fdcc7dc3058, orig_aux_work=orig_aux_work@REDACTED=1061, waiting=waiting@REDACTED=0) at beam/erl_process.c:2595
#10 0x000000000046050c in erts_schedule () at beam/erl_process.c:9457
#11 0x0000000000451ec0 in process_main () at beam/beam_emu.c:690
#12 0x000000000044df25 in sched_thread_func (vesdp=0x7fdcc7dc2f40) at beam/erl_process.c:8462
#13 0x0000000000682e79 in thr_wrapper (vtwd=0x7ffd0a2b5ff0) at pthread/ethread.c:118
#14 0x00007fdd0abe46ba in start_thread (arg=0x7fdc4743f700) at pthread_create.c:333
#15 0x00007fdd0a71241d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

--

The node printed in the Erlang log, 'iswitch@REDACTED', is indeed down,
and it remains down for long periods of time. However, the node that
crashed continuously attempts to contact it via rpc:cast so that it can
detect when the node comes back up.
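
For illustration only, the polling pattern is roughly like the sketch
below. The module name, the polling interval, and the function being cast
are hypothetical placeholders, not our actual code; the only detail taken
from our setup is the repeated rpc:cast to a node that is usually down.

```erlang
%% Hypothetical sketch of the polling pattern described above.
%% Module name, interval, and the cast target are assumptions.
-module(node_poller).
-export([start/2, loop/2]).

%% Spawn a process that pings Node every IntervalMs milliseconds.
start(Node, IntervalMs) ->
    spawn(?MODULE, loop, [Node, IntervalMs]).

loop(Node, IntervalMs) ->
    %% rpc:cast/4 is fire-and-forget and always returns true. When Node
    %% is not connected, the send normally makes the runtime attempt a
    %% new distribution connection (assuming auto-connect is enabled).
    true = rpc:cast(Node, erlang, node, []),
    timer:sleep(IntervalMs),
    ?MODULE:loop(Node, IntervalMs).
```

With a pattern like this, each iteration can start a fresh connection
attempt to the down node, which would match the "pending connect" state
in the log above.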

Is anyone aware of recent patches that would address this crash? Or any
pointers on where to continue our debugging?

Thanks,

Jesse Stimpson