Emulator segfault after net_kernel attempts to contact a down node

Rickard Green rickard@REDACTED
Wed Apr 29 00:09:45 CEST 2020


On Tue, Apr 28, 2020 at 4:08 PM Jesse Stimpson <jstimpson@REDACTED> wrote:

> Hello,
>
> We're running a cluster of Erlang 21.2.4 instances in `-sname` distributed
> mode, with `-proto_dist inet_tls` on Ubuntu 16.04, and have evidence of a
> segfault in erts. Here's the log printed just before the segfault:
>
> 2020-04-28 00:35:47.086 [error] emulator Garbage collecting distribution
> entry for node 'iswitch@REDACTED' in state: pending connect
> 2020-04-28 00:35:47.096 [error] <0.67.0> gen_server net_kernel terminated
> with reason: bad return value:
> {'EXIT',{badarg,
>   [{erts_internal,abort_connection,
>     ['iswitch@REDACTED',{54554,#Ref<0.0.2304.1>}],[]},
>    {net_kernel,pending_nodedown,4,[{file,"net_kernel.erl"},{line,999}]},
>    {net_kernel,conn_own_exit,3,[{file,"net_kernel.erl"},{line,926}]},
>    {net_kernel,do_handle_exit,3,[{file,"net_kernel.erl"},{line,894}]},
>    {net_kernel,handle_exit,3,[{file,"net_kernel.erl"},{line,889}]},
>    {gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,637}]},
>    {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,...}]},...]}}
>
> --
>
> And from syslog:
>
> Apr 28 00:37:50 isw-proxy-x-pro-awsoh01 systemd[1]: iroute.service: Main
> process exited, code=killed, status=11/SEGV
>
> --
>
> We also captured the backtrace from the CoreDump:
>
> (gdb) bt
> #0  rbt_delete (root=root@REDACTED=0x7fdc45ac00e8, del=<optimized out>) at
> beam/erl_ao_firstfit_alloc.c:710
> #1  0x00000000005f339e in aoff_unlink_free_block (allctr=<optimized out>,
> blk=<optimized out>) at beam/erl_ao_firstfit_alloc.c:548
> #2  0x000000000049d8e1 in mbc_free (allctr=0xc6cf40, p=<optimized out>,
> busy_pcrr_pp=0x7fdc4743eae0) at beam/erl_alloc_util.c:2549
> #3  0x000000000049e23f in dealloc_block (allctr=allctr@REDACTED=0xc6cf40,
> ptr=ptr@REDACTED=0x7fdc45aff0f8, fix=fix@REDACTED=0x0,
> dec_cc_on_redirect=dec_cc_on_redirect@REDACTED=1)
>     at beam/erl_alloc_util.c:2325
> #4  0x00000000004a17f0 in dealloc_block (fix=0x0, dec_cc_on_redirect=1,
> ptr=0x7fdc45aff0f8, allctr=0xc6cf40) at beam/erl_alloc_util.c:2310
> #5  handle_delayed_dealloc (need_more_work=<optimized out>,
> thr_prgr_p=<optimized out>, need_thr_progress=0x7fdc4743ebd8, ops_limit=20,
> use_limit=<optimized out>, allctr_locked=0,
>     allctr=0xc6cf40) at beam/erl_alloc_util.c:2178
> #6  erts_alcu_check_delayed_dealloc (allctr=0xc6cf40, limit=limit@REDACTED=1,
> need_thr_progress=need_thr_progress@REDACTED=0x7fdc4743ebd8,
> thr_prgr_p=thr_prgr_p@REDACTED=0x7fdc4743ebe0,
>     more_work=more_work@REDACTED=0x7fdc4743ebdc) at
> beam/erl_alloc_util.c:2280
> #7  0x0000000000490323 in erts_alloc_scheduler_handle_delayed_dealloc
> (vesdp=0x7fdcc7dc2f40, need_thr_progress=need_thr_progress@REDACTED
> =0x7fdc4743ebd8,
>     thr_prgr_p=thr_prgr_p@REDACTED=0x7fdc4743ebe0, more_work=more_work@REDACTED=0x7fdc4743ebdc)
> at beam/erl_alloc.c:1895
> #8  0x00000000004625f2 in handle_delayed_dealloc_thr_prgr (waiting=0,
> aux_work=1061, awdp=0x7fdcc7dc3058) at beam/erl_process.c:2100
> #9  handle_aux_work (awdp=awdp@REDACTED=0x7fdcc7dc3058,
> orig_aux_work=orig_aux_work@REDACTED=1061, waiting=waiting@REDACTED=0) at
> beam/erl_process.c:2595
> #10 0x000000000046050c in erts_schedule () at beam/erl_process.c:9457
> #11 0x0000000000451ec0 in process_main () at beam/beam_emu.c:690
> #12 0x000000000044df25 in sched_thread_func (vesdp=0x7fdcc7dc2f40) at
> beam/erl_process.c:8462
> #13 0x0000000000682e79 in thr_wrapper (vtwd=0x7ffd0a2b5ff0) at
> pthread/ethread.c:118
> #14 0x00007fdd0abe46ba in start_thread (arg=0x7fdc4743f700) at
> pthread_create.c:333
> #15 0x00007fdd0a71241d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>
> --
>
> The node printed in the Erlang log, 'iswitch@REDACTED', is indeed down,
> and remains down for long periods of time. However, the node that
> crashed continuously attempts to contact it via rpc:cast so that it can
> be aware when the node comes back up.
>
> Is anyone aware of recent patches that would address this crash? Or any
> pointers on where to continue our debugging?
>
> Thanks,
>
> Jesse Stimpson
>

This issue is probably fixed in OTP 21.3.8.11:

  OTP-16224    Application(s): erts
               Related Id(s): ERL-1044

               Fix bug causing VM crash due to memory corruption of
               distribution entry. Probability of crash increases if
               Erlang distribution is frequently disconnected and
               reestablished towards same node names. Bug exists since
               OTP-21.0.

I'd recommend taking the latest patch though (for OTP 21 that is currently
OTP 21.3.8.15).
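
If you want to double-check what a node is actually running, note that
erlang:system_info(otp_release) only reports the major release ("21");
the full patch level (e.g. "21.3.8.15") is in the OTP_VERSION file of the
installed release. A small sketch, assuming a standard install layout:

    %% Returns e.g. "21.3.8.15" for a standard OTP installation.
    otp_patch_version() ->
        File = filename:join([code:root_dir(), "releases",
                              erlang:system_info(otp_release), "OTP_VERSION"]),
        {ok, Bin} = file:read_file(File),
        string:trim(binary_to_list(Bin)).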

Regards,
Rickard
-- 
Rickard Green, Erlang/OTP, Ericsson AB