Emulator segfault after net_kernel attempts to contact a down node

Jesse Stimpson jstimpson@REDACTED
Wed Apr 29 00:15:17 CEST 2020


That sounds like it. We'll try it right away. Thanks!

On Tue, Apr 28, 2020, 6:09 PM Rickard Green <rickard@REDACTED> wrote:

>> On Tue, Apr 28, 2020 at 4:08 PM Jesse Stimpson <jstimpson@REDACTED> wrote:
>
>> Hello,
>>
>> We're running a cluster of Erlang 21.2.4 instances in `-sname`
>> distributed mode, with `-proto_dist inet_tls` on Ubuntu 16.04, and have
>> evidence of a segfault in erts. Here's the log printed just before the
>> segfault:
>>
>> 2020-04-28 00:35:47.086 [error] emulator Garbage collecting distribution
>> entry for node 'iswitch@REDACTED' in state: pending connect
>> 2020-04-28 00:35:47.096 [error] <0.67.0> gen_server net_kernel terminated
>> with reason: bad return value:
>> {'EXIT',{badarg,
>>   [{erts_internal,abort_connection,['iswitch@REDACTED',{54554,#Ref<0.0.2304.1>}],[]},
>>    {net_kernel,pending_nodedown,4,[{file,"net_kernel.erl"},{line,999}]},
>>    {net_kernel,conn_own_exit,3,[{file,"net_kernel.erl"},{line,926}]},
>>    {net_kernel,do_handle_exit,3,[{file,"net_kernel.erl"},{line,894}]},
>>    {net_kernel,handle_exit,3,[{file,"net_kernel.erl"},{line,889}]},
>>    {gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,637}]},
>>    {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,...}]},...]}}
>>
>> --
>>
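>> For reference, each node is started roughly along these lines (the node
>> name and optfile path here are placeholders, not our exact setup):
>>
>>     erl -sname iswitch \
>>         -proto_dist inet_tls \
>>         -ssl_dist_optfile /etc/erlang/ssl_dist.conf
>>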
>> And from syslog:
>>
>> Apr 28 00:37:50 isw-proxy-x-pro-awsoh01 systemd[1]: iroute.service: Main
>> process exited, code=killed, status=11/SEGV
>>
>> --
>>
>> We also captured the backtrace from the CoreDump:
>>
>> (gdb) bt
>> #0  rbt_delete (root=root@entry=0x7fdc45ac00e8, del=<optimized out>)
>>     at beam/erl_ao_firstfit_alloc.c:710
>> #1  0x00000000005f339e in aoff_unlink_free_block (allctr=<optimized out>,
>>     blk=<optimized out>) at beam/erl_ao_firstfit_alloc.c:548
>> #2  0x000000000049d8e1 in mbc_free (allctr=0xc6cf40, p=<optimized out>,
>>     busy_pcrr_pp=0x7fdc4743eae0) at beam/erl_alloc_util.c:2549
>> #3  0x000000000049e23f in dealloc_block (allctr=allctr@entry=0xc6cf40,
>>     ptr=ptr@entry=0x7fdc45aff0f8, fix=fix@entry=0x0,
>>     dec_cc_on_redirect=dec_cc_on_redirect@entry=1)
>>     at beam/erl_alloc_util.c:2325
>> #4  0x00000000004a17f0 in dealloc_block (fix=0x0, dec_cc_on_redirect=1,
>>     ptr=0x7fdc45aff0f8, allctr=0xc6cf40) at beam/erl_alloc_util.c:2310
>> #5  handle_delayed_dealloc (need_more_work=<optimized out>,
>>     thr_prgr_p=<optimized out>, need_thr_progress=0x7fdc4743ebd8,
>>     ops_limit=20, use_limit=<optimized out>, allctr_locked=0,
>>     allctr=0xc6cf40) at beam/erl_alloc_util.c:2178
>> #6  erts_alcu_check_delayed_dealloc (allctr=0xc6cf40, limit=limit@entry=1,
>>     need_thr_progress=need_thr_progress@entry=0x7fdc4743ebd8,
>>     thr_prgr_p=thr_prgr_p@entry=0x7fdc4743ebe0,
>>     more_work=more_work@entry=0x7fdc4743ebdc) at beam/erl_alloc_util.c:2280
>> #7  0x0000000000490323 in erts_alloc_scheduler_handle_delayed_dealloc
>>     (vesdp=0x7fdcc7dc2f40,
>>     need_thr_progress=need_thr_progress@entry=0x7fdc4743ebd8,
>>     thr_prgr_p=thr_prgr_p@entry=0x7fdc4743ebe0,
>>     more_work=more_work@entry=0x7fdc4743ebdc) at beam/erl_alloc.c:1895
>> #8  0x00000000004625f2 in handle_delayed_dealloc_thr_prgr (waiting=0,
>>     aux_work=1061, awdp=0x7fdcc7dc3058) at beam/erl_process.c:2100
>> #9  handle_aux_work (awdp=awdp@entry=0x7fdcc7dc3058,
>>     orig_aux_work=orig_aux_work@entry=1061, waiting=waiting@entry=0)
>>     at beam/erl_process.c:2595
>> #10 0x000000000046050c in erts_schedule () at beam/erl_process.c:9457
>> #11 0x0000000000451ec0 in process_main () at beam/beam_emu.c:690
>> #12 0x000000000044df25 in sched_thread_func (vesdp=0x7fdcc7dc2f40)
>>     at beam/erl_process.c:8462
>> #13 0x0000000000682e79 in thr_wrapper (vtwd=0x7ffd0a2b5ff0)
>>     at pthread/ethread.c:118
>> #14 0x00007fdd0abe46ba in start_thread (arg=0x7fdc4743f700)
>>     at pthread_create.c:333
>> #15 0x00007fdd0a71241d in clone ()
>>     at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>
>> --
>>
>> The node printed in the Erlang log, 'iswitch@REDACTED', is indeed down,
>> and it remains down for long periods of time. However, the node that
>> crashed continuously attempts to contact it via rpc:cast so that it can
>> detect when the node comes back up.
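>>
>> Roughly, the probe looks like the sketch below (module name and timing
>> are illustrative, not our exact code): the rpc:cast to the down node
>> forces an automatic connection attempt, and net_kernel:monitor_nodes/1
>> tells us when the node is reachable again.
>>
>>     -module(node_probe).
>>     -export([start/1]).
>>
>>     start(Node) ->
>>         ok = net_kernel:monitor_nodes(true),
>>         loop(Node).
>>
>>     loop(Node) ->
>>         %% rpc:cast/4 always returns true; casting at a down node
>>         %% triggers an auto-connect attempt.
>>         true = rpc:cast(Node, erlang, node, []),
>>         receive
>>             {nodeup, Node} -> io:format("~p is back up~n", [Node])
>>         after 5000 ->
>>             loop(Node)
>>         end.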
>>
>> Is anyone aware of recent patches that would address this crash? Or any
>> pointers on where to continue our debugging?
>>
>> Thanks,
>>
>> Jesse Stimpson
>>
>
> This is probably the following issue, fixed in OTP 21.3.8.11:
>
>   OTP-16224    Application(s): erts
>                Related Id(s): ERL-1044
>
>                Fix bug causing VM crash due to memory corruption of
>                distribution entry. Probability of crash increases if
>                Erlang distribution is frequently disconnected and
>                reestablished towards same node names. Bug exists since
>                OTP-21.0.
>
> I'd recommend taking the latest patch though (for OTP 21 that is currently
> OTP 21.3.8.15).
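>
> A quick way to confirm which patch a running node actually has (just a
> sketch; the OTP_VERSION file is the documented location for an installed
> release, and the value shown assumes the node is already on 21.3.8.15):
>
>     1> erlang:system_info(otp_release).
>     "21"
>     2> {ok, Vsn} = file:read_file(filename:join(
>            [code:root_dir(), "releases",
>             erlang:system_info(otp_release), "OTP_VERSION"])), Vsn.
>     <<"21.3.8.15\n">>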
>
> Regards,
> Rickard
> --
> Rickard Green, Erlang/OTP, Ericsson AB
>