erlang (rabbitmq) generating core on Solaris SPARC

Mikael Pettersson mikpelinux@REDACTED
Mon May 11 19:06:18 CEST 2020


Hello Pooja,

On Mon, May 11, 2020 at 8:10 AM Pooja Desai <pooja.desai10@REDACTED> wrote:
>
> Hi,
>
> Facing erlang core issue on solaris SPARC setup while running RabbitMQ

This looks like a 64-bit build, but the code doesn't look similar to
what I get with gcc-9.3, so I'm assuming you used Sun's compiler?


> (dbx) where
>
> =>[1] cpool_insert(0x1004efd40, 0xffffffff75600000, 0x61850, 0xffffffff75600018, 0x90f, 0x1004effd0), at 0x10006db14
>
>   [2] abandon_carrier(0x1004efd40, 0xffffffff75600000, 0xffffffff75645ec0, 0xffffffff77d03818, 0x0, 0x6), at 0x10006de3c
>
>   [3] 17(0x1004efd40, 0xcb3, 0x2, 0xffffffff75645e60, 0x0, 0x1004efd40), at 0x10006e958
>
>   [4] erts_alcu_check_delayed_dealloc(0x1004efd40, 0x1, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x8000000000000007), at 0x100075244
>
>   [5] erts_alloc_scheduler_handle_delayed_dealloc(0xffffffff3a82a620, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x100464, 0xffffffff3a82a5d0),
>
> at 0x1000622c0
>
>   [6] handle_aux_work(0xffffffff3a8204a0, 0x2, 0x1, 0x2, 0x100400, 0x4e5ce123), at 0x1002a6044
>
>   [7] erts_schedule(0xffffffff3a820380, 0x9, 0x9, 0xffffffff3a81fc80, 0x2, 0x2), at 0x1002a3040
>
>   [8] process_main(0x100469, 0xffffffff3a302240, 0xfa0, 0x802a, 0xffffffff38f00438, 0x3), at 0x1002901bc
>
>   [9] sched_thread_func(0xffffffff3a820380, 0x0, 0x0, 0xffffffff7a911240, 0x100000, 0x1), at 0x100038f08
>
>   [10] thr_wrapper(0xffffffff7fffc278, 0x0, 0x0, 0x100289d48, 0xffffffff3a820380, 0x100038da0), at 0x100289dc8
>
>
>
> This issue is extremely intermittent, so I am not able to reproduce it with a debug build. On our test setup I have seen this core only twice, and only on the Solaris SPARC server; other servers (RHEL, SUSE Linux, Solaris x86, Windows, etc.) with a similar test environment work fine.
>
> In the two instances when I faced this issue we were restarting the RabbitMQ server, i.e. stopping RabbitMQ and epmd, then running the startup script for RabbitMQ. This performs two operations:
>
> First, ping RabbitMQ using "rabbitmqctl ping" to confirm RabbitMQ is not already running (I guess in the background this will also start epmd), and then start rabbitmq-server in detached mode.
>
> The core is generated while starting this daemon.
>
>
> I checked the code around abandon_carrier (https://github.com/erlang/otp/blame/master/erts/emulator/beam/erl_alloc_util.c) but nothing has changed in that area recently, so I am really clueless about this situation.
>
> Please let me know if anyone has faced a similar issue in the past or has any idea about this. We are using OTP version 22.2 and RabbitMQ version 3.7.23.
>
> Let me know if any further information is required; I am pasting the full core dump information below:
>
> debugging core file of beam.smp (64-bit) from hostname01
> file: temp_dir/erlang/erts-10.6/bin/beam.smp
> initial argv:
> /temp_dir/erlang/erts-10.6/bin/beam.smp -- -root /temp_dir/
> threading model: native threads
> status: process terminated by SIGSEGV (Segmentation Fault), addr=ffffffff004631b0

Ok, this tells us the address was unmapped.  (It's not an alignment
fault, another common issue on SPARC.)


>
> C++ symbol demangling enabled
>
> # stack
>
> cpool_insert+0xd0(10051c500, ffffffff7a400000, ffffffff7a441de8, ffffffff7c903818, 0, 23)
> dealloc_block.part.17+0x1c0(10051c500, cb3, 2, ffffffff7a441d88, 0, 10051c500)
> erts_alcu_check_delayed_dealloc+0xe4(10051c500, 1, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 8000000000000007)
> erts_alloc_scheduler_handle_delayed_dealloc+0x34(ffffffff3b729c20, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 100464, ffffffff3b729bd0)
> handle_aux_work+0xa50(ffffffff3b71faa0, 402, 1, 402, 100400, 42da0c68)
> erts_schedule+0x192c(ffffffff3b71f980, 9, 9, ffffffff3b71f280, 402, 2)
> process_main+0xc4(100469, ffffffff3b202240, fa0, ffffffff3b71f980, 241, 100294204)
> sched_thread_func+0x168(ffffffff3b71f980, 0, 0, ffffffff39401a40, 100000, 1)
> thr_wrapper+0x80(ffffffff7fffb318, 0, 0, 100289d48, ffffffff3b71f980, 100038da0)
> libc.so.1`_lwp_start(0, 0, 0, 0, 0, 0)
>
> #############################################################################
>
> # registers
>
> %g0 = 0x0000000000000000                 %l0 = 0xffffffff7a4307a0
> %g1 = 0xffffffff004631a1                 %l1 = 0x0000000000000000
> %g2 = 0x0000000000000000                 %l2 = 0x0000000000000000
> %g3 = 0x000000010051c798                 %l3 = 0x0000000000000000
> %g4 = 0xffffffff004631a0                 %l4 = 0x0000000000000000
> %g5 = 0x00000001004631a0 beam.smp`firstfit_carrier_pool %l5 = 0x0000000000000000

This is interesting.  Notice how the low 32 bits, 004631a0, show up in
three variations:
1. 00000001004631a0 beam.smp`firstfit_carrier_pool (the address of the
firstfit_carrier_pool global variable)
2. ffffffff004631a0 (the above, but with the high 32 bits replaced
with all-bits-one)
3. ffffffff004631a1 (the above, but with a tag in the low bit)

> %g6 = 0x0000000000000000                 %l6 = 0x0000000000000000
> %g7 = 0xffffffff39401a40                 %l7 = 0x0000000000000000
> %o0 = 0x000000010051c500                 %i0 = 0x000000010051c500
> %o1 = 0xffffffff7a400000                 %i1 = 0xffffffff7a400000
> %o2 = 0x00000000000676c0                 %i2 = 0xffffffff7a441de8
> %o3 = 0xffffffff7a400018                 %i3 = 0xffffffff7c903818
> %o4 = 0x00000000000007b9                 %i4 = 0x0000000000000000
> %o5 = 0x000000010051c790                 %i5 = 0x0000000000000023
> %o6 = 0xffffffff7c902eb1                 %i6 = 0xffffffff7c902f61
> %o7 = 0x000000010006de3c abandon_carrier+0x118 %i7 = 0x000000010006e958 dealloc_block.part.17+0x1c0
>
>  %ccr = 0x44 xcc=nZvc icc=nZvc
>    %y = 0x0000000000000000
>   %pc = 0x000000010006db14 cpool_insert+0xd0
>  %npc = 0x000000010006db18 cpool_insert+0xd4
>   %sp = 0xffffffff7c902eb1
>   %fp = 0xffffffff7c902f61
>
>  %asi = 0x82
> %fprs = 0x00
>
> # disassembly around pc
>
> cpool_insert+0xa8:              mov       %g1, %g2
> cpool_insert+0xac:              ldx       [%g5 + 0x10], %g1
> cpool_insert+0xb0:              membar    #LoadLoad|#LoadStore
> cpool_insert+0xb4:              ba,pt     %xcc, +0x1c   <cpool_insert+0xd0>
> cpool_insert+0xb8:              and       %g1, -0x4, %g4

> cpool_insert+0xbc:              membar    #LoadLoad|#LoadStore
> cpool_insert+0xc0:              and       %g2, 0x3, %g3
> cpool_insert+0xc4:              brz,pn    %g3, +0x1ec   <cpool_insert+0x2b0>
> cpool_insert+0xc8:              mov       %g2, %g1
> cpool_insert+0xcc:              and       %g1, -0x4, %g4
> cpool_insert+0xd0:              ld        [%g4 + 0x10], %g1

This is the faulting instruction. We're in the /* Find a predecessor
to be, and set mod marker on its next ptr */ loop.

> cpool_insert+0xd4:              ld        [%g4 + 0x14], %g2
> cpool_insert+0xd8:              sllx      %g1, 0x20, %g1
> cpool_insert+0xdc:              cmp       %g5, %g4
> cpool_insert+0xe0:              bne,pt    %xcc, -0x24   <cpool_insert+0xbc>
> cpool_insert+0xe4:              or        %g2, %g1, %g2

The above reads a 64-bit "->next" pointer by assembling two adjacent
32-bit fields.  Weird, but arithmetically Ok.

Two things strike me:
1. The compiler implements "atomic load of 64-bits" as "load 32 bits,
load another 32 bits, combine", which isn't correct in a multithreaded
program.  The error could be in the compiler, or in the source code.
2. In the register dump it was obvious that the high bits of an
address had been clobbered.

My suspicion is that either Sun's compiler is buggy, or the Erlang
build is selecting non-thread-safe code in this case.

On SPARC64 Linux w/ GCC I get very different code that uses "ldx" for
those 64-bit loads, as expected.

/Mikael


More information about the erlang-questions mailing list