erlang (rabbitmq) generating core on Solaris SPARC

Mikael Pettersson mikpelinux@REDACTED
Tue May 12 19:13:55 CEST 2020


On Tue, May 12, 2020 at 4:18 PM Pooja Desai <pooja.desai10@REDACTED> wrote:
>
> Hi,
>
>
>
> Thanks for the response, Mikael.
>
> As per your suggestion, I am trying to write similar code to determine whether there is some issue with the Solaris SPARC compiler.
>
>
>
> But I have some doubts,
>
> 1.     If there is a problem with the compiler, then we should be able to see this crash everywhere else too. Any idea why it is only reproduced here?
>
> 2.     As I understand your explanation, it reads 64 bits by assembling two adjacent 32-bit fields. Will it really cause a problem in a multi-threaded program? Considering that, while context switching to another thread, the OS will save the current context of the thread (and hence its registers) and will restore it when the thread is active again.
>
>

Breaking up a 64-bit load into two 32-bit loads loses atomicity with
respect to any concurrent store into that location, meaning the read
may end up observing a result composed of 32 bits from the old value
and 32 bits from the newly stored value, whereas the code expects to
see either the old value or the new one, but never this mixture.  This
can happen even on a single-core CPU with preemptive multitasking.

To move forward on the issue, I think you need to recreate the
preprocessed source for erl_alloc_util.c.  To do that:
1. Compile Erlang/OTP as usual, starting from a pristine source
directory (no leftovers from a previous build; best is to start fresh
somewhere), but pass "V=1" to make.  Save the output from "make" in a
file.
2. Note the step where it compiles erl_alloc_util.c.
3. Re-execute that step, but replace any "-c" with "-E" and "-o
erl_alloc_util.o" with "-o erl_alloc_util.i".
4. Please send this ".i" file, together with the exact build steps and
configuration options you used, and
"erts/sparc-sun-solaris11/config.h" (I'm guessing the file name here)
to me.

My theory is that Erlang/OTP selects the wrong low-level primitives
for this platform.


>
>
> Thanks & Regards,
>
> Pooja
>
>
> On Mon, May 11, 2020 at 10:36 PM Mikael Pettersson <mikpelinux@REDACTED> wrote:
>>
>> Hello Pooja,
>>
>> On Mon, May 11, 2020 at 8:10 AM Pooja Desai <pooja.desai10@REDACTED> wrote:
>> >
>> > Hi,
>> >
>> > Facing an Erlang core issue on a Solaris SPARC setup while running RabbitMQ
>>
>> This looks like a 64-bit build, but the code doesn't look similar to
>> what I get with gcc-9.3, so I'm assuming you used Sun's compiler?
>>
>>
>> > (dbx) where
>> >
>> > =>[1] cpool_insert(0x1004efd40, 0xffffffff75600000, 0x61850, 0xffffffff75600018, 0x90f, 0x1004effd0), at 0x10006db14
>> >
>> >   [2] abandon_carrier(0x1004efd40, 0xffffffff75600000, 0xffffffff75645ec0, 0xffffffff77d03818, 0x0, 0x6), at 0x10006de3c
>> >
>> >   [3] 17(0x1004efd40, 0xcb3, 0x2, 0xffffffff75645e60, 0x0, 0x1004efd40), at 0x10006e958
>> >
>> >   [4] erts_alcu_check_delayed_dealloc(0x1004efd40, 0x1, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x8000000000000007), at 0x100075244
>> >
>> >   [5] erts_alloc_scheduler_handle_delayed_dealloc(0xffffffff3a82a620, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x100464, 0xffffffff3a82a5d0),
>> >
>> > at 0x1000622c0
>> >
>> >   [6] handle_aux_work(0xffffffff3a8204a0, 0x2, 0x1, 0x2, 0x100400, 0x4e5ce123), at 0x1002a6044
>> >
>> >   [7] erts_schedule(0xffffffff3a820380, 0x9, 0x9, 0xffffffff3a81fc80, 0x2, 0x2), at 0x1002a3040
>> >
>> >   [8] process_main(0x100469, 0xffffffff3a302240, 0xfa0, 0x802a, 0xffffffff38f00438, 0x3), at 0x1002901bc
>> >
>> >   [9] sched_thread_func(0xffffffff3a820380, 0x0, 0x0, 0xffffffff7a911240, 0x100000, 0x1), at 0x100038f08
>> >
>> >   [10] thr_wrapper(0xffffffff7fffc278, 0x0, 0x0, 0x100289d48, 0xffffffff3a820380, 0x100038da0), at 0x100289dc8
>> >
>> >
>> >
>> > This issue is extremely intermittent, so I am not able to reproduce it with a debug build. On our test setup I have seen this core twice, and only on the Solaris SPARC server; on the other servers (RHEL, SUSE Linux, Solaris x86, Windows, etc.) with a similar test environment, things are working fine.
>> >
>> > In two instances when I faced this issue we were restarting the RabbitMQ server, i.e. stop RabbitMQ and epmd, then run the startup script for RabbitMQ. This performs 2 operations:
>> >
>> > First, ping RabbitMQ using "rabbitmqctl ping" to confirm RabbitMQ is not already running (I guess in the background this will also start epmd), and then start rabbitmq-server in detached mode.
>> >
>> > The core is generated while starting this daemon.
>> >
>> >
>> > I checked the code around abandon_carrier ("https://github.com/erlang/otp/blame/master/erts/emulator/beam/erl_alloc_util.c"), but nothing changed in that area recently, so I am really at a loss.
>> >
>> > Please let me know if anyone has faced a similar issue in the past or has any idea around this. Using OTP version 22.2 and RabbitMQ version 3.7.23.
>> >
>> > Let me know if any further information is required; pasting the full core dump information below:
>> >
>> > debugging core file of beam.smp (64-bit) from hostname01
>> > file: temp_dir/erlang/erts-10.6/bin/beam.smp
>> > initial argv:
>> > /temp_dir/erlang/erts-10.6/bin/beam.smp -- -root /temp_dir/
>> > threading model: native threads
>> > status: process terminated by SIGSEGV (Segmentation Fault), addr=
>> > ffffffff004631b0
>>
>> Ok, this tells us the address was unmapped.  (It's not an alignment
>> fault, another common issue on SPARC.)
>>
>>
>> >
>> > C++ symbol demangling enabled
>> >
>> > # stack
>> >
>> > cpool_insert+0xd0(10051c500, ffffffff7a400000, ffffffff7a441de8, ffffffff7c903818, 0, 23)
>> > dealloc_block.part.17+0x1c0(10051c500, cb3, 2, ffffffff7a441d88, 0, 10051c500)
>> > erts_alcu_check_delayed_dealloc+0xe4(10051c500, 1, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 8000000000000007)
>> > erts_alloc_scheduler_handle_delayed_dealloc+0x34(ffffffff3b729c20, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 100464, ffffffff3b729bd0)
>> > handle_aux_work+0xa50(ffffffff3b71faa0, 402, 1, 402, 100400, 42da0c68)
>> > erts_schedule+0x192c(ffffffff3b71f980, 9, 9, ffffffff3b71f280, 402, 2)
>> > process_main+0xc4(100469, ffffffff3b202240, fa0, ffffffff3b71f980, 241, 100294204)
>> > sched_thread_func+0x168(ffffffff3b71f980, 0, 0, ffffffff39401a40, 100000, 1)
>> > thr_wrapper+0x80(ffffffff7fffb318, 0, 0, 100289d48, ffffffff3b71f980, 100038da0)
>> > libc.so.1`_lwp_start(0, 0, 0, 0, 0, 0)
>> >
>> > #############################################################################
>> >
>> > # registers
>> >
>> > %g0 = 0x0000000000000000                 %l0 = 0xffffffff7a4307a0
>> > %g1 = 0xffffffff004631a1                 %l1 = 0x0000000000000000
>> > %g2 = 0x0000000000000000                 %l2 = 0x0000000000000000
>> > %g3 = 0x000000010051c798                 %l3 = 0x0000000000000000
>> > %g4 = 0xffffffff004631a0                 %l4 = 0x0000000000000000
>> > %g5 = 0x00000001004631a0 beam.smp`firstfit_carrier_pool %l5 = 0x0000000000000000
>>
>> This is interesting.  Notice how the low 32 bits 004631a0 show up in
>> three variations:
>> 1. 00000001004631a0 beam.smp`firstfit_carrier_pool (the address of the
>> firstfit_carrier_pool global variable)
>> 2. ffffffff004631a0 (the above, but with the high 32 bits replaced
>> with all-bits-one)
>> 3. ffffffff004631a1 (the above, but with a tag in the low bit)
>>
>> > %g6 = 0x0000000000000000                 %l6 = 0x0000000000000000
>> > %g7 = 0xffffffff39401a40                 %l7 = 0x0000000000000000
>> > %o0 = 0x000000010051c500                 %i0 = 0x000000010051c500
>> > %o1 = 0xffffffff7a400000                 %i1 = 0xffffffff7a400000
>> > %o2 = 0x00000000000676c0                 %i2 = 0xffffffff7a441de8
>> > %o3 = 0xffffffff7a400018                 %i3 = 0xffffffff7c903818
>> > %o4 = 0x00000000000007b9                 %i4 = 0x0000000000000000
>> > %o5 = 0x000000010051c790                 %i5 = 0x0000000000000023
>> > %o6 = 0xffffffff7c902eb1                 %i6 = 0xffffffff7c902f61
>> > %o7 = 0x000000010006de3c abandon_carrier+0x118 %i7 = 0x000000010006e958 dealloc_block.part.17+0x1c0
>> >
>> >  %ccr = 0x44 xcc=nZvc icc=nZvc
>> >    %y = 0x0000000000000000
>> >   %pc = 0x000000010006db14 cpool_insert+0xd0
>> >  %npc = 0x000000010006db18 cpool_insert+0xd4
>> >   %sp = 0xffffffff7c902eb1
>> >   %fp = 0xffffffff7c902f61
>> >
>> >  %asi = 0x82
>> > %fprs = 0x00
>> >
>> > # disassembly around pc
>> >
>> > cpool_insert+0xa8:              mov       %g1, %g2
>> > cpool_insert+0xac:              ldx       [%g5 + 0x10], %g1
>> > cpool_insert+0xb0:              membar    #LoadLoad|#LoadStore
>> > cpool_insert+0xb4:              ba,pt     %xcc, +0x1c   <cpool_insert+0xd0>
>> > cpool_insert+0xb8:              and       %g1, -0x4, %g4
>>
>> > cpool_insert+0xbc:              membar    #LoadLoad|#LoadStore
>> > cpool_insert+0xc0:              and       %g2, 0x3, %g3
>> > cpool_insert+0xc4:              brz,pn    %g3, +0x1ec   <cpool_insert+0x2b0>
>> > cpool_insert+0xc8:              mov       %g2, %g1
>> > cpool_insert+0xcc:              and       %g1, -0x4, %g4
>> > cpool_insert+0xd0:              ld        [%g4 + 0x10], %g1
>>
>> This is the faulting instruction. We're in the /* Find a predecessor
>> to be, and set mod marker on its next ptr */ loop.
>>
>> > cpool_insert+0xd4:              ld        [%g4 + 0x14], %g2
>> > cpool_insert+0xd8:              sllx      %g1, 0x20, %g1
>> > cpool_insert+0xdc:              cmp       %g5, %g4
>> > cpool_insert+0xe0:              bne,pt    %xcc, -0x24   <cpool_insert+0xbc>
>> > cpool_insert+0xe4:              or        %g2, %g1, %g2
>>
>> The above reads a 64-bit "->next" pointer by assembling two adjacent
>> 32-bit fields.  Weird, but arithmetically Ok.
>>
>> Two things strike me:
>> 1. The compiler implements "atomic load of 64 bits" as "load 32 bits,
>> load another 32 bits, combine", which isn't correct in a multithreaded
>> program.  The error could be in the compiler or in the source code.
>> 2. In the register dump it was obvious that the high bits of an
>> address had been clobbered.
>>
>> My suspicion is that either Sun's compiler is buggy, or Erlang is
>> selecting non-thread-safe code in this case.
>>
>> On SPARC64 Linux w/ GCC I get very different code that uses "ldx" for
>> those 64-bit loads, as expected.
>>
>> /Mikael
