erlang (rabbitmq) generating core on Solaris SPARC
Pooja Desai
pooja.desai10@REDACTED
Tue May 12 16:17:56 CEST 2020
Hi,

Thanks for the response, Mikael.

As per your suggestion, I am trying to write similar code to determine
whether there is an issue with the Solaris SPARC compiler.

But I have some doubts:

1. If the problem is in the compiler, then we should be able to see this
crash elsewhere as well; any idea why it is reproduced only here?

2. As I understand your explanation, the code reads 64 bits by assembling
two adjacent 32-bit fields. Will that really cause a problem in a
multi-threaded program? When context-switching to another thread, the OS
saves the current context of the thread (and hence its registers) and
restores it when the thread becomes active again (see the sketch just
after this list).
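
For reference, this is the shape of the test I am writing (a minimal
sketch with my own names and values, not OTP code): one thread atomically
stores a 64-bit value while another reads it back as two separate 32-bit
loads, the way the compiled cpool_insert code does.

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    /* One 64-bit cell, viewed whole for the store and as two 32-bit
     * halves for the split read. */
    static union { uint64_t u64; uint32_t u32[2]; } cell;

    /* Writer: stores values whose two halves are identical, so any
     * consistent read must see equal halves on any endianness. */
    static void *writer(void *arg)
    {
        (void)arg;
        for (;;) {
            __atomic_store_n(&cell.u64, 0x1111111111111111ULL, __ATOMIC_RELAXED);
            __atomic_store_n(&cell.u64, 0x2222222222222222ULL, __ATOMIC_RELAXED);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, writer, NULL);
        for (;;) {
            /* The split read: two 32-bit loads, as in the disassembly. */
            uint32_t a = *(volatile uint32_t *)&cell.u32[0];
            uint32_t b = *(volatile uint32_t *)&cell.u32[1];
            if (a != b) {
                printf("torn read: %08x vs %08x\n", a, b);
                return 1;
            }
        }
    }

If this fires, it would mean the two-load sequence is unsafe even though
registers are preserved across context switches: on an SMP machine the
writer runs concurrently on another core, so the value can change between
the reader's two loads without any context switch happening at all.
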
Thanks & Regards,
Pooja
On Mon, May 11, 2020 at 10:36 PM Mikael Pettersson <mikpelinux@REDACTED>
wrote:
> Hello Pooja,
>
> On Mon, May 11, 2020 at 8:10 AM Pooja Desai <pooja.desai10@REDACTED>
> wrote:
> >
> > Hi,
> >
> > We are facing an Erlang core dump on a Solaris SPARC setup while running RabbitMQ.
>
> This looks like a 64-bit build, but the code doesn't look similar to
> what I get with gcc-9.3, so I'm assuming you used Sun's compiler?
>
>
> > (dbx) where
> >
> > =>[1] cpool_insert(0x1004efd40, 0xffffffff75600000, 0x61850, 0xffffffff75600018, 0x90f, 0x1004effd0), at 0x10006db14
> > [2] abandon_carrier(0x1004efd40, 0xffffffff75600000, 0xffffffff75645ec0, 0xffffffff77d03818, 0x0, 0x6), at 0x10006de3c
> > [3] dealloc_block.part.17(0x1004efd40, 0xcb3, 0x2, 0xffffffff75645e60, 0x0, 0x1004efd40), at 0x10006e958
> > [4] erts_alcu_check_delayed_dealloc(0x1004efd40, 0x1, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x8000000000000007), at 0x100075244
> > [5] erts_alloc_scheduler_handle_delayed_dealloc(0xffffffff3a82a620, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x100464, 0xffffffff3a82a5d0), at 0x1000622c0
> > [6] handle_aux_work(0xffffffff3a8204a0, 0x2, 0x1, 0x2, 0x100400, 0x4e5ce123), at 0x1002a6044
> > [7] erts_schedule(0xffffffff3a820380, 0x9, 0x9, 0xffffffff3a81fc80, 0x2, 0x2), at 0x1002a3040
> > [8] process_main(0x100469, 0xffffffff3a302240, 0xfa0, 0x802a, 0xffffffff38f00438, 0x3), at 0x1002901bc
> > [9] sched_thread_func(0xffffffff3a820380, 0x0, 0x0, 0xffffffff7a911240, 0x100000, 0x1), at 0x100038f08
> > [10] thr_wrapper(0xffffffff7fffc278, 0x0, 0x0, 0x100289d48, 0xffffffff3a820380, 0x100038da0), at 0x100289dc8
> >
> >
> >
> > This issue is extremely intermittent, so I am not able to reproduce it
> with a debug build. On our test setup I have seen this core only twice, and
> only on the Solaris SPARC server; on the other servers (RHEL, SUSE Linux,
> Solaris x86, Windows, etc.) with a similar test environment things work fine.
> >
> > In the two instances when I faced this issue we were restarting the
> RabbitMQ server, i.e. stopping RabbitMQ and epmd, then running the RabbitMQ
> startup script. This performs two operations:
> >
> > First it pings RabbitMQ using "rabbitmqctl ping" to confirm RabbitMQ is not
> already running (I guess in the background this will also start epmd), and
> then it starts rabbitmq-server in detached mode.
> >
> > The core is generated while starting this daemon.
> >
> >
> > I checked the code around abandon_carrier (
> https://github.com/erlang/otp/blame/master/erts/emulator/beam/erl_alloc_util.c)
> but nothing has changed in that area recently, so I am really at a loss here.
> >
> > Please let me know if anyone has faced a similar issue in the past or has
> any idea about this. We are using OTP version 22.2 and RabbitMQ version 3.7.23.
> >
> > Let me know if any further information is required; I am pasting the full
> core dump information below:
> >
> > debugging core file of beam.smp (64-bit) from hostname01
> > file: temp_dir/erlang/erts-10.6/bin/beam.smp
> > initial argv:
> > /temp_dir/erlang/erts-10.6/bin/beam.smp -- -root /temp_dir/
> > threading model: native threads
> > status: process terminated by SIGSEGV (Segmentation Fault), addr=ffffffff004631b0
>
> Ok, this tells us the address was unmapped. (It's not an alignment
> fault, another common issue on SPARC.)
>
>
> >
> > C++ symbol demangling enabled
> >
> > # stack
> >
> > cpool_insert+0xd0(10051c500, ffffffff7a400000, ffffffff7a441de8, ffffffff7c903818, 0, 23)
> > dealloc_block.part.17+0x1c0(10051c500, cb3, 2, ffffffff7a441d88, 0, 10051c500)
> > erts_alcu_check_delayed_dealloc+0xe4(10051c500, 1, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 8000000000000007)
> > erts_alloc_scheduler_handle_delayed_dealloc+0x34(ffffffff3b729c20, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 100464, ffffffff3b729bd0)
> > handle_aux_work+0xa50(ffffffff3b71faa0, 402, 1, 402, 100400, 42da0c68)
> > erts_schedule+0x192c(ffffffff3b71f980, 9, 9, ffffffff3b71f280, 402, 2)
> > process_main+0xc4(100469, ffffffff3b202240, fa0, ffffffff3b71f980, 241, 100294204)
> > sched_thread_func+0x168(ffffffff3b71f980, 0, 0, ffffffff39401a40, 100000, 1)
> > thr_wrapper+0x80(ffffffff7fffb318, 0, 0, 100289d48, ffffffff3b71f980, 100038da0)
> > libc.so.1`_lwp_start(0, 0, 0, 0, 0, 0)
> >
> >
> #############################################################################
> >
> > # registers
> >
> > %g0 = 0x0000000000000000 %l0 = 0xffffffff7a4307a0
> > %g1 = 0xffffffff004631a1 %l1 = 0x0000000000000000
> > %g2 = 0x0000000000000000 %l2 = 0x0000000000000000
> > %g3 = 0x000000010051c798 %l3 = 0x0000000000000000
> > %g4 = 0xffffffff004631a0 %l4 = 0x0000000000000000
> > %g5 = 0x00000001004631a0 beam.smp`firstfit_carrier_pool %l5 = 0x0000000000000000
>
> This is interesting. Notice how the low 32 bits 004631a0 show up in
> three variations:
> 1. 00000001004631a0 beam.smp`firstfit_carrier_pool (the address of the
> firstfit_carrier_pool global variable)
> 2. ffffffff004631a0 (the above, but with the high 32 bits replaced
> with all-bits-one)
> 3. ffffffff004631a1 (the above, but with a tag in the low bit)
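>
> As a hypothetical illustration of how variation 2 could come from a torn
> read: take the high half of one valid value (all-ones, as in the
> 0xffffffff7a4xxxxx heap pointers above) and the low half of the pool
> address:
>
>     #include <stdint.h>
>     #include <stdio.h>
>
>     int main(void)
>     {
>         uint32_t hi = 0xffffffffu;  /* high half of a heap pointer  */
>         uint32_t lo = 0x004631a0u;  /* low half of the pool address */
>         /* prints ffffffff004631a0, i.e. variation 2 / the value in %g4 */
>         printf("%016llx\n", (unsigned long long)(((uint64_t)hi << 32) | lo));
>         return 0;
>     }
>
> The faulting load below adds 0x10 to %g4, which gives exactly the
> SIGSEGV address ffffffff004631b0 reported in the dump.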
>
> > %g6 = 0x0000000000000000 %l6 = 0x0000000000000000
> > %g7 = 0xffffffff39401a40 %l7 = 0x0000000000000000
> > %o0 = 0x000000010051c500 %i0 = 0x000000010051c500
> > %o1 = 0xffffffff7a400000 %i1 = 0xffffffff7a400000
> > %o2 = 0x00000000000676c0 %i2 = 0xffffffff7a441de8
> > %o3 = 0xffffffff7a400018 %i3 = 0xffffffff7c903818
> > %o4 = 0x00000000000007b9 %i4 = 0x0000000000000000
> > %o5 = 0x000000010051c790 %i5 = 0x0000000000000023
> > %o6 = 0xffffffff7c902eb1 %i6 = 0xffffffff7c902f61
> > %o7 = 0x000000010006de3c abandon_carrier+0x118 %i7 = 0x000000010006e958 dealloc_block.part.17+0x1c0
> >
> > %ccr = 0x44 xcc=nZvc icc=nZvc
> > %y = 0x0000000000000000
> > %pc = 0x000000010006db14 cpool_insert+0xd0
> > %npc = 0x000000010006db18 cpool_insert+0xd4
> > %sp = 0xffffffff7c902eb1
> > %fp = 0xffffffff7c902f61
> >
> > %asi = 0x82
> > %fprs = 0x00
> >
> > # disassembly around pc
> >
> > cpool_insert+0xa8: mov %g1, %g2
> > cpool_insert+0xac: ldx [%g5 + 0x10], %g1
> > cpool_insert+0xb0: membar #LoadLoad|#LoadStore
> > cpool_insert+0xb4: ba,pt %xcc, +0x1c <cpool_insert+0xd0>
> > cpool_insert+0xb8: and %g1, -0x4, %g4
>
> > cpool_insert+0xbc: membar #LoadLoad|#LoadStore
> > cpool_insert+0xc0: and %g2, 0x3, %g3
> > cpool_insert+0xc4: brz,pn %g3, +0x1ec <cpool_insert+0x2b0>
> > cpool_insert+0xc8: mov %g2, %g1
> > cpool_insert+0xcc: and %g1, -0x4, %g4
> > cpool_insert+0xd0: ld [%g4 + 0x10], %g1
>
> This is the faulting instruction. We're in the /* Find a predecessor
> to be, and set mod marker on its next ptr */ loop.
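>
> In C terms the loop is roughly of this shape (a paraphrase of the
> disassembly, not the actual erl_alloc_util.c source; the type and the
> names are my approximations):
>
>     #include <stdint.h>
>
>     /* Hypothetical stand-in for the carrier pool node; 'next' is a
>      * tagged 64-bit pointer whose two low bits are marker bits. */
>     typedef struct cpool_node { uint64_t next; } cpool_node_t;
>
>     /* Stand-in for what should be one atomic 64-bit load; the code
>      * above instead assembles it from two 32-bit loads. */
>     static inline uint64_t load64(const uint64_t *p)
>     {
>         return *(const volatile uint64_t *)p;
>     }
>
>     /* Follow 'next' pointers, clearing the two low tag bits before
>      * dereferencing, until back at the sentinel (firstfit_carrier_pool). */
>     static cpool_node_t *walk(cpool_node_t *sentinel)
>     {
>         uint64_t next = load64(&sentinel->next);
>         cpool_node_t *p;
>         do {
>             p = (cpool_node_t *)(uintptr_t)(next & ~(uint64_t)3); /* and %g1, -0x4 */
>             next = load64(&p->next);                              /* the faulting load */
>         } while (p != sentinel);
>         return p;
>     }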
>
> > cpool_insert+0xd4: ld [%g4 + 0x14], %g2
> > cpool_insert+0xd8: sllx %g1, 0x20, %g1
> > cpool_insert+0xdc: cmp %g5, %g4
> > cpool_insert+0xe0: bne,pt %xcc, -0x24 <cpool_insert+0xbc>
> > cpool_insert+0xe4: or %g2, %g1, %g2
>
> The above reads a 64-bit "->next" pointer by assembling two adjacent
> 32-bit fields. Weird, but arithmetically Ok.
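>
> Spelled out in C, the sequence is equivalent to this (my reconstruction;
> 'carrier' stands for the untagged pointer in %g4):
>
>     #include <stdint.h>
>
>     /* Offsets 0x10/0x14 address the high and low 32-bit halves of the
>      * 64-bit 'next' field on big-endian SPARC. */
>     static uint64_t split_load_next(const char *carrier)
>     {
>         uint32_t hi = *(const uint32_t *)(carrier + 0x10); /* ld [%g4 + 0x10], %g1 */
>         uint32_t lo = *(const uint32_t *)(carrier + 0x14); /* ld [%g4 + 0x14], %g2 */
>         return ((uint64_t)hi << 32) | lo;                  /* sllx %g1, 0x20 ; or  */
>     }
>
> If another thread stores the full 64-bit field between those two loads,
> the combined value mixes the halves of two different pointers -- which
> matches the clobbered high half in the register dump.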
>
> Two things strike me:
> 1. The compiler implements "atomic load of 64-bits" as "load 32 bits,
> load another 32 bits, combine", which isn't correct in a multithreaded
> program. The error could be in the compiler, or in the source code.
> 2. In the register dump it was obvious that the high bits of an
> address had been clobbered.
>
> My suspicion is that either Sun's compiler is buggy, or Erlang is
> selecting non-thread-safe code in this case.
>
> On SPARC64 Linux w/ GCC I get very different code that uses "ldx" for
> those 64-bit loads, as expected.
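>
> For comparison, a sketch of the kind of source that gets the single
> instruction (using the GCC/Clang builtin here; ERTS itself goes through
> its own ethread atomics layer):
>
>     #include <stdint.h>
>
>     /* A correct 64-bit atomic load is one indivisible access; on
>      * SPARC64 this compiles to a single ldx instead of two ld's. */
>     static inline uint64_t atomic_load64(const uint64_t *p)
>     {
>         return __atomic_load_n(p, __ATOMIC_ACQUIRE);
>     }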
>
> /Mikael
>