erlang (rabbitmq) generating core on Solaris SPARC
Mikael Pettersson
mikpelinux@REDACTED
Thu May 14 12:09:24 CEST 2020
On Thu, May 14, 2020 at 9:32 AM Pooja Desai <pooja.desai10@REDACTED> wrote:
>
> Hi Mikael,
>
>
> Please find the files you requested attached as erl_files.tar.gz (compressed because of mail size limits).
>
> Normal build option is:
>
> # gcc -Werror=undef -Werror=implicit -Werror=return-type -m64 -g -O3 -fomit-frame-pointer -Ierlang/src/solaris/otp/erts/sparc-sun-solaris2.10 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -fno-tree-copyrename -DHAVE_CONFIG_H -Wall -Wstrict-prototypes -Wmissing-prototypes -Wdeclaration-after-statement -DUSE_THREADS -D_THREAD_SAFE -D_REENTRANT -DPOSIX_THREADS -D_POSIX_PTHREAD_SEMANTICS -Isparc-sun-solaris2.10/opt/smp -Ibeam -Isys/unix -Isys/common -Isparc-sun-solaris2.10 -Izlib -Ipcre -Ihipe -I../include -I../include/sparc-sun-solaris2.10 -I../include/internal -I../include/internal/sparc-sun-solaris2.10 -c beam/erl_alloc_util.c -o obj/sparc-sun-solaris2.10/opt/smp/erl_alloc_util.o
>
> Following your suggestion, I changed it as below to generate the erl_alloc_util.i file:
>
> # gcc -Werror=undef -Werror=implicit -Werror=return-type -m64 -g -O3 -fomit-frame-pointer -Ierlang/src/solaris/otp/erts/sparc-sun-solaris2.10 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -fno-tree-copyrename -DHAVE_CONFIG_H -Wall -Wstrict-prototypes -Wmissing-prototypes -Wdeclaration-after-statement -DUSE_THREADS -D_THREAD_SAFE -D_REENTRANT -DPOSIX_THREADS -D_POSIX_PTHREAD_SEMANTICS -Isparc-sun-solaris2.10/opt/smp -Ibeam -Isys/unix -Isys/common -Isparc-sun-solaris2.10 -Izlib -Ipcre -Ihipe -I../include -I../include/sparc-sun-solaris2.10 -I../include/internal -I../include/internal/sparc-sun-solaris2.10 -E beam/erl_alloc_util.c -o obj/sparc-sun-solaris2.10/opt/smp/erl_alloc_util.i
>
> Also, one thing I forgot to mention: we are using gcc version 4.9.2 (GCC) to build on Solaris SPARC, as Erlang doesn't support Sun's native compiler.
I've been able to reproduce the non-atomic code for those 64-bit loads
in cpool_insert() using gcc-4.9 cross compilers to sparc64-linux, but
gcc-5.5/6.5/7.5/8.4/9.3 all emit correct code as far as I can tell.
So the solution is to upgrade your gcc (I suggest 9.3.0) and rebuild
your Erlang/OTP VM with that.
/Mikael
>
> Thanks & Regards,
> Pooja
>
> On Tue, May 12, 2020 at 10:44 PM Mikael Pettersson <mikpelinux@REDACTED> wrote:
>>
>> On Tue, May 12, 2020 at 4:18 PM Pooja Desai <pooja.desai10@REDACTED> wrote:
>> >
>> > Hi,
>> >
>> >
>> >
>> > Thanks for the response, Mikael.
>> >
>> > As per your suggestion, I am trying to write similar code to determine whether there is an issue with the Solaris SPARC compiler.
>> >
>> >
>> >
>> > But I have some doubts:
>> >
>> > 1. If there is a problem with the compiler, we should see this crash everywhere else too. Any idea why it's only reproduced here?
>> >
>> > 2. As I understand your explanation, it reads 64 bits by assembling two adjacent 32-bit fields. Will that really cause a problem in a multi-threaded program? When context switching to another thread, the OS saves the current thread's context (and hence its registers) and restores it when the thread becomes active again.
>> >
>> >
>>
>> Breaking up a 64-bit load into two 32-bit loads loses atomicity with
>> any concurrent store into that location, meaning the read may end up
>> observing a result composed of 32 bits from the old value and 32 bits
>> from the newly stored value, whereas the code expects to see either
>> the old or the new, but never this mixture. This can happen also on a
>> single-threaded CPU with preemptive multitasking.
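The torn read described above can be sketched in C as plain arithmetic. This is only an illustration: torn_read() and torn_read_example() are hypothetical names, and the choice of which pointer was the "old" and which the "new" value is an assumption made for the example, using two addresses that appear in the register dump quoted further down. It models a reader that loads the high word before a concurrent store and the low word after it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the value a reader ends up with when its first
 * 32-bit load (high word) sees the old value and its second 32-bit load
 * (low word) sees the newly stored value. */
uint64_t torn_read(uint64_t old_value, uint64_t new_value)
{
    uint32_t hi = (uint32_t)(old_value >> 32); /* loaded before the store */
    uint32_t lo = (uint32_t)new_value;         /* loaded after the store */
    return ((uint64_t)hi << 32) | lo;
}

/* Worked example; the old/new roles are assumed, not known. */
void torn_read_example(void)
{
    uint64_t old_value = 0xffffffff7a400018ULL; /* a heap-like address (%o3) */
    uint64_t new_value = 0x00000001004631a0ULL; /* firstfit_carrier_pool */

    /* Neither the old nor the new pointer, but a mixture of both. */
    assert(torn_read(old_value, new_value) == 0xffffffff004631a0ULL);
}
```

The mixed result has the same shape as the clobbered address seen in %g4, which is consistent with a torn read but does not by itself prove which store raced with the load.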
>>
>> To move forward on the issue, I think you need to recreate the
>> pre-processed source for erl_alloc_util.c. To do that:
>> 1. Compile Erlang/OTP as usual, starting from a pristine source
>> directory (no left-overs from a previous build, best is to start fresh
>> somewhere), but pass "V=1" to make. Save the output from "make" in a
>> file.
>> 2. Note the step where it compiles erl_alloc_util.c.
>> 3. Reexecute that step, but replace any "-c" with "-E" and "-o
>> erl_alloc_util.o" with "-o erl_alloc_util.i".
>> 4. Please send this ".i" file, together with the exact build steps and
>> configuration options you used, and
>> "erts/sparc-sun-solaris11/config.h" (I'm guessing the file name here)
>> to me.
>>
>> My theory is that Erlang/OTP selects the wrong low-level primitives
>> for this platform.
>>
>>
>> >
>> >
>> > Thanks & Regards,
>> >
>> > Pooja
>> >
>> >
>> > On Mon, May 11, 2020 at 10:36 PM Mikael Pettersson <mikpelinux@REDACTED> wrote:
>> >>
>> >> Hello Pooja,
>> >>
>> >> On Mon, May 11, 2020 at 8:10 AM Pooja Desai <pooja.desai10@REDACTED> wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > Facing erlang core issue on solaris SPARC setup while running RabbitMQ
>> >>
>> >> This looks like a 64-bit build, but the code doesn't look similar to
>> >> what I get with gcc-9.3, so I'm assuming you used Sun's compiler?
>> >>
>> >>
>> >> > (dbx) where
>> >> >
>> >> > =>[1] cpool_insert(0x1004efd40, 0xffffffff75600000, 0x61850, 0xffffffff75600018, 0x90f, 0x1004effd0), at 0x10006db14
>> >> >
>> >> > [2] abandon_carrier(0x1004efd40, 0xffffffff75600000, 0xffffffff75645ec0, 0xffffffff77d03818, 0x0, 0x6), at 0x10006de3c
>> >> >
>> >> > [3] 17(0x1004efd40, 0xcb3, 0x2, 0xffffffff75645e60, 0x0, 0x1004efd40), at 0x10006e958
>> >> >
>> >> > [4] erts_alcu_check_delayed_dealloc(0x1004efd40, 0x1, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x8000000000000007), at 0x100075244
>> >> >
>> >> > [5] erts_alloc_scheduler_handle_delayed_dealloc(0xffffffff3a82a620, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x100464, 0xffffffff3a82a5d0),
>> >> >
>> >> > at 0x1000622c0
>> >> >
>> >> > [6] handle_aux_work(0xffffffff3a8204a0, 0x2, 0x1, 0x2, 0x100400, 0x4e5ce123), at 0x1002a6044
>> >> >
>> >> > [7] erts_schedule(0xffffffff3a820380, 0x9, 0x9, 0xffffffff3a81fc80, 0x2, 0x2), at 0x1002a3040
>> >> >
>> >> > [8] process_main(0x100469, 0xffffffff3a302240, 0xfa0, 0x802a, 0xffffffff38f00438, 0x3), at 0x1002901bc
>> >> >
>> >> > [9] sched_thread_func(0xffffffff3a820380, 0x0, 0x0, 0xffffffff7a911240, 0x100000, 0x1), at 0x100038f08
>> >> >
>> >> > [10] thr_wrapper(0xffffffff7fffc278, 0x0, 0x0, 0x100289d48, 0xffffffff3a820380, 0x100038da0), at 0x100289dc8
>> >> >
>> >> >
>> >> >
>> >> > This issue is extremely intermittent, so I am not able to reproduce it with a debug build. On our test setup I have seen this core twice, and only on the Solaris SPARC server; on other servers (RHEL, SUSE Linux, Solaris x86, Windows etc.) with a similar test environment things are working fine.
>> >> >
>> >> > In both instances where I hit this issue we were restarting the RabbitMQ server, i.e. stopping RabbitMQ and epmd and then running the startup script for RabbitMQ. The script performs two operations:
>> >> >
>> >> > First it pings RabbitMQ using "rabbitmqctl ping" to confirm RabbitMQ is not already running (I guess this also starts epmd in the background), and then it starts rabbitmq-server in detached mode.
>> >> >
>> >> > The core is generated while starting this daemon.
>> >> >
>> >> >
>> >> > I checked the code around abandon_carrier ("https://github.com/erlang/otp/blame/master/erts/emulator/beam/erl_alloc_util.c") but nothing has changed in that area recently, so I am really clueless.
>> >> >
>> >> > Please let me know if anyone has faced a similar issue in the past or has any idea about this. We are using OTP version 22.2 and RabbitMQ version 3.7.23.
>> >> >
>> >> > Let me know if any further information is required. Pasting the full core dump information below:
>> >> >
>> >> > debugging core file of beam.smp (64-bit) from hostname01
>> >> > file: temp_dir/erlang/erts-10.6/bin/beam.smp
>> >> > initial argv:
>> >> > /temp_dir/erlang/erts-10.6/bin/beam.smp -- -root /temp_dir/
>> >> > threading model: native threads
>> >> > status: process terminated by SIGSEGV (Segmentation Fault), addr=
>> >> > ffffffff004631b0
>> >>
>> >> Ok, this tells us the address was unmapped. (It's not an alignment
>> >> fault, another common issue on SPARC.)
>> >>
>> >>
>> >> >
>> >> > C++ symbol demangling enabled
>> >> >
>> >> > # stack
>> >> >
>> >> > cpool_insert+0xd0(10051c500, ffffffff7a400000, ffffffff7a441de8, ffffffff7c903818, 0, 23)
>> >> > dealloc_block.part.17+0x1c0(10051c500, cb3, 2, ffffffff7a441d88, 0, 10051c500)
>> >> > erts_alcu_check_delayed_dealloc+0xe4(10051c500, 1, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 8000000000000007)
>> >> > erts_alloc_scheduler_handle_delayed_dealloc+0x34(ffffffff3b729c20, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 100464, ffffffff3b729bd0)
>> >> > handle_aux_work+0xa50(ffffffff3b71faa0, 402, 1, 402, 100400, 42da0c68)
>> >> > erts_schedule+0x192c(ffffffff3b71f980, 9, 9, ffffffff3b71f280, 402, 2)
>> >> > process_main+0xc4(100469, ffffffff3b202240, fa0, ffffffff3b71f980, 241, 100294204)
>> >> > sched_thread_func+0x168(ffffffff3b71f980, 0, 0, ffffffff39401a40, 100000, 1)
>> >> > thr_wrapper+0x80(ffffffff7fffb318, 0, 0, 100289d48, ffffffff3b71f980, 100038da0)
>> >> > libc.so.1`_lwp_start(0, 0, 0, 0, 0, 0)
>> >> >
>> >> > #############################################################################
>> >> >
>> >> > # registers
>> >> >
>> >> > %g0 = 0x0000000000000000 %l0 = 0xffffffff7a4307a0
>> >> > %g1 = 0xffffffff004631a1 %l1 = 0x0000000000000000
>> >> > %g2 = 0x0000000000000000 %l2 = 0x0000000000000000
>> >> > %g3 = 0x000000010051c798 %l3 = 0x0000000000000000
>> >> > %g4 = 0xffffffff004631a0 %l4 = 0x0000000000000000
>> >> > %g5 = 0x00000001004631a0 beam.smp`firstfit_carrier_pool %l5 = 0x0000000000000000
>> >>
>> >> This is interesting. Notice how the low 32-bits 004631a0 show up in
>> >> three variations:
>> >> 1. 00000001004631a0 beam.smp`firstfit_carrier_pool (the address of the
>> >> firstfit_carrier_pool global variable)
>> >> 2. ffffffff004631a0 (the above, but with the high 32 bits replaced
>> >> with all-bits-one)
>> >> 3. ffffffff004631a1 (the above, but with a tag in the low bit)
>> >>
>> >> > %g6 = 0x0000000000000000 %l6 = 0x0000000000000000
>> >> > %g7 = 0xffffffff39401a40 %l7 = 0x0000000000000000
>> >> > %o0 = 0x000000010051c500 %i0 = 0x000000010051c500
>> >> > %o1 = 0xffffffff7a400000 %i1 = 0xffffffff7a400000
>> >> > %o2 = 0x00000000000676c0 %i2 = 0xffffffff7a441de8
>> >> > %o3 = 0xffffffff7a400018 %i3 = 0xffffffff7c903818
>> >> > %o4 = 0x00000000000007b9 %i4 = 0x0000000000000000
>> >> > %o5 = 0x000000010051c790 %i5 = 0x0000000000000023
>> >> > %o6 = 0xffffffff7c902eb1 %i6 = 0xffffffff7c902f61
>> >> > %o7 = 0x000000010006de3c abandon_carrier+0x118 %i7 = 0x000000010006e958 dealloc_block.part.17+0x1c0
>> >> >
>> >> > %ccr = 0x44 xcc=nZvc icc=nZvc
>> >> > %y = 0x0000000000000000
>> >> > %pc = 0x000000010006db14 cpool_insert+0xd0
>> >> > %npc = 0x000000010006db18 cpool_insert+0xd4
>> >> > %sp = 0xffffffff7c902eb1
>> >> > %fp = 0xffffffff7c902f61
>> >> >
>> >> > %asi = 0x82
>> >> > %fprs = 0x00
>> >> >
>> >> > # disassembly around pc
>> >> >
>> >> > cpool_insert+0xa8: mov %g1, %g2
>> >> > cpool_insert+0xac: ldx [%g5 + 0x10], %g1
>> >> > cpool_insert+0xb0: membar #LoadLoad|#LoadStore
>> >> > cpool_insert+0xb4: ba,pt %xcc, +0x1c <cpool_insert+0xd0>
>> >> > cpool_insert+0xb8: and %g1, -0x4, %g4
>> >>
>> >> > cpool_insert+0xbc: membar #LoadLoad|#LoadStore
>> >> > cpool_insert+0xc0: and %g2, 0x3, %g3
>> >> > cpool_insert+0xc4: brz,pn %g3, +0x1ec <cpool_insert+0x2b0>
>> >> > cpool_insert+0xc8: mov %g2, %g1
>> >> > cpool_insert+0xcc: and %g1, -0x4, %g4
>> >> > cpool_insert+0xd0: ld [%g4 + 0x10], %g1
>> >>
>> >> This is the faulting instruction. We're in the /* Find a predecessor
>> >> to be, and set mod marker on its next ptr */ loop.
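As a cross-check, the address used by the faulting load can be recomputed from the register dump. The sketch below is plain arithmetic under stated assumptions: the helper names are hypothetical, and the two-bit tag mask is inferred from the "and %g2, 0x3" and "and %g1, -0x4" instructions in the disassembly rather than taken from the Erlang/OTP sources.

```c
#include <assert.h>
#include <stdint.h>

/* Low two bits are treated as tag bits (assumption based on the
 * "and ..., 0x3" / "and ..., -0x4" pair in the disassembly). */
#define TAG_MASK ((uint64_t)0x3)

/* Mirrors "and %g1, -0x4, %g4": clear the tag bits to get the pointer. */
uint64_t strip_tag(uint64_t tagged)
{
    return tagged & ~TAG_MASK;
}

/* Mirrors "and %g2, 0x3, %g3": extract the tag bits. */
unsigned tag_bits(uint64_t tagged)
{
    return (unsigned)(tagged & TAG_MASK);
}

/* Address used by "ld [%g4 + 0x10], %g1". */
uint64_t load_address(uint64_t base, uint64_t offset)
{
    return base + offset;
}
```

With %g1 = 0xffffffff004631a1, strip_tag() gives 0xffffffff004631a0 (the value in %g4), tag_bits() gives 1, and load_address(0xffffffff004631a0, 0x10) gives 0xffffffff004631b0, which matches the SIGSEGV address reported in the core file.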
>> >>
>> >> > cpool_insert+0xd4: ld [%g4 + 0x14], %g2
>> >> > cpool_insert+0xd8: sllx %g1, 0x20, %g1
>> >> > cpool_insert+0xdc: cmp %g5, %g4
>> >> > cpool_insert+0xe0: bne,pt %xcc, -0x24 <cpool_insert+0xbc>
>> >> > cpool_insert+0xe4: or %g2, %g1, %g2
>> >>
>> >> The above reads a 64-bit "->next" pointer by assembling two adjacent
>> >> 32-bit fields. Weird, but arithmetically Ok.
>> >>
>> >> Two things strike me:
>> >> 1. The compiler implements "atomic load of 64-bits" as "load 32 bits,
>> >> load another 32 bits, combine", which isn't correct in a multithreaded
>> >> program. The error could be in the compiler, or in the source code.
>> >> 2. In the register dump it was obvious that the high bits of an
>> >> address had been clobbered.
>> >>
>> >> My suspicion is that either Sun's compiler is buggy, or Erlang is
>> >> selecting non thread-safe code in this case.
>> >>
>> >> On SPARC64 Linux w/ GCC I get very different code that uses "ldx" for
>> >> those 64-bit loads, as expected.
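For comparison, here is a minimal sketch of how a 64-bit "next" field can be read and written atomically using GCC's __atomic builtins. It is not the Erlang/OTP implementation, which uses its own low-level primitives; node_t, read_next() and write_next() are hypothetical names introduced only for illustration. On a 64-bit target a correct compiler is expected to turn the load into a single 64-bit instruction (ldx on SPARC V9) instead of two 32-bit loads.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical list node with a 64-bit link word. */
typedef struct {
    uint64_t next;
} node_t;

/* Atomic 64-bit load with acquire ordering: the reader sees either the
 * old or the new value, never a mixture of the two. */
uint64_t read_next(node_t *n)
{
    return __atomic_load_n(&n->next, __ATOMIC_ACQUIRE);
}

/* Atomic 64-bit store with release ordering. */
void write_next(node_t *n, uint64_t value)
{
    __atomic_store_n(&n->next, value, __ATOMIC_RELEASE);
}
```

One way to check a given compiler is to build a file like this with -m64 -O3 -S and inspect the generated assembly for read_next().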
>> >>
>> >> /Mikael