erlang (rabbitmq) generating core on Solaris SPARC

Mikael Pettersson mikpelinux@REDACTED
Wed May 20 12:56:51 CEST 2020


On Tue, May 19, 2020 at 6:24 PM Pooja Desai <pooja.desai10@REDACTED> wrote:
>
> Hi Mikael,
>
> The gcc bug mentioned above is not specific to any platform, but the problematic disassembly is only generated on Solaris SPARC. Any idea why only Erlang on Solaris SPARC is affected by this?

As I wrote, the bug affects all strict-alignment targets, and SPARC is
one of those.  Most older RISC designs are strict-alignment.
x86 is not strict-alignment for general purpose instructions, but some
of its vector instructions are.

/Mikael

> To minimise the impact on testing/sock, we are thinking about rebuilding Erlang only on Solaris SPARC for now, as the issue is only seen on that platform. In your expert opinion, do you see any problem with this approach?
>
> Thanks & Regards,
> Pooja
>
> On Fri, May 15, 2020 at 1:51 PM Pooja Desai <pooja.desai10@REDACTED> wrote:
>>
>> Thanks Mikael,
>>
>> As per your suggestion I am rebuilding Erlang with a newer gcc version. Thanks for helping with this.
>>
>> Thanks & Regards,
>> Pooja
>>
>> On Fri, May 15, 2020 at 3:20 AM Mikael Pettersson <mikpelinux@REDACTED> wrote:
>>>
>>> On Thu, May 14, 2020 at 12:09 PM Mikael Pettersson <mikpelinux@REDACTED> wrote:
>>> >
>>> > On Thu, May 14, 2020 at 9:32 AM Pooja Desai <pooja.desai10@REDACTED> wrote:
>>> > >
>>> > > Hi Mikael,
>>> > >
>>> > >
>>> > > Please find the files you requested attached as erl_files.tar.gz (compressed because of mail size limits).
>>> > >
>>> > > Normal build option is:
>>> > >
>>> > > # gcc  -Werror=undef -Werror=implicit -Werror=return-type  -m64 -g  -O3 -fomit-frame-pointer -Ierlang/src/solaris/otp/erts/sparc-sun-solaris2.10  -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -fno-tree-copyrename   -DHAVE_CONFIG_H -Wall -Wstrict-prototypes -Wmissing-prototypes -Wdeclaration-after-statement -DUSE_THREADS -D_THREAD_SAFE -D_REENTRANT -DPOSIX_THREADS -D_POSIX_PTHREAD_SEMANTICS   -Isparc-sun-solaris2.10/opt/smp -Ibeam -Isys/unix -Isys/common -Isparc-sun-solaris2.10 -Izlib  -Ipcre -Ihipe -I../include -I../include/sparc-sun-solaris2.10 -I../include/internal -I../include/internal/sparc-sun-solaris2.10 -c beam/erl_alloc_util.c -o obj/sparc-sun-solaris2.10/opt/smp/erl_alloc_util.o
>>> > >
>>> > > Following your suggestion I changed it as below to generate the erl_alloc_util.i file:
>>> > >
>>> > > # gcc  -Werror=undef -Werror=implicit -Werror=return-type  -m64 -g  -O3 -fomit-frame-pointer -Ierlang/src/solaris/otp/erts/sparc-sun-solaris2.10  -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -fno-tree-copyrename   -DHAVE_CONFIG_H -Wall -Wstrict-prototypes -Wmissing-prototypes -Wdeclaration-after-statement -DUSE_THREADS -D_THREAD_SAFE -D_REENTRANT -DPOSIX_THREADS -D_POSIX_PTHREAD_SEMANTICS   -Isparc-sun-solaris2.10/opt/smp -Ibeam -Isys/unix -Isys/common -Isparc-sun-solaris2.10 -Izlib  -Ipcre -Ihipe -I../include -I../include/sparc-sun-solaris2.10 -I../include/internal -I../include/internal/sparc-sun-solaris2.10 -E beam/erl_alloc_util.c -o obj/sparc-sun-solaris2.10/opt/smp/erl_alloc_util.i
>>> > >
>>> > > One thing I forgot to mention: we are using gcc version 4.9.2 (GCC) to build on Solaris SPARC, as Erlang doesn't support Sun's native compiler.
>>> >
>>> > I've been able to reproduce the non-atomic code for those 64-bit loads
>>> > in cpool_insert() using gcc-4.9 cross compilers to sparc64-linux, but
>>> > gcc-5.5/6.5/7.5/8.4/9.3 all emit correct code as far as I can tell.
>>> >
>>> > So the solution is to upgrade your gcc (I suggest 9.3.0) and rebuild
>>> > your Erlang/OTP VM with that.
>>> >
>>> > /Mikael
>>>
>>> I created a reduced test case from erl_alloc.i, and it turns out
>>> Erlang/OTP was hit by
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70424, which affects
>>> gcc-4.9 (all versions) and gcc-5.x (x < 4), on all strict-alignment
>>> targets.
>>>
>>> So the recommendation stands: upgrade your gcc.
>>>
>>> > > Thanks & Regards,
>>> > > Pooja
>>> > >
>>> > > On Tue, May 12, 2020 at 10:44 PM Mikael Pettersson <mikpelinux@REDACTED> wrote:
>>> > >>
>>> > >> On Tue, May 12, 2020 at 4:18 PM Pooja Desai <pooja.desai10@REDACTED> wrote:
>>> > >> >
>>> > >> > Hi,
>>> > >> >
>>> > >> >
>>> > >> >
>>> > >> > Thanks for the response, Mikael.
>>> > >> >
>>> > >> > As per your suggestion I am trying to write similar code to determine whether there is an issue with the Solaris SPARC compiler.
>>> > >> >
>>> > >> >
>>> > >> >
>>> > >> > But I have some doubts:
>>> > >> >
>>> > >> > 1.     If there is a problem with the compiler, then we should see this crash everywhere else as well. Any idea why it's only reproduced here?
>>> > >> >
>>> > >> > 2.     As I understand your explanation, it reads 64 bits by assembling two adjacent 32-bit fields. Will that really cause a problem in a multi-threaded program? When context switching to another thread, the OS saves the current context of the thread (and hence its registers) and restores it when the thread becomes active again.
>>> > >> >
>>> > >> >
>>> > >>
>>> > >> Breaking up a 64-bit load into two 32-bit loads loses atomicity with
>>> > >> any concurrent store into that location, meaning the read may end up
>>> > >> observing a result composed of 32 bits from the old value and 32 bits
>>> > >> from the newly stored value, whereas the code expects to see either
>>> > >> the old or the new, but never this mixture.  This can happen also on a
>>> > >> single-threaded CPU with preemptive multitasking.
>>> > >>
>>> > >> To move forward on the issue, I think you need to recreate the
>>> > >> pre-processed source for erl_alloc_util.c.  To do that:
>>> > >> 1. Compile Erlang/OTP as usual, starting from a pristine source
>>> > >> directory (no left-overs from a previous build, best is to start fresh
>>> > >> somewhere), but pass "V=1" to make.  Save the output from "make" in a
>>> > >> file.
>>> > >> 2. Note the step where it compiles erl_alloc_util.c.
>>> > >> 3. Reexecute that step, but replace any "-c" with "-E" and "-o
>>> > >> erl_alloc_util.o" with "-o erl_alloc_util.i".
>>> > >> 4. Please send this ".i" file, together with the exact build steps and
>>> > >> configuration options you used, and
>>> > >> "erts/sparc-sun-solaris11/config.h" (I'm guessing the file name here)
>>> > >> to me.
>>> > >>
>>> > >> My theory is that Erlang/OTP selects the wrong low-level primitives
>>> > >> for this platform.
>>> > >>
>>> > >>
>>> > >> >
>>> > >> >
>>> > >> > Thanks & Regards,
>>> > >> >
>>> > >> > Pooja
>>> > >> >
>>> > >> >
>>> > >> > On Mon, May 11, 2020 at 10:36 PM Mikael Pettersson <mikpelinux@REDACTED> wrote:
>>> > >> >>
>>> > >> >> Hello Pooja,
>>> > >> >>
>>> > >> >> On Mon, May 11, 2020 at 8:10 AM Pooja Desai <pooja.desai10@REDACTED> wrote:
>>> > >> >> >
>>> > >> >> > Hi,
>>> > >> >> >
>>> > >> >> > I am facing an Erlang core dump issue on a Solaris SPARC setup while running RabbitMQ.
>>> > >> >>
>>> > >> >> This looks like a 64-bit build, but the code doesn't look similar to
>>> > >> >> what I get with gcc-9.3, so I'm assuming you used Sun's compiler?
>>> > >> >>
>>> > >> >>
>>> > >> >> > (dbx) where
>>> > >> >> >
>>> > >> >> > =>[1] cpool_insert(0x1004efd40, 0xffffffff75600000, 0x61850, 0xffffffff75600018, 0x90f, 0x1004effd0), at 0x10006db14
>>> > >> >> >
>>> > >> >> >   [2] abandon_carrier(0x1004efd40, 0xffffffff75600000, 0xffffffff75645ec0, 0xffffffff77d03818, 0x0, 0x6), at 0x10006de3c
>>> > >> >> >
>>> > >> >> >   [3] 17(0x1004efd40, 0xcb3, 0x2, 0xffffffff75645e60, 0x0, 0x1004efd40), at 0x10006e958
>>> > >> >> >
>>> > >> >> >   [4] erts_alcu_check_delayed_dealloc(0x1004efd40, 0x1, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x8000000000000007), at 0x100075244
>>> > >> >> >
>>> > >> >> >   [5] erts_alloc_scheduler_handle_delayed_dealloc(0xffffffff3a82a620, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x100464, 0xffffffff3a82a5d0), at 0x1000622c0
>>> > >> >> >
>>> > >> >> >   [6] handle_aux_work(0xffffffff3a8204a0, 0x2, 0x1, 0x2, 0x100400, 0x4e5ce123), at 0x1002a6044
>>> > >> >> >
>>> > >> >> >   [7] erts_schedule(0xffffffff3a820380, 0x9, 0x9, 0xffffffff3a81fc80, 0x2, 0x2), at 0x1002a3040
>>> > >> >> >
>>> > >> >> >   [8] process_main(0x100469, 0xffffffff3a302240, 0xfa0, 0x802a, 0xffffffff38f00438, 0x3), at 0x1002901bc
>>> > >> >> >
>>> > >> >> >   [9] sched_thread_func(0xffffffff3a820380, 0x0, 0x0, 0xffffffff7a911240, 0x100000, 0x1), at 0x100038f08
>>> > >> >> >
>>> > >> >> >   [10] thr_wrapper(0xffffffff7fffc278, 0x0, 0x0, 0x100289d48, 0xffffffff3a820380, 0x100038da0), at 0x100289dc8
>>> > >> >> >
>>> > >> >> >
>>> > >> >> >
>>> > >> >> > This issue is extremely intermittent, so I am not able to reproduce it with a debug build. On our test setup I have seen this core twice, and only on the Solaris SPARC server; on other servers (RHEL, SUSE Linux, Solaris x86, Windows etc.) with a similar test environment things are working fine.
>>> > >> >> >
>>> > >> >> > In both instances where I hit this issue we were restarting the RabbitMQ server, i.e. stopping RabbitMQ and epmd and then running the startup script for RabbitMQ. The script performs two operations:
>>> > >> >> >
>>> > >> >> > First it pings RabbitMQ using "rabbitmqctl ping" to confirm RabbitMQ is not already running (I guess in the background this will also start epmd), and then it starts rabbitmq-server in detached mode.
>>> > >> >> >
>>> > >> >> > The core is generated while starting this daemon.
>>> > >> >> >
>>> > >> >> >
>>> > >> >> > I checked the code around abandon_carrier ("https://github.com/erlang/otp/blame/master/erts/emulator/beam/erl_alloc_util.c") but nothing has changed in that area recently, so I am really clueless about this situation.
>>> > >> >> >
>>> > >> >> > Please let me know if anyone has faced a similar issue in the past or has any idea about this. I am using OTP version 22.2 and RabbitMQ version 3.7.23.
>>> > >> >> >
>>> > >> >> > Let me know if any further information is required. I am pasting the full core dump information below:
>>> > >> >> >
>>> > >> >> > debugging core file of beam.smp (64-bit) from hostname01
>>> > >> >> > file: temp_dir/erlang/erts-10.6/bin/beam.smp
>>> > >> >> > initial argv:
>>> > >> >> > /temp_dir/erlang/erts-10.6/bin/beam.smp -- -root /temp_dir/
>>> > >> >> > threading model: native threads
>>> > >> >> > status: process terminated by SIGSEGV (Segmentation Fault), addr=ffffffff004631b0
>>> > >> >>
>>> > >> >> Ok, this tells us the address was unmapped.  (It's not an alignment
>>> > >> >> fault, another common issue on SPARC.)
>>> > >> >>
>>> > >> >>
>>> > >> >> >
>>> > >> >> > C++ symbol demangling enabled
>>> > >> >> >
>>> > >> >> > # stack
>>> > >> >> >
>>> > >> >> > cpool_insert+0xd0(10051c500, ffffffff7a400000, ffffffff7a441de8, ffffffff7c903818, 0, 23)
>>> > >> >> > dealloc_block.part.17+0x1c0(10051c500, cb3, 2, ffffffff7a441d88, 0, 10051c500)
>>> > >> >> > erts_alcu_check_delayed_dealloc+0xe4(10051c500, 1, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 8000000000000007)
>>> > >> >> > erts_alloc_scheduler_handle_delayed_dealloc+0x34(ffffffff3b729c20, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 100464, ffffffff3b729bd0)
>>> > >> >> > handle_aux_work+0xa50(ffffffff3b71faa0, 402, 1, 402, 100400, 42da0c68)
>>> > >> >> > erts_schedule+0x192c(ffffffff3b71f980, 9, 9, ffffffff3b71f280, 402, 2)
>>> > >> >> > process_main+0xc4(100469, ffffffff3b202240, fa0, ffffffff3b71f980, 241, 100294204)
>>> > >> >> > sched_thread_func+0x168(ffffffff3b71f980, 0, 0, ffffffff39401a40, 100000, 1)
>>> > >> >> > thr_wrapper+0x80(ffffffff7fffb318, 0, 0, 100289d48, ffffffff3b71f980, 100038da0)
>>> > >> >> > libc.so.1`_lwp_start(0, 0, 0, 0, 0, 0)
>>> > >> >> >
>>> > >> >> > #############################################################################
>>> > >> >> >
>>> > >> >> > # registers
>>> > >> >> >
>>> > >> >> > %g0 = 0x0000000000000000                 %l0 = 0xffffffff7a4307a0
>>> > >> >> > %g1 = 0xffffffff004631a1                 %l1 = 0x0000000000000000
>>> > >> >> > %g2 = 0x0000000000000000                 %l2 = 0x0000000000000000
>>> > >> >> > %g3 = 0x000000010051c798                 %l3 = 0x0000000000000000
>>> > >> >> > %g4 = 0xffffffff004631a0                 %l4 = 0x0000000000000000
>>> > >> >> > %g5 = 0x00000001004631a0 beam.smp`firstfit_carrier_pool %l5 = 0x0000000000000000
>>> > >> >>
>>> > >> >> This is interesting.  Notice how the low 32 bits 004631a0 show up in
>>> > >> >> three variations:
>>> > >> >> 1. 00000001004631a0 beam.smp`firstfit_carrier_pool (the address of the
>>> > >> >> firstfit_carrier_pool global variable)
>>> > >> >> 2. ffffffff004631a0 (the above, but with the high 32 bits replaced
>>> > >> >> with all-bits-one)
>>> > >> >> 3. ffffffff004631a1 (the above, but with a tag in the low bit)
>>> > >> >>
>>> > >> >> > %g6 = 0x0000000000000000                 %l6 = 0x0000000000000000
>>> > >> >> > %g7 = 0xffffffff39401a40                 %l7 = 0x0000000000000000
>>> > >> >> > %o0 = 0x000000010051c500                 %i0 = 0x000000010051c500
>>> > >> >> > %o1 = 0xffffffff7a400000                 %i1 = 0xffffffff7a400000
>>> > >> >> > %o2 = 0x00000000000676c0                 %i2 = 0xffffffff7a441de8
>>> > >> >> > %o3 = 0xffffffff7a400018                 %i3 = 0xffffffff7c903818
>>> > >> >> > %o4 = 0x00000000000007b9                 %i4 = 0x0000000000000000
>>> > >> >> > %o5 = 0x000000010051c790                 %i5 = 0x0000000000000023
>>> > >> >> > %o6 = 0xffffffff7c902eb1                 %i6 = 0xffffffff7c902f61
>>> > >> >> > %o7 = 0x000000010006de3c abandon_carrier+0x118 %i7 = 0x000000010006e958 dealloc_block.part.17+0x1c0
>>> > >> >> >
>>> > >> >> >  %ccr = 0x44 xcc=nZvc icc=nZvc
>>> > >> >> >    %y = 0x0000000000000000
>>> > >> >> >   %pc = 0x000000010006db14 cpool_insert+0xd0
>>> > >> >> >  %npc = 0x000000010006db18 cpool_insert+0xd4
>>> > >> >> >   %sp = 0xffffffff7c902eb1
>>> > >> >> >   %fp = 0xffffffff7c902f61
>>> > >> >> >
>>> > >> >> >  %asi = 0x82
>>> > >> >> > %fprs = 0x00
>>> > >> >> >
>>> > >> >> > # disassembly around pc
>>> > >> >> >
>>> > >> >> > cpool_insert+0xa8:              mov       %g1, %g2
>>> > >> >> > cpool_insert+0xac:              ldx       [%g5 + 0x10], %g1
>>> > >> >> > cpool_insert+0xb0:              membar    #LoadLoad|#LoadStore
>>> > >> >> > cpool_insert+0xb4:              ba,pt     %xcc, +0x1c   <cpool_insert+0xd0>
>>> > >> >> > cpool_insert+0xb8:              and       %g1, -0x4, %g4
>>> > >> >>
>>> > >> >> > cpool_insert+0xbc:              membar    #LoadLoad|#LoadStore
>>> > >> >> > cpool_insert+0xc0:              and       %g2, 0x3, %g3
>>> > >> >> > cpool_insert+0xc4:              brz,pn    %g3, +0x1ec   <cpool_insert+0x2b0>
>>> > >> >> > cpool_insert+0xc8:              mov       %g2, %g1
>>> > >> >> > cpool_insert+0xcc:              and       %g1, -0x4, %g4
>>> > >> >> > cpool_insert+0xd0:              ld        [%g4 + 0x10], %g1
>>> > >> >>
>>> > >> >> This is the faulting instruction. We're in the /* Find a predecessor
>>> > >> >> to be, and set mod marker on its next ptr */ loop.
>>> > >> >>
>>> > >> >> > cpool_insert+0xd4:              ld        [%g4 + 0x14], %g2
>>> > >> >> > cpool_insert+0xd8:              sllx      %g1, 0x20, %g1
>>> > >> >> > cpool_insert+0xdc:              cmp       %g5, %g4
>>> > >> >> > cpool_insert+0xe0:              bne,pt    %xcc, -0x24   <cpool_insert+0xbc>
>>> > >> >> > cpool_insert+0xe4:              or        %g2, %g1, %g2
>>> > >> >>
>>> > >> >> The above reads a 64-bit "->next" pointer by assembling two adjacent
>>> > >> >> 32-bit fields.  Weird, but arithmetically Ok.
>>> > >> >>
>>> > >> >> Two things strike me:
>>> > >> >> 1. The compiler implements "atomic load of 64-bits" as "load 32 bits,
>>> > >> >> load another 32 bits, combine", which isn't correct in a multithreaded
>>> > >> >> program.  The error could be in the compiler, or in the source code.
>>> > >> >> 2. In the register dump it was obvious that the high bits of an
>>> > >> >> address had been clobbered.
>>> > >> >>
>>> > >> >> My suspicion is that either Sun's compiler is buggy, or Erlang is
>>> > >> >> selecting non thread-safe code in this case.
>>> > >> >>
>>> > >> >> On SPARC64 Linux w/ GCC I get very different code that uses "ldx" for
>>> > >> >> those 64-bit loads, as expected.
>>> > >> >>
>>> > >> >> /Mikael


More information about the erlang-questions mailing list