erlang (rabbitmq) generating core on Solaris SPARC

Pooja Desai pooja.desai10@REDACTED
Wed May 20 13:06:31 CEST 2020


Ok, thanks for explaining and helping with this issue.

Thanks & Regards,
Pooja

On Wed, May 20, 2020, 4:27 PM Mikael Pettersson <mikpelinux@REDACTED>
wrote:

> On Tue, May 19, 2020 at 6:24 PM Pooja Desai <pooja.desai10@REDACTED>
> wrote:
> >
> > Hi Mikael,
> >
> > The gcc bug mentioned above is not specific to any platform, but the
> > problematic disassembly is only generated for Solaris SPARC. Any idea
> > why only Erlang on Solaris SPARC is affected by this?
>
> As I wrote, the bug affects all strict-alignment targets, and SPARC is
> one of those.  Most older RISC designs are strict-alignment.
> x86 is not strict-alignment for general purpose instructions, but some
> of its vector instructions are.
>
> /Mikael
>
> > Actually, to minimise the impact on testing/sock, we are thinking about
> > rebuilding Erlang only on Solaris SPARC for now, as the issue is only
> > seen on that platform. In your expert opinion, do you see any problem
> > with this approach?
> >
> > Thanks & Regards,
> > Pooja
> >
> > On Fri, May 15, 2020 at 1:51 PM Pooja Desai <pooja.desai10@REDACTED>
> wrote:
> >>
> >> Thanks Mikael,
> >>
> >> As per your suggestion I am rebuilding erlang with newer gcc version.
> Thanks for helping with this.
> >>
> >> Thanks & Regards,
> >> Pooja
> >>
> >> On Fri, May 15, 2020 at 3:20 AM Mikael Pettersson <mikpelinux@REDACTED>
> wrote:
> >>>
> >>> On Thu, May 14, 2020 at 12:09 PM Mikael Pettersson <
> mikpelinux@REDACTED> wrote:
> >>> >
> >>> > On Thu, May 14, 2020 at 9:32 AM Pooja Desai <pooja.desai10@REDACTED>
> wrote:
> >>> > >
> >>> > > Hi Mikael,
> >>> > >
> >>> > >
> >>> > > Please find the files you requested in the attachment
> >>> > > erl_files.tar.gz (compressed because of mail size limits).
> >>> > >
> >>> > > Normal build option is:
> >>> > >
> >>> > > # gcc  -Werror=undef -Werror=implicit -Werror=return-type  -m64
> -g  -O3 -fomit-frame-pointer
> -Ierlang/src/solaris/otp/erts/sparc-sun-solaris2.10  -D_LARGEFILE_SOURCE
> -D_FILE_OFFSET_BITS=64 -fno-tree-copyrename   -DHAVE_CONFIG_H -Wall
> -Wstrict-prototypes -Wmissing-prototypes -Wdeclaration-after-statement
> -DUSE_THREADS -D_THREAD_SAFE -D_REENTRANT -DPOSIX_THREADS
> -D_POSIX_PTHREAD_SEMANTICS   -Isparc-sun-solaris2.10/opt/smp -Ibeam
> -Isys/unix -Isys/common -Isparc-sun-solaris2.10 -Izlib  -Ipcre -Ihipe
> -I../include -I../include/sparc-sun-solaris2.10 -I../include/internal
> -I../include/internal/sparc-sun-solaris2.10 -c beam/erl_alloc_util.c -o
> obj/sparc-sun-solaris2.10/opt/smp/erl_alloc_util.o
> >>> > >
> >>> > > after your suggestion I updated it as below to generate
> erl_alloc_util file:
> >>> > >
> >>> > > # gcc  -Werror=undef -Werror=implicit -Werror=return-type  -m64
> -g  -O3 -fomit-frame-pointer
> -Ierlang/src/solaris/otp/erts/sparc-sun-solaris2.10  -D_LARGEFILE_SOURCE
> -D_FILE_OFFSET_BITS=64 -fno-tree-copyrename   -DHAVE_CONFIG_H -Wall
> -Wstrict-prototypes -Wmissing-prototypes -Wdeclaration-after-statement
> -DUSE_THREADS -D_THREAD_SAFE -D_REENTRANT -DPOSIX_THREADS
> -D_POSIX_PTHREAD_SEMANTICS   -Isparc-sun-solaris2.10/opt/smp -Ibeam
> -Isys/unix -Isys/common -Isparc-sun-solaris2.10 -Izlib  -Ipcre -Ihipe
> -I../include -I../include/sparc-sun-solaris2.10 -I../include/internal
> -I../include/internal/sparc-sun-solaris2.10 -E beam/erl_alloc_util.c -o
> obj/sparc-sun-solaris2.10/opt/smp/erl_alloc_util.i
> >>> > >
> >>> > > One thing I forgot to mention: we are using gcc version 4.9.2
> >>> > > (GCC) for building on Solaris SPARC, as Erlang doesn't support
> >>> > > Sun's native compiler.
> >>> >
> >>> > I've been able to reproduce the non-atomic code for those 64-bit
> loads
> >>> > in cpool_insert() using gcc-4.9 cross compilers to sparc64-linux, but
> >>> > gcc-5.5/6.5/7.5/8.4/9.3 all emit correct code as far as I can tell.
> >>> >
> >>> > So the solution is to upgrade your gcc (I suggest 9.3.0) and rebuild
> >>> > your Erlang/OTP VM with that.
> >>> >
> >>> > /Mikael
> >>>
> >>> I created a reduced test case from erl_alloc.i, and it turns out
> >>> Erlang/OTP was hit by
> >>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70424, which affects
> >>> gcc-4.9 (all versions) and gcc-5.x (x < 4), on all strict-alignment
> >>> targets.
> >>>
> >>> So the recommendation stands: upgrade your gcc.
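
A minimal sketch of the version range named above, written as a C helper. It encodes only what this message states (every gcc-4.9.x, and gcc-5.x with x < 4); the function names are made up for illustration, and `__GNUC__` / `__GNUC_MINOR__` are the macros GCC predefines for its own version.

```c
#include <assert.h>
#include <stdbool.h>

/* True for the GCC releases this thread names as affected by bug 70424:
   every 4.9.x, and 5.x with x < 4. Other releases are not judged here. */
bool gcc_version_affected(int major, int minor)
{
    if (major == 4 && minor == 9)
        return true;
    if (major == 5 && minor < 4)
        return true;
    return false;
}

/* Applies the check to the compiler building this file, if it is GCC. */
bool this_compiler_affected(void)
{
#ifdef __GNUC__
    return gcc_version_affected(__GNUC__, __GNUC_MINOR__);
#else
    return false;
#endif
}

/* Hypothetical self-check using versions mentioned in the thread. */
void check_versions(void)
{
    assert(gcc_version_affected(4, 9));   /* the 4.9.2 used for the build */
    assert(!gcc_version_affected(9, 3));  /* the suggested upgrade target */
}
```

The range check alone does not say whether a given build was miscompiled; it only flags compilers that fall inside the range quoted from the bug report, and only matters on strict-alignment targets.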
> >>>
> >>> > > Thanks & Regards,
> >>> > > Pooja
> >>> > >
> >>> > > On Tue, May 12, 2020 at 10:44 PM Mikael Pettersson <
> mikpelinux@REDACTED> wrote:
> >>> > >>
> >>> > >> On Tue, May 12, 2020 at 4:18 PM Pooja Desai <
> pooja.desai10@REDACTED> wrote:
> >>> > >> >
> >>> > >> > Hi,
> >>> > >> >
> >>> > >> >
> >>> > >> >
> >>> > >> > Thanks for response Mikael
> >>> > >> >
> >>> > >> > As per your suggestion I am trying to write similar code to
> conclude if there is some issue with Solaris SPARC compiler.
> >>> > >> >
> >>> > >> >
> >>> > >> >
> >>> > >> > But I have some doubts,
> >>> > >> >
> >>> > >> > 1.     If there is a problem with the compiler, then we should
> >>> > >> > see this crash everywhere else too. Any idea why it's only
> >>> > >> > reproduced here?
> >>> > >> >
> >>> > >> > 2.     As I understand your explanation, it reads 64 bits by
> >>> > >> > assembling two adjacent 32-bit fields. Will that really cause a
> >>> > >> > problem in a multi-threaded program? When switching to another
> >>> > >> > thread, the OS saves the current thread's context (and hence its
> >>> > >> > registers) and restores it when the thread becomes active again.
> >>> > >> >
> >>> > >> >
> >>> > >>
> >>> > >> Breaking up a 64-bit load into two 32-bit loads loses atomicity
> with
> >>> > >> any concurrent store into that location, meaning the read may end
> up
> >>> > >> observing a result composed of 32 bit from the old value and 32
> bit
> >>> > >> from the newly stored value, whereas the code expects to see
> either
> >>> > >> the old or the new, but never this mixture.  This can happen also
> on a
> >>> > >> single-threaded CPU with preemptive multitasking.
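
The mixture described above can be sketched as a deterministic, single-threaded simulation in C. This is not the Erlang/OTP code: the struct, the function names and the "old" pointer value are assumptions chosen for illustration. The high half is read first, as a big-endian SPARC would do with two `ld` instructions, and a store of a new value lands between the two loads.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 64-bit field held as two 32-bit halves, in big-endian
   order as on SPARC: high half first, low half second. */
struct split_word {
    uint32_t hi;
    uint32_t lo;
};

static void store64(struct split_word *w, uint64_t v)
{
    w->hi = (uint32_t)(v >> 32);
    w->lo = (uint32_t)v;
}

/* Simulates a reader doing two 32-bit loads while a writer replaces
   old_val with new_val between them. */
uint64_t torn_read(uint64_t old_val, uint64_t new_val)
{
    struct split_word w;
    uint32_t hi, lo;

    store64(&w, old_val);
    hi = w.hi;               /* first load still sees the old value */
    store64(&w, new_val);    /* the concurrent store lands here */
    lo = w.lo;               /* second load sees the new value */
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical self-check: the old value is invented, the new value is
   the tagged firstfit_carrier_pool address from the register dump. */
void check_torn_read(void)
{
    assert(torn_read(0xffffffff7a400019ULL, 0x00000001004631a1ULL)
           == 0xffffffff004631a1ULL);
}
```

With those inputs the result is neither the old nor the new value, and it equals the value in %g1 in the register dump quoted further down. Saving and restoring registers on a context switch does not help, because the two halves were already read from memory at different times.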
> >>> > >>
> >>> > >> To move forward on the issue, I think you need to recreate the
> >>> > >> pre-processed source for erl_alloc_util.c.  To do that:
> >>> > >> 1. Compile Erlang/OTP as usual, starting from a pristine source
> >>> > >> directory (no left-overs from a previous build, best is to start
> fresh
> >>> > >> somewhere), but pass "V=1" to make.  Save the output from "make"
> in a
> >>> > >> file.
> >>> > >> 2. Note the step where it compiles erl_alloc_util.c.
> >>> > >> 3. Reexecute that step, but replace any "-c" with "-E" and "-o
> >>> > >> erl_alloc_util.o" with "-o erl_alloc_util.i".
> >>> > >> 4. Please send this ".i" file, together with the exact build
> steps and
> >>> > >> configuration options you used, and
> >>> > >> "erts/sparc-sun-solaris11/config.h" (I'm guessing the file name
> here)
> >>> > >> to me.
> >>> > >>
> >>> > >> My theory is that Erlang/OTP selects the wrong low-level
> primitives
> >>> > >> for this platform.
> >>> > >>
> >>> > >>
> >>> > >> >
> >>> > >> >
> >>> > >> > Thanks & Regards,
> >>> > >> >
> >>> > >> > Pooja
> >>> > >> >
> >>> > >> >
> >>> > >> > On Mon, May 11, 2020 at 10:36 PM Mikael Pettersson <
> mikpelinux@REDACTED> wrote:
> >>> > >> >>
> >>> > >> >> Hello Pooja,
> >>> > >> >>
> >>> > >> >> On Mon, May 11, 2020 at 8:10 AM Pooja Desai <
> pooja.desai10@REDACTED> wrote:
> >>> > >> >> >
> >>> > >> >> > Hi,
> >>> > >> >> >
> >>> > >> >> > Facing erlang core issue on solaris SPARC setup while
> running RabbitMQ
> >>> > >> >>
> >>> > >> >> This looks like a 64-bit build, but the code doesn't look
> similar to
> >>> > >> >> what I get with gcc-9.3, so I'm assuming you used Sun's
> compiler?
> >>> > >> >>
> >>> > >> >>
> >>> > >> >> > (dbx) where
> >>> > >> >> >
> >>> > >> >> > =>[1] cpool_insert(0x1004efd40, 0xffffffff75600000, 0x61850,
> 0xffffffff75600018, 0x90f, 0x1004effd0), at 0x10006db14
> >>> > >> >> >
> >>> > >> >> >   [2] abandon_carrier(0x1004efd40, 0xffffffff75600000,
> 0xffffffff75645ec0, 0xffffffff77d03818, 0x0, 0x6), at 0x10006de3c
> >>> > >> >> >
> >>> > >> >> >   [3] 17(0x1004efd40, 0xcb3, 0x2, 0xffffffff75645e60, 0x0,
> 0x1004efd40), at 0x10006e958
> >>> > >> >> >
> >>> > >> >> >   [4] erts_alcu_check_delayed_dealloc(0x1004efd40, 0x1,
> 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44,
> 0x8000000000000007), at 0x100075244
> >>> > >> >> >
> >>> > >> >> >   [5]
> erts_alloc_scheduler_handle_delayed_dealloc(0xffffffff3a82a620,
> 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x100464,
> 0xffffffff3a82a5d0),
> >>> > >> >> >
> >>> > >> >> > at 0x1000622c0
> >>> > >> >> >
> >>> > >> >> >   [6] handle_aux_work(0xffffffff3a8204a0, 0x2, 0x1, 0x2,
> 0x100400, 0x4e5ce123), at 0x1002a6044
> >>> > >> >> >
> >>> > >> >> >   [7] erts_schedule(0xffffffff3a820380, 0x9, 0x9,
> 0xffffffff3a81fc80, 0x2, 0x2), at 0x1002a3040
> >>> > >> >> >
> >>> > >> >> >   [8] process_main(0x100469, 0xffffffff3a302240, 0xfa0,
> 0x802a, 0xffffffff38f00438, 0x3), at 0x1002901bc
> >>> > >> >> >
> >>> > >> >> >   [9] sched_thread_func(0xffffffff3a820380, 0x0, 0x0,
> 0xffffffff7a911240, 0x100000, 0x1), at 0x100038f08
> >>> > >> >> >
> >>> > >> >> >   [10] thr_wrapper(0xffffffff7fffc278, 0x0, 0x0,
> 0x100289d48, 0xffffffff3a820380, 0x100038da0), at 0x100289dc8
> >>> > >> >> >
> >>> > >> >> >
> >>> > >> >> >
> >>> > >> >> > This issue is extremely intermittent, so I am not able to
> >>> > >> >> > reproduce it with a debug build. On our test setup I have seen
> >>> > >> >> > this core twice, and only on the Solaris SPARC server. Other
> >>> > >> >> > servers (RHEL, SUSE Linux, Solaris x86, Windows etc.) with a
> >>> > >> >> > similar test environment are working fine.
> >>> > >> >> >
> >>> > >> >> > In both instances when I hit this issue we were restarting the
> >>> > >> >> > RabbitMQ server, i.e. stopping RabbitMQ and epmd and then running
> >>> > >> >> > the startup script for RabbitMQ. The script performs two operations:
> >>> > >> >> >
> >>> > >> >> > First it pings RabbitMQ using "rabbitmqctl ping" to confirm that
> >>> > >> >> > RabbitMQ is not already running (I guess this also starts epmd in
> >>> > >> >> > the background), and then it starts rabbitmq-server in detached mode.
> >>> > >> >> >
> >>> > >> >> > The core is generated while starting this daemon.
> >>> > >> >> >
> >>> > >> >> >
> >>> > >> >> > I checked the code around abandon_carrier
> >>> > >> >> > (https://github.com/erlang/otp/blame/master/erts/emulator/beam/erl_alloc_util.c)
> >>> > >> >> > but nothing has changed in that area recently, so I am really clueless.
> >>> > >> >> >
> >>> > >> >> > Please let me know if anyone has faced a similar issue in the past or
> >>> > >> >> > has any idea about this. I am using OTP version 22.2 and RabbitMQ
> >>> > >> >> > version 3.7.23.
> >>> > >> >> >
> >>> > >> >> > Let me know if any further information is required. Pasting the
> >>> > >> >> > full core dump information below:
> >>> > >> >> >
> >>> > >> >> > debugging core file of beam.smp (64-bit) from hostname01
> >>> > >> >> > file: temp_dir/erlang/erts-10.6/bin/beam.smp
> >>> > >> >> > initial argv:
> >>> > >> >> > /temp_dir/erlang/erts-10.6/bin/beam.smp -- -root /temp_dir/
> >>> > >> >> > threading model: native threads
> >>> > >> >> > status: process terminated by SIGSEGV (Segmentation Fault),
> addr=
> >>> > >> >> > ffffffff004631b0
> >>> > >> >>
> >>> > >> >> Ok, this tells us the address was unmapped.  (It's not an
> alignment
> >>> > >> >> fault, another common issue on SPARC.)
> >>> > >> >>
> >>> > >> >>
> >>> > >> >> >
> >>> > >> >> > C++ symbol demangling enabled
> >>> > >> >> >
> >>> > >> >> > # stack
> >>> > >> >> >
> >>> > >> >> > cpool_insert+0xd0(10051c500, ffffffff7a400000,
> ffffffff7a441de8, ffffffff7c903818, 0, 23)
> >>> > >> >> > dealloc_block.part.17+0x1c0(10051c500, cb3, 2,
> ffffffff7a441d88, 0, 10051c500)
> >>> > >> >> > erts_alcu_check_delayed_dealloc+0xe4(10051c500, 1,
> ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 8000000000000007)
> >>> > >> >> >
> erts_alloc_scheduler_handle_delayed_dealloc+0x34(ffffffff3b729c20,
> ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 100464,
> ffffffff3b729bd0)
> >>> > >> >> > handle_aux_work+0xa50(ffffffff3b71faa0, 402, 1, 402, 100400,
> 42da0c68)
> >>> > >> >> > erts_schedule+0x192c(ffffffff3b71f980, 9, 9,
> ffffffff3b71f280, 402, 2)
> >>> > >> >> > process_main+0xc4(100469, ffffffff3b202240, fa0,
> ffffffff3b71f980, 241, 100294204)
> >>> > >> >> > sched_thread_func+0x168(ffffffff3b71f980, 0, 0,
> ffffffff39401a40, 100000, 1)
> >>> > >> >> > thr_wrapper+0x80(ffffffff7fffb318, 0, 0, 100289d48,
> ffffffff3b71f980, 100038da0)
> >>> > >> >> > libc.so.1`_lwp_start(0, 0, 0, 0, 0, 0)
> >>> > >> >> >
> >>> > >> >> >
> #############################################################################
> >>> > >> >> >
> >>> > >> >> > # registers
> >>> > >> >> >
> >>> > >> >> > %g0 = 0x0000000000000000                 %l0 =
> 0xffffffff7a4307a0
> >>> > >> >> > %g1 = 0xffffffff004631a1                 %l1 =
> 0x0000000000000000
> >>> > >> >> > %g2 = 0x0000000000000000                 %l2 =
> 0x0000000000000000
> >>> > >> >> > %g3 = 0x000000010051c798                 %l3 =
> 0x0000000000000000
> >>> > >> >> > %g4 = 0xffffffff004631a0                 %l4 =
> 0x0000000000000000
> >>> > >> >> > %g5 = 0x00000001004631a0 beam.smp`firstfit_carrier_pool %l5
> = 0x0000000000000000
> >>> > >> >>
> >>> > >> >> This is interesting.  Notice how the low 32-bits 004631a0 show
> up in
> >>> > >> >> three variations:
> >>> > >> >> 1. 00000001004631a0 beam.smp`firstfit_carrier_pool (the
> address of the
> >>> > >> >> firstfit_carrier_pool global variable)
> >>> > >> >> 2. ffffffff004631a0 (the above, but with the high 32 bits
> replaced
> >>> > >> >> with all-bits-one)
> >>> > >> >> 3. ffffffff004631a1 (the above, but with a tag in the low bit)
> >>> > >> >>
> >>> > >> >> > %g6 = 0x0000000000000000                 %l6 =
> 0x0000000000000000
> >>> > >> >> > %g7 = 0xffffffff39401a40                 %l7 =
> 0x0000000000000000
> >>> > >> >> > %o0 = 0x000000010051c500                 %i0 =
> 0x000000010051c500
> >>> > >> >> > %o1 = 0xffffffff7a400000                 %i1 =
> 0xffffffff7a400000
> >>> > >> >> > %o2 = 0x00000000000676c0                 %i2 =
> 0xffffffff7a441de8
> >>> > >> >> > %o3 = 0xffffffff7a400018                 %i3 =
> 0xffffffff7c903818
> >>> > >> >> > %o4 = 0x00000000000007b9                 %i4 =
> 0x0000000000000000
> >>> > >> >> > %o5 = 0x000000010051c790                 %i5 =
> 0x0000000000000023
> >>> > >> >> > %o6 = 0xffffffff7c902eb1                 %i6 =
> 0xffffffff7c902f61
> >>> > >> >> > %o7 = 0x000000010006de3c abandon_carrier+0x118 %i7 =
> 0x000000010006e958 dealloc_block.part.17+0x1c0
> >>> > >> >> >
> >>> > >> >> >  %ccr = 0x44 xcc=nZvc icc=nZvc
> >>> > >> >> >    %y = 0x0000000000000000
> >>> > >> >> >   %pc = 0x000000010006db14 cpool_insert+0xd0
> >>> > >> >> >  %npc = 0x000000010006db18 cpool_insert+0xd4
> >>> > >> >> >   %sp = 0xffffffff7c902eb1
> >>> > >> >> >   %fp = 0xffffffff7c902f61
> >>> > >> >> >
> >>> > >> >> >  %asi = 0x82
> >>> > >> >> > %fprs = 0x00
> >>> > >> >> >
> >>> > >> >> > # disassembly around pc
> >>> > >> >> >
> >>> > >> >> > cpool_insert+0xa8:              mov       %g1, %g2
> >>> > >> >> > cpool_insert+0xac:              ldx       [%g5 + 0x10], %g1
> >>> > >> >> > cpool_insert+0xb0:              membar
> #LoadLoad|#LoadStore
> >>> > >> >> > cpool_insert+0xb4:              ba,pt     %xcc, +0x1c
>  <cpool_insert+0xd0>
> >>> > >> >> > cpool_insert+0xb8:              and       %g1, -0x4, %g4
> >>> > >> >>
> >>> > >> >> > cpool_insert+0xbc:              membar
> #LoadLoad|#LoadStore
> >>> > >> >> > cpool_insert+0xc0:              and       %g2, 0x3, %g3
> >>> > >> >> > cpool_insert+0xc4:              brz,pn    %g3, +0x1ec
>  <cpool_insert+0x2b0>
> >>> > >> >> > cpool_insert+0xc8:              mov       %g2, %g1
> >>> > >> >> > cpool_insert+0xcc:              and       %g1, -0x4, %g4
> >>> > >> >> > cpool_insert+0xd0:              ld        [%g4 + 0x10], %g1
> >>> > >> >>
> >>> > >> >> This is the faulting instruction. We're in the /* Find a
> predecessor
> >>> > >> >> to be, and set mod marker on its next ptr */ loop.
> >>> > >> >>
> >>> > >> >> > cpool_insert+0xd4:              ld        [%g4 + 0x14], %g2
> >>> > >> >> > cpool_insert+0xd8:              sllx      %g1, 0x20, %g1
> >>> > >> >> > cpool_insert+0xdc:              cmp       %g5, %g4
> >>> > >> >> > cpool_insert+0xe0:              bne,pt    %xcc, -0x24
>  <cpool_insert+0xbc>
> >>> > >> >> > cpool_insert+0xe4:              or        %g2, %g1, %g2
> >>> > >> >>
> >>> > >> >> The above reads a 64-bit "->next" pointer by assembling two
> adjacent
> >>> > >> >> 32-bit fields.  Weird, but arithmetically Ok.
> >>> > >> >>
> >>> > >> >> Two things strike me:
> >>> > >> >> 1. The compiler implements "atomic load of 64-bits" as "load
> 32 bits,
> >>> > >> >> load another 32 bits, combine", which isn't correct in a
> multithreaded
> >>> > >> >> program.  The error could be in the compiler, or in the source
> code.
> >>> > >> >> 2. In the register dump it was obvious that the high bits of an
> >>> > >> >> address had been clobbered.
> >>> > >> >>
> >>> > >> >> My suspicion is that either Sun's compiler is buggy, or Erlang
> is
> >>> > >> >> selecting non thread-safe code in this case.
> >>> > >> >>
> >>> > >> >> On SPARC64 Linux w/ GCC I get very different code that uses
> "ldx" for
> >>> > >> >> those 64-bit loads, as expected.
> >>> > >> >>
> >>> > >> >> /Mikael
>