erlang (rabbitmq) generating core on Solaris SPARC
Pooja Desai
pooja.desai10@REDACTED
Tue May 19 18:24:27 CEST 2020
Hi Mikael,
The gcc bug mentioned above is not specific to any platform, but the problematic
disassembly is only generated on Solaris SPARC. Any idea why only the Solaris
SPARC Erlang build is affected by this?
To minimise the impact on testing/sock, we are thinking about rebuilding
Erlang only on Solaris SPARC for now, as the issue is only seen on that
platform. So I am asking for your expert opinion: do you see any problem
with this approach?
Thanks & Regards,
Pooja
On Fri, May 15, 2020 at 1:51 PM Pooja Desai <pooja.desai10@REDACTED> wrote:
> Thanks Mikael,
>
> As per your suggestion, I am rebuilding Erlang with a newer gcc version.
> Thanks for helping with this.
>
> Thanks & Regards,
> Pooja
>
> On Fri, May 15, 2020 at 3:20 AM Mikael Pettersson <mikpelinux@REDACTED>
> wrote:
>
>> On Thu, May 14, 2020 at 12:09 PM Mikael Pettersson <mikpelinux@REDACTED>
>> wrote:
>> >
>> > On Thu, May 14, 2020 at 9:32 AM Pooja Desai <pooja.desai10@REDACTED>
>> wrote:
>> > >
>> > > Hi Mikael,
>> > >
>> > >
>> > > Please find the files you requested in the attachment erl_files.tar.gz
>> (compressed because of an issue with mail size)
>> > >
>> > > Normal build option is:
>> > >
>> > > # gcc -Werror=undef -Werror=implicit -Werror=return-type -m64 -g
>> -O3 -fomit-frame-pointer
>> -Ierlang/src/solaris/otp/erts/sparc-sun-solaris2.10 -D_LARGEFILE_SOURCE
>> -D_FILE_OFFSET_BITS=64 -fno-tree-copyrename -DHAVE_CONFIG_H -Wall
>> -Wstrict-prototypes -Wmissing-prototypes -Wdeclaration-after-statement
>> -DUSE_THREADS -D_THREAD_SAFE -D_REENTRANT -DPOSIX_THREADS
>> -D_POSIX_PTHREAD_SEMANTICS -Isparc-sun-solaris2.10/opt/smp -Ibeam
>> -Isys/unix -Isys/common -Isparc-sun-solaris2.10 -Izlib -Ipcre -Ihipe
>> -I../include -I../include/sparc-sun-solaris2.10 -I../include/internal
>> -I../include/internal/sparc-sun-solaris2.10 -c beam/erl_alloc_util.c -o
>> obj/sparc-sun-solaris2.10/opt/smp/erl_alloc_util.o
>> > >
>> > > after your suggestion I updated it as below to generate
>> erl_alloc_util file:
>> > >
>> > > # gcc -Werror=undef -Werror=implicit -Werror=return-type -m64 -g
>> -O3 -fomit-frame-pointer
>> -Ierlang/src/solaris/otp/erts/sparc-sun-solaris2.10 -D_LARGEFILE_SOURCE
>> -D_FILE_OFFSET_BITS=64 -fno-tree-copyrename -DHAVE_CONFIG_H -Wall
>> -Wstrict-prototypes -Wmissing-prototypes -Wdeclaration-after-statement
>> -DUSE_THREADS -D_THREAD_SAFE -D_REENTRANT -DPOSIX_THREADS
>> -D_POSIX_PTHREAD_SEMANTICS -Isparc-sun-solaris2.10/opt/smp -Ibeam
>> -Isys/unix -Isys/common -Isparc-sun-solaris2.10 -Izlib -Ipcre -Ihipe
>> -I../include -I../include/sparc-sun-solaris2.10 -I../include/internal
>> -I../include/internal/sparc-sun-solaris2.10 -E beam/erl_alloc_util.c -o
>> obj/sparc-sun-solaris2.10/opt/smp/erl_alloc_util.i
>> > >
>> > > Also, one thing I forgot to mention: we are using gcc version 4.9.2
>> (GCC) for building on Solaris SPARC, as Erlang doesn't support Sun's native
>> compiler.
>> >
>> > I've been able to reproduce the non-atomic code for those 64-bit loads
>> > in cpool_insert() using gcc-4.9 cross compilers to sparc64-linux, but
>> > gcc-5.5/6.5/7.5/8.4/9.3 all emit correct code as far as I can tell.
>> >
>> > So the solution is to upgrade your gcc (I suggest 9.3.0) and rebuild
>> > your Erlang/OTP VM with that.
>> >
>> > /Mikael
>>
>> I created a reduced test case from erl_alloc.i, and it turns out
>> Erlang/OTP was hit by
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70424, which affects
>> gcc-4.9 (all versions) and gcc-5.x (x < 4), on all strict-alignment
>> targets.
>>
>> So the recommendation stands: upgrade your gcc.
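A minimal C sketch of the version range named above, in case a build wants to flag an affected compiler early. It assumes only what the thread states (gcc-4.9 in all versions, and gcc-5.x with x < 4); the function name gcc_hit_by_pr70424 is hypothetical, and releases older than 4.9 are not covered by the thread, so the sketch makes no claim about them.

```c
#include <stdbool.h>

/* Hypothetical helper: true for the gcc releases the thread names as
 * affected by GCC bug 70424 (every 4.9.x, and 5.x with x < 4).
 * A false result for releases older than 4.9 only means "not discussed
 * in the thread", not "known to be safe". */
bool gcc_hit_by_pr70424(int major, int minor)
{
    if (major == 4 && minor == 9)
        return true;
    if (major == 5 && minor < 4)
        return true;
    return false;
}

/* Compile-time variant using gcc's predefined version macros.
 * clang also defines __GNUC__, so it is excluded explicitly. */
#if defined(__GNUC__) && !defined(__clang__)
# if (__GNUC__ == 4 && __GNUC_MINOR__ == 9) || \
     (__GNUC__ == 5 && __GNUC_MINOR__ < 4)
#  warning "this gcc is in the range reported as affected by GCC PR 70424"
# endif
#endif
```

With the compiler from this thread (4.9.2) the check fires, while the releases reported as emitting correct code (5.5, 6.5, 7.5, 8.4, 9.3) fall outside the range.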
>>
>> > > Thanks & Regards,
>> > > Pooja
>> > >
>> > > On Tue, May 12, 2020 at 10:44 PM Mikael Pettersson <
>> mikpelinux@REDACTED> wrote:
>> > >>
>> > >> On Tue, May 12, 2020 at 4:18 PM Pooja Desai <pooja.desai10@REDACTED>
>> wrote:
>> > >> >
>> > >> > Hi,
>> > >> >
>> > >> >
>> > >> >
>> > >> > Thanks for response Mikael
>> > >> >
>> > >> > As per your suggestion, I am trying to write similar code to
>> determine whether there is an issue with the Solaris SPARC compiler.
>> > >> >
>> > >> >
>> > >> >
>> > >> > But I have some doubts,
>> > >> >
>> > >> > 1. If there is a problem with the compiler, then we should be able to
>> see this crash everywhere else too. Any idea why it's only reproduced here?
>> > >> >
>> > >> > 2. As I understand your explanation, it reads 64 bits by
>> assembling two adjacent 32-bit fields. Will that really cause a problem in a
>> multi-threaded program? When context switching to another
>> thread, the OS saves the current context of the thread (and hence the registers)
>> and restores it when the thread is active again.
>> > >> >
>> > >> >
>> > >>
>> > >> Breaking up a 64-bit load into two 32-bit loads loses atomicity with
>> > >> any concurrent store into that location, meaning the read may end up
>> > >> observing a result composed of 32 bit from the old value and 32 bit
>> > >> from the newly stored value, whereas the code expects to see either
>> > >> the old or the new, but never this mixture. This can happen also on
>> a
>> > >> single-threaded CPU with preemptive multitasking.
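A minimal C sketch of the torn read described above, assuming the concurrent store lands exactly between the two 32-bit loads. The names slot, store64 and torn_read are hypothetical and not taken from erts; the interleaving is simulated in a single thread so that the outcome is deterministic rather than timing-dependent.

```c
#include <stdint.h>

/* A 64-bit location viewed as two 32-bit halves, the way the split
 * load accesses it. SPARC is big-endian, so the high half sits at the
 * lower address. */
struct slot {
    uint32_t hi;
    uint32_t lo;
};

/* Models the writer's 64-bit store. */
static void store64(struct slot *s, uint64_t v)
{
    s->hi = (uint32_t)(v >> 32);
    s->lo = (uint32_t)v;
}

/* Models a reader that was compiled into two 32-bit loads, with the
 * writer's store happening between them. */
uint64_t torn_read(uint64_t old_val, uint64_t new_val)
{
    struct slot s;
    store64(&s, old_val);

    uint32_t hi = s.hi;        /* first load sees the old value  */
    store64(&s, new_val);      /* concurrent store happens here  */
    uint32_t lo = s.lo;        /* second load sees the new value */

    return ((uint64_t)hi << 32) | lo;
}
```

For illustration, take an old value whose upper half is 0xffffffff (such as the carrier address 0xffffffff7a400000 from the register dump quoted further down) and the new value 0x00000001004631a0. The function then returns 0xffffffff004631a0, a value that was never stored, which is consistent with the clobbered address seen in the dump.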
>> > >>
>> > >> To move forward on the issue, I think you need to recreate the
>> > >> pre-processed source for erl_alloc_util.c. To do that:
>> > >> 1. Compile Erlang/OTP as usual, starting from a pristine source
>> > >> directory (no left-overs from a previous build, best is to start
>> fresh
>> > >> somewhere), but pass "V=1" to make. Save the output from "make" in a
>> > >> file.
>> > >> 2. Note the step where it compiles erl_alloc_util.c.
>> > >> 3. Reexecute that step, but replace any "-c" with "-E" and "-o
>> > >> erl_alloc_util.o" with "-o erl_alloc_util.i".
>> > >> 4. Please send this ".i" file, together with the exact build steps
>> and
>> > >> configuration options you used, and
>> > >> "erts/sparc-sun-solaris11/config.h" (I'm guessing the file name here)
>> > >> to me.
>> > >>
>> > >> My theory is that Erlang/OTP selects the wrong low-level primitives
>> > >> for this platform.
>> > >>
>> > >>
>> > >> >
>> > >> >
>> > >> > Thanks & Regards,
>> > >> >
>> > >> > Pooja
>> > >> >
>> > >> >
>> > >> > On Mon, May 11, 2020 at 10:36 PM Mikael Pettersson <
>> mikpelinux@REDACTED> wrote:
>> > >> >>
>> > >> >> Hello Pooja,
>> > >> >>
>> > >> >> On Mon, May 11, 2020 at 8:10 AM Pooja Desai <
>> pooja.desai10@REDACTED> wrote:
>> > >> >> >
>> > >> >> > Hi,
>> > >> >> >
>> > >> >> > Facing erlang core issue on solaris SPARC setup while running
>> RabbitMQ
>> > >> >>
>> > >> >> This looks like a 64-bit build, but the code doesn't look similar
>> to
>> > >> >> what I get with gcc-9.3, so I'm assuming you used Sun's compiler?
>> > >> >>
>> > >> >>
>> > >> >> > (dbx) where
>> > >> >> >
>> > >> >> > =>[1] cpool_insert(0x1004efd40, 0xffffffff75600000, 0x61850,
>> 0xffffffff75600018, 0x90f, 0x1004effd0), at 0x10006db14
>> > >> >> >
>> > >> >> > [2] abandon_carrier(0x1004efd40, 0xffffffff75600000,
>> 0xffffffff75645ec0, 0xffffffff77d03818, 0x0, 0x6), at 0x10006de3c
>> > >> >> >
>> > >> >> > [3] 17(0x1004efd40, 0xcb3, 0x2, 0xffffffff75645e60, 0x0,
>> 0x1004efd40), at 0x10006e958
>> > >> >> >
>> > >> >> > [4] erts_alcu_check_delayed_dealloc(0x1004efd40, 0x1,
>> 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44,
>> 0x8000000000000007), at 0x100075244
>> > >> >> >
>> > >> >> > [5]
>> erts_alloc_scheduler_handle_delayed_dealloc(0xffffffff3a82a620,
>> 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x100464,
>> 0xffffffff3a82a5d0),
>> > >> >> >
>> > >> >> > at 0x1000622c0
>> > >> >> >
>> > >> >> > [6] handle_aux_work(0xffffffff3a8204a0, 0x2, 0x1, 0x2,
>> 0x100400, 0x4e5ce123), at 0x1002a6044
>> > >> >> >
>> > >> >> > [7] erts_schedule(0xffffffff3a820380, 0x9, 0x9,
>> 0xffffffff3a81fc80, 0x2, 0x2), at 0x1002a3040
>> > >> >> >
>> > >> >> > [8] process_main(0x100469, 0xffffffff3a302240, 0xfa0, 0x802a,
>> 0xffffffff38f00438, 0x3), at 0x1002901bc
>> > >> >> >
>> > >> >> > [9] sched_thread_func(0xffffffff3a820380, 0x0, 0x0,
>> 0xffffffff7a911240, 0x100000, 0x1), at 0x100038f08
>> > >> >> >
>> > >> >> > [10] thr_wrapper(0xffffffff7fffc278, 0x0, 0x0, 0x100289d48,
>> 0xffffffff3a820380, 0x100038da0), at 0x100289dc8
>> > >> >> >
>> > >> >> >
>> > >> >> >
>> > >> >> > This issue is extremely intermittent, so I am not able to
>> reproduce it with a debug build. On our test setup I have seen this core
>> twice, and only on the Solaris SPARC server; on other servers (RHEL, SUSE Linux,
>> Solaris x86, Windows, etc.) with a similar test environment things are working
>> fine.
>> > >> >> >
>> > >> >> > In both instances where I faced this issue, we were restarting the
>> RabbitMQ server, i.e. stopping RabbitMQ and epmd and then running the startup script for
>> RabbitMQ. The script performs two operations:
>> > >> >> >
>> > >> >> > First it pings RabbitMQ using "rabbitmqctl ping" to confirm
>> RabbitMQ is not already running (I guess this also starts epmd in the
>> background), and then it starts rabbitmq-server in detached mode.
>> > >> >> >
>> > >> >> > The core is generated while starting this daemon.
>> > >> >> >
>> > >> >> >
>> > >> >> > I checked code around abandon_carrier("
>> https://github.com/erlang/otp/blame/master/erts/emulator/beam/erl_alloc_util.c")
>> but nothing has changed in that area recently. So I am really clueless in
>> this situation.
>> > >> >> >
>> > >> >> > Please let me know if anyone has faced a similar issue in the past or has
>> any idea about this. We are using OTP version 22.2 and RabbitMQ version 3.7.23.
>> > >> >> >
>> > >> >> > Let me know if any further information is required. Pasting the full
>> core dump information below:
>> > >> >> >
>> > >> >> > debugging core file of beam.smp (64-bit) from hostname01
>> > >> >> > file: temp_dir/erlang/erts-10.6/bin/beam.smp
>> > >> >> > initial argv:
>> > >> >> > /temp_dir/erlang/erts-10.6/bin/beam.smp -- -root /temp_dir/
>> > >> >> > threading model: native threads
>> > >> >> > status: process terminated by SIGSEGV (Segmentation Fault),
>> addr=
>> > >> >> > ffffffff004631b0
>> > >> >>
>> > >> >> Ok, this tells us the address was unmapped. (It's not an
>> alignment
>> > >> >> fault, another common issue on SPARC.)
>> > >> >>
>> > >> >>
>> > >> >> >
>> > >> >> > C++ symbol demangling enabled
>> > >> >> >
>> > >> >> > # stack
>> > >> >> >
>> > >> >> > cpool_insert+0xd0(10051c500, ffffffff7a400000,
>> ffffffff7a441de8, ffffffff7c903818, 0, 23)
>> > >> >> > dealloc_block.part.17+0x1c0(10051c500, cb3, 2,
>> ffffffff7a441d88, 0, 10051c500)
>> > >> >> > erts_alcu_check_delayed_dealloc+0xe4(10051c500, 1,
>> ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 8000000000000007)
>> > >> >> >
>> erts_alloc_scheduler_handle_delayed_dealloc+0x34(ffffffff3b729c20,
>> ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 100464,
>> ffffffff3b729bd0)
>> > >> >> > handle_aux_work+0xa50(ffffffff3b71faa0, 402, 1, 402, 100400,
>> 42da0c68)
>> > >> >> > erts_schedule+0x192c(ffffffff3b71f980, 9, 9, ffffffff3b71f280,
>> 402, 2)
>> > >> >> > process_main+0xc4(100469, ffffffff3b202240, fa0,
>> ffffffff3b71f980, 241, 100294204)
>> > >> >> > sched_thread_func+0x168(ffffffff3b71f980, 0, 0,
>> ffffffff39401a40, 100000, 1)
>> > >> >> > thr_wrapper+0x80(ffffffff7fffb318, 0, 0, 100289d48,
>> ffffffff3b71f980, 100038da0)
>> > >> >> > libc.so.1`_lwp_start(0, 0, 0, 0, 0, 0)
>> > >> >> >
>> > >> >> >
>> #############################################################################
>> > >> >> >
>> > >> >> > # registers
>> > >> >> >
>> > >> >> > %g0 = 0x0000000000000000 %l0 =
>> 0xffffffff7a4307a0
>> > >> >> > %g1 = 0xffffffff004631a1 %l1 =
>> 0x0000000000000000
>> > >> >> > %g2 = 0x0000000000000000 %l2 =
>> 0x0000000000000000
>> > >> >> > %g3 = 0x000000010051c798 %l3 =
>> 0x0000000000000000
>> > >> >> > %g4 = 0xffffffff004631a0 %l4 =
>> 0x0000000000000000
>> > >> >> > %g5 = 0x00000001004631a0 beam.smp`firstfit_carrier_pool %l5 =
>> 0x0000000000000000
>> > >> >>
>> > >> >> This is interesting. Notice how the low 32-bits 004631a0 show up
>> in
>> > >> >> three variations:
>> > >> >> 1. 00000001004631a0 beam.smp`firstfit_carrier_pool (the address
>> of the
>> > >> >> firstfit_carrier_pool global variable)
>> > >> >> 2. ffffffff004631a0 (the above, but with the high 32 bits replaced
>> > >> >> with all-bits-one)
>> > >> >> 3. ffffffff004631a1 (the above, but with a tag in the low bit)
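The relation between variations 2 and 3 can be sketched in C as follows. This assumes a two-bit tag in the low bits of the pointer word, which is what the "and ..., -0x4" and "and ..., 0x3" instructions in the disassembly further down suggest; the names TAG_MASK, strip_tag and tag_bits are hypothetical and are not the identifiers used in erl_alloc_util.c.

```c
#include <stdint.h>

/* Assumed tag width: the two low bits of the pointer word. */
#define TAG_MASK ((uint64_t)0x3)

/* Same effect as "and %g1, -0x4, %g4": clear the tag bits to get the
 * address that is then dereferenced. */
uint64_t strip_tag(uint64_t word)
{
    return word & ~TAG_MASK;
}

/* Same effect as "and %g2, 0x3, %g3": isolate the tag bits. */
unsigned tag_bits(uint64_t word)
{
    return (unsigned)(word & TAG_MASK);
}
```

Applied to %g1 = 0xffffffff004631a1 this yields the address 0xffffffff004631a0 held in %g4 and a tag of 1. The low 32 bits of that address match those of firstfit_carrier_pool (0x00000001004631a0); only the high 32 bits differ.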
>> > >> >>
>> > >> >> > %g6 = 0x0000000000000000 %l6 =
>> 0x0000000000000000
>> > >> >> > %g7 = 0xffffffff39401a40 %l7 =
>> 0x0000000000000000
>> > >> >> > %o0 = 0x000000010051c500 %i0 =
>> 0x000000010051c500
>> > >> >> > %o1 = 0xffffffff7a400000 %i1 =
>> 0xffffffff7a400000
>> > >> >> > %o2 = 0x00000000000676c0 %i2 =
>> 0xffffffff7a441de8
>> > >> >> > %o3 = 0xffffffff7a400018 %i3 =
>> 0xffffffff7c903818
>> > >> >> > %o4 = 0x00000000000007b9 %i4 =
>> 0x0000000000000000
>> > >> >> > %o5 = 0x000000010051c790 %i5 =
>> 0x0000000000000023
>> > >> >> > %o6 = 0xffffffff7c902eb1 %i6 =
>> 0xffffffff7c902f61
>> > >> >> > %o7 = 0x000000010006de3c abandon_carrier+0x118 %i7 =
>> 0x000000010006e958 dealloc_block.part.17+0x1c0
>> > >> >> >
>> > >> >> > %ccr = 0x44 xcc=nZvc icc=nZvc
>> > >> >> > %y = 0x0000000000000000
>> > >> >> > %pc = 0x000000010006db14 cpool_insert+0xd0
>> > >> >> > %npc = 0x000000010006db18 cpool_insert+0xd4
>> > >> >> > %sp = 0xffffffff7c902eb1
>> > >> >> > %fp = 0xffffffff7c902f61
>> > >> >> >
>> > >> >> > %asi = 0x82
>> > >> >> > %fprs = 0x00
>> > >> >> >
>> > >> >> > # disassembly around pc
>> > >> >> >
>> > >> >> > cpool_insert+0xa8: mov %g1, %g2
>> > >> >> > cpool_insert+0xac: ldx [%g5 + 0x10], %g1
>> > >> >> > cpool_insert+0xb0: membar #LoadLoad|#LoadStore
>> > >> >> > cpool_insert+0xb4: ba,pt %xcc, +0x1c
>> <cpool_insert+0xd0>
>> > >> >> > cpool_insert+0xb8: and %g1, -0x4, %g4
>> > >> >>
>> > >> >> > cpool_insert+0xbc: membar #LoadLoad|#LoadStore
>> > >> >> > cpool_insert+0xc0: and %g2, 0x3, %g3
>> > >> >> > cpool_insert+0xc4: brz,pn %g3, +0x1ec
>> <cpool_insert+0x2b0>
>> > >> >> > cpool_insert+0xc8: mov %g2, %g1
>> > >> >> > cpool_insert+0xcc: and %g1, -0x4, %g4
>> > >> >> > cpool_insert+0xd0: ld [%g4 + 0x10], %g1
>> > >> >>
>> > >> >> This is the faulting instruction. We're in the /* Find a
>> predecessor
>> > >> >> to be, and set mod marker on its next ptr */ loop.
>> > >> >>
>> > >> >> > cpool_insert+0xd4: ld [%g4 + 0x14], %g2
>> > >> >> > cpool_insert+0xd8: sllx %g1, 0x20, %g1
>> > >> >> > cpool_insert+0xdc: cmp %g5, %g4
>> > >> >> > cpool_insert+0xe0: bne,pt %xcc, -0x24
>> <cpool_insert+0xbc>
>> > >> >> > cpool_insert+0xe4: or %g2, %g1, %g2
>> > >> >>
>> > >> >> The above reads a 64-bit "->next" pointer by assembling two
>> adjacent
>> > >> >> 32-bit fields. Weird, but arithmetically Ok.
>> > >> >>
>> > >> >> Two things strike me:
>> > >> >> 1. The compiler implements "atomic load of 64-bits" as "load 32
>> bits,
>> > >> >> load another 32 bits, combine", which isn't correct in a
>> multithreaded
>> > >> >> program. The error could be in the compiler, or in the source
>> code.
>> > >> >> 2. In the register dump it was obvious that the high bits of an
>> > >> >> address had been clobbered.
>> > >> >>
>> > >> >> My suspicion is that either Sun's compiler is buggy, or Erlang is
>> > >> >> selecting non thread-safe code in this case.
>> > >> >>
>> > >> >> On SPARC64 Linux w/ GCC I get very different code that uses "ldx"
>> for
>> > >> >> those 64-bit loads, as expected.
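For comparison, a minimal sketch of how source code can ask for an indivisible 64-bit load, here using C11 <stdatomic.h>. This is an assumption-laden illustration only: Erlang/OTP uses its own low-level atomic primitives rather than this header, and struct node, read_next and write_next are hypothetical names, not the real carrier-pool code.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical list node; not the real erts carrier-pool structure. */
struct node {
    _Atomic uint64_t next;   /* tagged pointer kept as an integer */
};

/* A single atomic 64-bit load. The reader sees either the old or the
 * new value, never a mixture of the two. */
uint64_t read_next(struct node *n)
{
    return atomic_load_explicit(&n->next, memory_order_acquire);
}

/* The matching atomic 64-bit store. */
void write_next(struct node *n, uint64_t v)
{
    atomic_store_explicit(&n->next, v, memory_order_release);
}
```

On a 64-bit target a correct compiler would be expected to turn read_next into one 64-bit load instruction (such as "ldx" on SPARC64) plus whatever barrier the memory order requires, instead of two 32-bit loads.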
>> > >> >>
>> > >> >> /Mikael
>>
>