<div dir="ltr"><div dir="ltr"><p class="MsoNormal" style="margin:0cm 0cm 0.0001pt;font-size:12pt;font-family:Calibri,sans-serif"><span lang="EN-US">Hi,</span></p>
<p class="MsoNormal" style="margin:0cm 0cm 0.0001pt;font-size:12pt;font-family:Calibri,sans-serif"><span lang="EN-US"> </span></p>
<p class="MsoNormal" style="margin:0cm 0cm 0.0001pt;font-size:12pt;font-family:Calibri,sans-serif"><span lang="EN-US">Thanks for the response, Mikael.</span></p>
<p class="MsoNormal" style="margin:0cm 0cm 0.0001pt;font-size:12pt;font-family:Calibri,sans-serif"><span lang="EN-US">As per your
suggestion, I am trying to write similar code to determine whether there is an issue
with the Solaris SPARC compiler.</span></p>
<p class="MsoNormal" style="margin:0cm 0cm 0.0001pt;font-size:12pt;font-family:Calibri,sans-serif"><span lang="EN-US"> </span></p>
<p class="MsoNormal" style="margin:0cm 0cm 0.0001pt;font-size:12pt;font-family:Calibri,sans-serif"><span lang="EN-US">But I have
some questions:</span></p>
<p class="gmail-MsoListParagraphCxSpFirst" style="margin:0cm 0cm 0.0001pt 36pt;font-size:12pt;font-family:Calibri,sans-serif"><span lang="EN-US">1.<span style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:7pt;line-height:normal;font-family:"Times New Roman""> </span></span><span lang="EN-US">If there is a problem with the compiler,
then we should see this crash everywhere else as well. Any idea why it is only
reproduced here?</span></p>
<p class="gmail-MsoListParagraphCxSpLast" style="margin:0cm 0cm 0.0001pt 36pt;font-size:12pt;font-family:Calibri,sans-serif"><span lang="EN-US">2.<span style="font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:7pt;line-height:normal;font-family:"Times New Roman""> </span></span><span lang="EN-US">As I understand your explanation, the code reads 64 bits by
assembling two adjacent 32-bit fields. Will that really cause a problem in a multi-threaded program? When switching context to another thread, the OS saves the current
thread's context (and hence its registers) and restores it when the thread becomes
active again.</span></p><p class="gmail-MsoListParagraphCxSpLast" style="margin:0cm 0cm 0.0001pt 36pt;font-size:12pt;font-family:Calibri,sans-serif"><br></p><p class="gmail-MsoListParagraphCxSpLast" style="margin:0cm 0cm 0.0001pt 36pt;font-size:12pt;font-family:Calibri,sans-serif"><span lang="EN-US"></span></p>
<p class="MsoNormal" style="margin:0cm 0cm 0.0001pt;font-size:12pt;font-family:Calibri,sans-serif"><span lang="EN-US"> </span></p>
<p class="MsoNormal" style="margin:0cm 0cm 0.0001pt;font-size:12pt;font-family:Calibri,sans-serif"><span lang="EN-US">Thanks
& Regards,</span></p>
<p class="MsoNormal" style="margin:0cm 0cm 0.0001pt;font-size:12pt;font-family:Calibri,sans-serif"><span lang="EN-US">Pooja </span></p></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, May 11, 2020 at 10:36 PM Mikael Pettersson <<a href="mailto:mikpelinux@gmail.com">mikpelinux@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hello Pooja,<br>
<br>
On Mon, May 11, 2020 at 8:10 AM Pooja Desai <<a href="mailto:pooja.desai10@gmail.com" target="_blank">pooja.desai10@gmail.com</a>> wrote:<br>
><br>
> Hi,<br>
><br>
> I am facing an Erlang core dump issue on a Solaris SPARC setup while running RabbitMQ.<br>
<br>
This looks like a 64-bit build, but the code doesn't look similar to<br>
what I get with gcc-9.3, so I'm assuming you used Sun's compiler?<br>
<br>
<br>
> (dbx) where<br>
><br>
> =>[1] cpool_insert(0x1004efd40, 0xffffffff75600000, 0x61850, 0xffffffff75600018, 0x90f, 0x1004effd0), at 0x10006db14<br>
><br>
> [2] abandon_carrier(0x1004efd40, 0xffffffff75600000, 0xffffffff75645ec0, 0xffffffff77d03818, 0x0, 0x6), at 0x10006de3c<br>
><br>
> [3] 17(0x1004efd40, 0xcb3, 0x2, 0xffffffff75645e60, 0x0, 0x1004efd40), at 0x10006e958<br>
><br>
> [4] erts_alcu_check_delayed_dealloc(0x1004efd40, 0x1, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x8000000000000007), at 0x100075244<br>
><br>
> [5] erts_alloc_scheduler_handle_delayed_dealloc(0xffffffff3a82a620, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x100464, 0xffffffff3a82a5d0),<br>
><br>
> at 0x1000622c0<br>
><br>
> [6] handle_aux_work(0xffffffff3a8204a0, 0x2, 0x1, 0x2, 0x100400, 0x4e5ce123), at 0x1002a6044<br>
><br>
> [7] erts_schedule(0xffffffff3a820380, 0x9, 0x9, 0xffffffff3a81fc80, 0x2, 0x2), at 0x1002a3040<br>
><br>
> [8] process_main(0x100469, 0xffffffff3a302240, 0xfa0, 0x802a, 0xffffffff38f00438, 0x3), at 0x1002901bc<br>
><br>
> [9] sched_thread_func(0xffffffff3a820380, 0x0, 0x0, 0xffffffff7a911240, 0x100000, 0x1), at 0x100038f08<br>
><br>
> [10] thr_wrapper(0xffffffff7fffc278, 0x0, 0x0, 0x100289d48, 0xffffffff3a820380, 0x100038da0), at 0x100289dc8<br>
><br>
><br>
><br>
> This issue is extremely intermittent, so I am not able to reproduce it with a debug build. On our test setup I have seen this core dump twice, and only on the Solaris SPARC server. On other servers (RHEL, SUSE Linux, Solaris x86, Windows, etc.) with a similar test environment, things are working fine.<br>
><br>
> In both instances when I hit this issue, we were restarting the RabbitMQ server, i.e. stopping RabbitMQ and epmd and then running the startup script for RabbitMQ. The script performs two operations:<br>
><br>
> First it pings RabbitMQ using "rabbitmqctl ping" to confirm that RabbitMQ is not already running (I guess this also starts epmd in the background), and then it starts rabbitmq-server in detached mode.<br>
><br>
> The core is generated while starting this daemon.<br>
><br>
><br>
> I checked the code around abandon_carrier ("<a href="https://github.com/erlang/otp/blame/master/erts/emulator/beam/erl_alloc_util.c" rel="noreferrer" target="_blank">https://github.com/erlang/otp/blame/master/erts/emulator/beam/erl_alloc_util.c</a>"), but nothing has changed in that area recently, so I am really at a loss.<br>
><br>
> Please let me know if anyone has faced a similar issue in the past or has any idea about this. I am using OTP version 22.2 and RabbitMQ version 3.7.23.<br>
><br>
> Let me know if any further information is required. The full core dump information is pasted below:<br>
><br>
> debugging core file of beam.smp (64-bit) from hostname01<br>
> file: temp_dir/erlang/erts-10.6/bin/beam.smp<br>
> initial argv:<br>
> /temp_dir/erlang/erts-10.6/bin/beam.smp -- -root /temp_dir/<br>
> threading model: native threads<br>
> status: process terminated by SIGSEGV (Segmentation Fault), addr=<br>
> ffffffff004631b0<br>
<br>
Ok, this tells us the address was unmapped. (It's not an alignment<br>
fault, another common issue on SPARC.)<br>
<br>
<br>
><br>
> C++ symbol demangling enabled<br>
><br>
> # stack<br>
><br>
> cpool_insert+0xd0(10051c500, ffffffff7a400000, ffffffff7a441de8, ffffffff7c903818, 0, 23)<br>
> dealloc_block.part.17+0x1c0(10051c500, cb3, 2, ffffffff7a441d88, 0, 10051c500)<br>
> erts_alcu_check_delayed_dealloc+0xe4(10051c500, 1, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 8000000000000007)<br>
> erts_alloc_scheduler_handle_delayed_dealloc+0x34(ffffffff3b729c20, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 100464, ffffffff3b729bd0)<br>
> handle_aux_work+0xa50(ffffffff3b71faa0, 402, 1, 402, 100400, 42da0c68)<br>
> erts_schedule+0x192c(ffffffff3b71f980, 9, 9, ffffffff3b71f280, 402, 2)<br>
> process_main+0xc4(100469, ffffffff3b202240, fa0, ffffffff3b71f980, 241, 100294204)<br>
> sched_thread_func+0x168(ffffffff3b71f980, 0, 0, ffffffff39401a40, 100000, 1)<br>
> thr_wrapper+0x80(ffffffff7fffb318, 0, 0, 100289d48, ffffffff3b71f980, 100038da0)<br>
> libc.so.1`_lwp_start(0, 0, 0, 0, 0, 0)<br>
><br>
> #############################################################################<br>
><br>
> # registers<br>
><br>
> %g0 = 0x0000000000000000 %l0 = 0xffffffff7a4307a0<br>
> %g1 = 0xffffffff004631a1 %l1 = 0x0000000000000000<br>
> %g2 = 0x0000000000000000 %l2 = 0x0000000000000000<br>
> %g3 = 0x000000010051c798 %l3 = 0x0000000000000000<br>
> %g4 = 0xffffffff004631a0 %l4 = 0x0000000000000000<br>
> %g5 = 0x00000001004631a0 beam.smp`firstfit_carrier_pool %l5 = 0x0000000000000000<br>
<br>
This is interesting. Notice how the low 32-bits 004631a0 show up in<br>
three variations:<br>
1. 00000001004631a0 beam.smp`firstfit_carrier_pool (the address of the<br>
firstfit_carrier_pool global variable)<br>
2. ffffffff004631a0 (the above, but with the high 32 bits replaced<br>
with all-bits-one)<br>
3. ffffffff004631a1 (the above, but with a tag in the low bit)<br>
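<br>
The relationship between these three values can be checked with a few<br>
lines of C. This is only a sketch: the constants are copied from the<br>
register dump and the SIGSEGV line above, while the helper names<br>
(strip_tag, low_half, high_half, check_register_dump) are made up for<br>
illustration and do not come from the Erlang sources.<br>

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the register dump and the SIGSEGV line. */
#define POOL_ADDR  UINT64_C(0x00000001004631a0)  /* %g5: firstfit_carrier_pool */
#define G4_VALUE   UINT64_C(0xffffffff004631a0)  /* %g4: untagged, high half wrong */
#define G1_VALUE   UINT64_C(0xffffffff004631a1)  /* %g1: same, low tag bit set */
#define FAULT_ADDR UINT64_C(0xffffffff004631b0)  /* addr= reported with SIGSEGV */

/* Equivalent of "and %g1, -0x4, %g4": clear the two low tag bits. */
static uint64_t strip_tag(uint64_t tagged) {
    return tagged & ~UINT64_C(3);
}

static uint32_t low_half(uint64_t v)  { return (uint32_t)v; }
static uint32_t high_half(uint64_t v) { return (uint32_t)(v >> 32); }

/* Hypothetical helper: asserts that the dump is self-consistent. */
static void check_register_dump(void) {
    assert(strip_tag(G1_VALUE) == G4_VALUE);            /* variation 3 -> 2 */
    assert(low_half(G4_VALUE) == low_half(POOL_ADDR));  /* same low 32 bits */
    assert(high_half(POOL_ADDR) == 1u);                 /* valid address    */
    assert(high_half(G4_VALUE) == 0xffffffffu);         /* clobbered half   */
    assert(G4_VALUE + 0x10 == FAULT_ADDR);              /* ld [%g4 + 0x10]  */
}
```

Calling check_register_dump() from a main() passes all assertions, which<br>
ties the faulting address to %g4 plus the 0x10 offset used by the load.<br>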
<br>
> %g6 = 0x0000000000000000 %l6 = 0x0000000000000000<br>
> %g7 = 0xffffffff39401a40 %l7 = 0x0000000000000000<br>
> %o0 = 0x000000010051c500 %i0 = 0x000000010051c500<br>
> %o1 = 0xffffffff7a400000 %i1 = 0xffffffff7a400000<br>
> %o2 = 0x00000000000676c0 %i2 = 0xffffffff7a441de8<br>
> %o3 = 0xffffffff7a400018 %i3 = 0xffffffff7c903818<br>
> %o4 = 0x00000000000007b9 %i4 = 0x0000000000000000<br>
> %o5 = 0x000000010051c790 %i5 = 0x0000000000000023<br>
> %o6 = 0xffffffff7c902eb1 %i6 = 0xffffffff7c902f61<br>
> %o7 = 0x000000010006de3c abandon_carrier+0x118 %i7 = 0x000000010006e958 dealloc_block.part.17+0x1c0<br>
><br>
> %ccr = 0x44 xcc=nZvc icc=nZvc<br>
> %y = 0x0000000000000000<br>
> %pc = 0x000000010006db14 cpool_insert+0xd0<br>
> %npc = 0x000000010006db18 cpool_insert+0xd4<br>
> %sp = 0xffffffff7c902eb1<br>
> %fp = 0xffffffff7c902f61<br>
><br>
> %asi = 0x82<br>
> %fprs = 0x00<br>
><br>
> # disassembly around pc<br>
><br>
> cpool_insert+0xa8: mov %g1, %g2<br>
> cpool_insert+0xac: ldx [%g5 + 0x10], %g1<br>
> cpool_insert+0xb0: membar #LoadLoad|#LoadStore<br>
> cpool_insert+0xb4: ba,pt %xcc, +0x1c <cpool_insert+0xd0><br>
> cpool_insert+0xb8: and %g1, -0x4, %g4<br>
<br>
> cpool_insert+0xbc: membar #LoadLoad|#LoadStore<br>
> cpool_insert+0xc0: and %g2, 0x3, %g3<br>
> cpool_insert+0xc4: brz,pn %g3, +0x1ec <cpool_insert+0x2b0><br>
> cpool_insert+0xc8: mov %g2, %g1<br>
> cpool_insert+0xcc: and %g1, -0x4, %g4<br>
> cpool_insert+0xd0: ld [%g4 + 0x10], %g1<br>
<br>
This is the faulting instruction. We're in the /* Find a predecessor<br>
to be, and set mod marker on its next ptr */ loop.<br>
<br>
> cpool_insert+0xd4: ld [%g4 + 0x14], %g2<br>
> cpool_insert+0xd8: sllx %g1, 0x20, %g1<br>
> cpool_insert+0xdc: cmp %g5, %g4<br>
> cpool_insert+0xe0: bne,pt %xcc, -0x24 <cpool_insert+0xbc><br>
> cpool_insert+0xe4: or %g2, %g1, %g2<br>
<br>
The above reads a 64-bit "->next" pointer by assembling two adjacent<br>
32-bit fields. Weird, but arithmetically Ok.<br>
<br>
Two things strike me:<br>
1. The compiler implements "atomic load of 64-bits" as "load 32 bits,<br>
load another 32 bits, combine", which isn't correct in a multithreaded<br>
program. The error could be in the compiler, or in the source code.<br>
2. In the register dump it was obvious that the high bits of an<br>
address had been clobbered.<br>
<br>
My suspicion is that either Sun's compiler is buggy, or Erlang is<br>
selecting non thread-safe code in this case.<br>
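<br>
To illustrate why a split load is unsafe, here is a small C sketch that<br>
replays one possible interleaving on a single thread. It is not a<br>
confirmed diagnosis: the struct and function names are hypothetical, the<br>
old pointer value 0xffffffff7a400000 is just a carrier address borrowed<br>
from the register dump, and the assumption is that another thread stores<br>
the tagged address of firstfit_carrier_pool between the two 32-bit loads.<br>

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit word viewed as two 32-bit halves in big-endian (SPARC) order. */
struct split_word {
    uint32_t hi;  /* offset +0: what "ld [%g4 + 0x10]" reads */
    uint32_t lo;  /* offset +4: what "ld [%g4 + 0x14]" reads */
};

static void store_word(struct split_word *w, uint64_t v) {
    w->hi = (uint32_t)(v >> 32);
    w->lo = (uint32_t)v;
}

/* "sllx %g1, 0x20, %g1" followed by "or %g2, %g1, %g2". */
static uint64_t combine(uint32_t hi, uint32_t lo) {
    return ((uint64_t)hi << 32) | lo;
}

/*
 * Replays one unlucky ordering: the reader loads the high half, a store
 * from another CPU lands, then the reader loads the low half.
 */
static uint64_t torn_read(struct split_word *w, uint64_t concurrent_store) {
    uint32_t hi = w->hi;
    store_word(w, concurrent_store);  /* stands in for the other thread */
    uint32_t lo = w->lo;
    return combine(hi, lo);
}

/* Hypothetical self-check: the torn result equals %g1 from the dump. */
static void check_torn_read(void) {
    struct split_word next;
    store_word(&next, UINT64_C(0xffffffff7a400000));  /* assumed old value */
    uint64_t seen = torn_read(&next, UINT64_C(0x00000001004631a1));
    assert(seen == UINT64_C(0xffffffff004631a1));
}
```

The result mixes the old high half with the new low half, a value that<br>
was never stored. No context switch is needed for this: on a multi-CPU<br>
machine the other thread runs at the same time, so saving and restoring<br>
registers does not protect the two loads. A single 64-bit "ldx" cannot be<br>
split this way.<br>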
<br>
On SPARC64 Linux w/ GCC I get very different code that uses "ldx" for<br>
those 64-bit loads, as expected.<br>
<br>
/Mikael<br>
</blockquote></div></div>