<div dir="ltr">I would also like to add that your testcase for provoking this was very helpful. A crash like this will almost always show a random symptom and rarely the actual cause, because by the time of the crash the point of corruption has long since passed. Therefore, a testcase that reproduces the error is the most helpful thing in hunting down the problem.</div><div class="gmail_extra"><br><div class="gmail_quote">2015-05-22 21:32 GMT+02:00 Björn-Egil Dahlberg <span dir="ltr"><<a href="mailto:wallentin.dahlberg@gmail.com" target="_blank">wallentin.dahlberg@gmail.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">I'll just mention that we have looked at this in the VM team at OTP and can confirm the results. We also have a reasonable idea of what's happening.<div>It seems that a binary match state is not handled properly in the garbage collection of literals (the GC that runs when purging code and moving literals to the process heaps).</div><div><br></div><div>I will look into it more on Monday (or perhaps this weekend).</div><span class="HOEnZb"><font color="#888888"><div><br></div><div>// Björn-Egil</div></font></span></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">2015-05-22 20:54 GMT+02:00 Bob Gustafson <span dir="ltr"><<a href="mailto:bobgus@rcn.com" target="_blank">bobgus@rcn.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
I am running 18.rc1 on a MacBook Air, and I was able to duplicate your
segv (although I haven't seen the dump). I ran it for about 20
minutes total and it segfaulted 3 times in that time frame.<br>
<br>
While running, I used the Mac Activity Monitor to peek.<br>
<br>
It is using about 380% CPU and 19 threads, and appears to be a single
OS X process.<br>
<br>
Memory jumps around a bit - roughly 360-420 MB initially and then it
drops down - I saw 127 MB, and then it was gone - crashed.<br>
<br>
I will fish around for the crash dump.<br>
<br>
Have fun<br>
<br>
Bob G<div><div><br>
<br>
<div>On 05/21/2015 11:48 AM, Soup wrote:<br>
</div>
</div></div><blockquote type="cite"><div><div>
<div dir="ltr">
<div class="gmail_default" style="font-family:arial,helvetica,sans-serif">This topic, or
one very similar, appears to have been discussed before in the
erlang-patches mailing list thread titled "erlang node crashes
in erts_gc_after_bif_call" from October, 2012 (<a href="http://erlang.org/pipermail/erlang-patches/2012-October/003072.html" target="_blank">http://erlang.org/pipermail/erlang-patches/2012-October/003072.html</a>).
No clear resolution was reached on this thread, and I am
currently dealing with it in production systems, so I have
decided to address this mailing list.<br>
<br>
Please see the bottom of the email for system specification,
as I believe this to be largely unrelated (except possibly for
multithreading).<br>
<br>
Please feel free to request any pertinent information I may
have left out, or to make suggestions to improve future bug
reports. I don't often submit bug reports, and am not at all
familiar with Erlang/OTP's particular practices in this
regard.<br>
<br>
<b><font size="4">## Scenario and Error ##</font><br>
</b><br>
The error is a segmentation fault arising out of the
erts_garbage_collect and check_process_code functions.<br>
<br>
The scenario is as follows:<br>
1) You must be hot-loading a module (in my case, this module
is dynamically generated) periodically.<br>
2) You must have non-suspended processes active in the module
you are hot-loading while it is being loaded (though not
necessarily *in* the code of the module; they may be using terms
from the module or holding function references to the module).<br>
3) Purging of the <b>old</b> version of the module must be
happening at the same time as garbage collection. (in my case,
the garbage collection is explicit because of the use of large
binary terms with relatively few reductions; that does not
appear to be the case in the situation laid out in the
previously mentioned thread).<br>
<br>
It appears, at least to my untrained eye, that garbage
collection sweeps can occur at the same time as code purging,
and that this seems to happen without multithreading
protection. My reason for this suspicion is that in my
production systems I began receiving one of two segmentation
faults: one occurring in the function check_process_code (of
erts/emulator/beam/beam_bif_load.c) and the other in
erts_garbage_collect (of erts/emulator/beam/erl_gc.c). Most of
the time *in production*, the segmentation fault occurred in the
check_process_code function. Only sometimes did it appear to
be coming from erts_garbage_collect.<br>
<br>
<b style="font-size:large">## Reproducing the Error ##</b><br>
<br>
<div>It took a while, but I did ultimately manage to create an
app which reliably produces this error (insofar as I can
tell). Please see the app here: <a href="https://github.com/fauxsoup/erlang-sigsegv" target="_blank">https://github.com/fauxsoup/erlang-sigsegv</a><br>
<br>
There are some apparent differences from what I was
observing in production, but this could possibly be related
to differences between my production environment and my
testing environment (which are non-trivial), and potentially
differences between my minimal test case and the production
service. Please see the bottom of this email for pertinent
details about both environments.</div>
<div><br>
For testing, and because my production deployment of Erlang
does not include debug symbols, I recompiled Erlang/OTP 17.4
with the flags "-g -O2" to produce debug symbols and prevent
aggressive optimizations which may distort the stacktrace.<br>
<br>
The primary difference between the *results* of the error in
production versus testing is that the segmentation fault in
testing <b>always </b>comes from erts_garbage_collect. I
have not been able to produce a single test result in which
the segmentation fault occurred in check_process_code using
the minimal test case code.<br>
<br>
Another difference, which I believe to be caused by the
inclusion of debug symbols, is that erts_garbage_collect
appears earlier in the backtrace in testing, and that the
actual segmentation fault appears to come from the function
sweep_one_area (erl_gc.c again). My assumption is that the
optimization and lack of debug symbols in the production
system merely obfuscated the origin of the segmentation
fault there.<br>
<br>
<b><font size="4">## The Backtrace ##<br>
<br>
</font></b></div>
<div><span style="font-family:arial,sans-serif;font-size:12.8000001907349px">Included
here for your convenience (also available in test case
README):</span><br style="font-family:arial,sans-serif;font-size:12.8000001907349px">
<br style="font-family:arial,sans-serif;font-size:12.8000001907349px">
<div style="font-family:arial,sans-serif;font-size:12.8000001907349px">Program
received signal SIGSEGV, Segmentation fault.</div>
<div style="font-family:arial,sans-serif;font-size:12.8000001907349px">
[Switching to Thread 0x7ffff3b3e700 (LWP 26743)]</div>
<div style="font-family:arial,sans-serif;font-size:12.8000001907349px">sweep_one_area
(n_hp=0x7fffe8862028, n_htop=0x7fffe8862c48,
src=src@entry=0x7fffe9ec2028 "",
src_size=src_size@entry=600224) at beam/erl_gc.c:1816</div>
<div style="font-family:arial,sans-serif;font-size:12.8000001907349px">1816<span style="white-space:pre-wrap"> </span>mb->base =
binary_bytes(*origptr);</div>
<div style="font-family:arial,sans-serif;font-size:12.8000001907349px">(gdb)
bt</div>
<div style="font-family:arial,sans-serif;font-size:12.8000001907349px">#0
sweep_one_area (n_hp=0x7fffe8862028,
n_htop=0x7fffe8862c48, src=src@entry=0x7fffe9ec2028 "",
src_size=src_size@entry=600224) at beam/erl_gc.c:1816</div>
<div style="font-family:arial,sans-serif;font-size:12.8000001907349px">#1
0x0000000000527ea0 in do_minor (nobj=1,
objv=0x7ffff3b3dd50, new_sz=121536, p=0x7ffff5c80800) at
beam/erl_gc.c:1160</div>
<div style="font-family:arial,sans-serif;font-size:12.8000001907349px">#2
minor_collection (recl=<synthetic pointer>, nobj=1,
objv=0x7ffff3b3dd50, need=0, p=0x7ffff5c80800) at
beam/erl_gc.c:876</div>
<div style="font-family:arial,sans-serif;font-size:12.8000001907349px">#3
erts_garbage_collect (p=0x7ffff5c80800,
need=need@entry=0, objv=objv@entry=0x7ffff3b3dd50,
nobj=nobj@entry=1) at beam/erl_gc.c:450</div>
<div style="font-family:arial,sans-serif;font-size:12.8000001907349px">#4
0x000000000052877b in erts_gc_after_bif_call
(p=0x7ffff5c80800, result=140736302308346,
regs=<optimized out>, arity=<optimized out>)
at beam/erl_gc.c:370</div>
<div style="font-family:arial,sans-serif;font-size:12.8000001907349px">#5
0x0000000000571951 in process_main () at
beam/beam_emu.c:2787</div>
<div style="font-family:arial,sans-serif;font-size:12.8000001907349px">#6
0x00000000004a9a70 in sched_thread_func
(vesdp=0x7ffff51cc8c0) at beam/erl_process.c:7743</div>
<div style="font-family:arial,sans-serif;font-size:12.8000001907349px">#7
0x00000000006056fb in thr_wrapper (vtwd=0x7fffffffd9a0)
at pthread/ethread.c:106</div>
<div style="font-family:arial,sans-serif;font-size:12.8000001907349px">#8
0x00007ffff704d374 in start_thread () from
/usr/lib/libpthread.so.0</div>
<div style="font-family:arial,sans-serif;font-size:12.8000001907349px">#9
0x00007ffff6b8327d in clone () from /usr/lib/libc.so.6</div>
<div style="font-family:arial,sans-serif;font-size:12.8000001907349px"><br>
<b><font size="4">## The Systems ##</font></b><br>
<br>
<b>PRODUCTION<br>
</b>Erlang/OTP 17.4 (also observed on Erlang R15B01)<br>
Amazon EC2 c3.8xlarge (32 Virtual CPUs, ~64 GB Memory)</div>
<div style="font-family:arial,sans-serif;font-size:12.8000001907349px">Debian
Wheezy<br>
uname -a: Linux <a href="http://rtb0.ec2.chitika.net/" target="_blank">rtb0.ec2.chitika.net</a> 3.2.0-4-amd64
#1 SMP Debian 3.2.63-2 x86_64 GNU/Linux<b><br>
</b></div>
<div style="font-family:arial,sans-serif;font-size:12.8000001907349px"><br>
<b>TESTING</b></div>
<div style="font-family:arial,sans-serif;font-size:12.8000001907349px">Erlang/OTP
17.4<br>
Intel Core i5 760 @ 2.80GHz (4 Logical CPUs, 2 cores
IIRC), ~16GB Memory<br>
Arch Linux (up-to-date)<br>
uname -a: Linux diogenes 4.0.1-1-ARCH #1 SMP PREEMPT Wed
Apr 29 12:00:26 CEST 2015 x86_64 GNU/Linux</div>
</div>
</div>
</div>
<br>
<br>
</div></div><pre>_______________________________________________
erlang-bugs mailing list
<a href="mailto:erlang-bugs@erlang.org" target="_blank">erlang-bugs@erlang.org</a>
<a href="http://erlang.org/mailman/listinfo/erlang-bugs" target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a>
</pre>
</blockquote>
<br>
</div>
<br></blockquote></div><br></div>
</div></div></blockquote></div><br></div>