[erlang-bugs] Segmentation Fault in check_process_code / erts_garbage_collect

Björn-Egil Dahlberg wallentin.dahlberg@REDACTED
Fri May 22 21:32:52 CEST 2015


I'll just mention that we have looked at this in the VM team at OTP and can
confirm the results. We also have a reasonable idea of what's happening.
It seems that a binary match state is not handled properly in garbage
collect literals (gc when purging code and moving literals to the process
heaps).

I will look into it more on Monday (or perhaps this weekend).

// Björn-Egil

2015-05-22 20:54 GMT+02:00 Bob Gustafson <bobgus@REDACTED>:

>  I am running 18.rc1 on a Macbook Air - I was able to duplicate your segv
> (although I haven't seen the dump) - I ran it for about 20 minutes total
> and it segv 3x in that time frame.
>
> While running, I used the Mac Activity Monitor to peek.
>
> It is using about 380% CPU, 19 Threads and seems to be only one OSX
> process.
>
> Memory jumps around a bit - roughly 360-420 MB initially and then it drops
> down - I saw 127 MB, and then it was gone - crashed.
>
> I will fish around for the crash dump.
>
> Have fun
>
> Bob G
>
>
> On 05/21/2015 11:48 AM, Soup wrote:
>
>  This topic, or one very similar, appears to have been discussed before
> in the erlang-patches mailing list thread titled "erlang node crashes in
> erts_gc_after_bif_call" from October, 2012 (
> http://erlang.org/pipermail/erlang-patches/2012-October/003072.html). No
> clear resolution was reached on this thread, and I am currently dealing
> with it in production systems, so I have decided to address this mailing
> list.
>
> Please see the bottom of the email for system specification, as I believe
> this to be largely unrelated (except possibly for multithreading).
>
> Please feel free to request any pertinent information I may have left out,
> or to make suggestions to improve future bug reports. I don't often submit
> bug reports, and am not at all familiar with Erlang/OTP's particular
> practices in this regard.
>
>
> *## Scenario and Error ##*
> The error is a segmentation fault arising out of the erts_garbage_collect
> and check_process_code functions.
>
> The scenario is as follows:
> 1) You must be hot-loading a module (in my case, this module is
> dynamically generated) periodically.
> 2) You must have non-suspended processes active in the module you are
> hot-loading while it is being loaded (though not necessarily *in* the code
> of the module; may be using terms from the module or having function
> references to the module).
> 3) Purging of the *old* version of the module must be happening at the
> same time as garbage collection. (in my case, the garbage collection is
> explicit because of the use of large binary terms with relatively few
> reductions; that does not appear to be the case in the situation laid out
> in the previously mentioned thread).
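
The three conditions above can be sketched in Erlang as follows. This is a
minimal illustration under stated assumptions, not the code of the linked test
app and not verified to trigger the crash: the module names hotload_sketch and
hotload_sketch_gen, the 1 KB payload, and the single worker are all
hypothetical. A loader repeatedly regenerates and hot-loads a module whose only
function returns a binary literal, purging the previous version each round,
while a worker outside that module keeps matching over the literal and forcing
explicit garbage collections.

```erlang
-module(hotload_sketch).
-export([run/1]).

%% Hypothetical name of the dynamically generated module.
-define(GEN, hotload_sketch_gen).

%% Hot-load ?GEN Rounds times while one worker uses its literal.
run(Rounds) when is_integer(Rounds), Rounds > 0 ->
    ok = load(0),
    %% Monitor rather than link: code:purge/1 kills any process it finds
    %% still executing old code, and that must not take the loader down.
    {Pid, Ref} = spawn_monitor(fun worker/0),
    lists:foreach(fun load/1, lists:seq(1, Rounds)),
    Pid ! stop,
    receive
        {'DOWN', Ref, process, Pid, _Reason} -> ok
    end.

%% Conditions 1 and 3: purge the old version, then load a fresh one.
load(N) ->
    _ = code:purge(?GEN),
    %% Assumption: a constant binary in the function body is expected
    %% to end up in the module's literal area.
    Payload = binary:copy(<<(N rem 256)>>, 1024),
    Forms = [{attribute, 1, module, ?GEN},
             {attribute, 2, export, [{data, 0}]},
             {function, 3, data, 0,
              [{clause, 3, [], [], [erl_parse:abstract(Payload)]}]}],
    {ok, ?GEN, Beam} = compile:forms(Forms, []),
    {module, ?GEN} = code:load_binary(?GEN, "hotload_sketch_gen.erl", Beam),
    ok.

%% Condition 2: a running process that holds terms from the module
%% without executing inside it, and that garbage collects explicitly.
worker() ->
    Bin = ?GEN:data(),
    _ = scan(Bin, 0),
    erlang:garbage_collect(),
    receive
        stop -> ok
    after 0 ->
        worker()
    end.

%% Walk the binary byte by byte using binary pattern matching.
scan(<<B, Rest/binary>>, Acc) -> scan(Rest, Acc + B);
scan(<<>>, Acc) -> Acc.
```

Calling hotload_sketch:run(1000) from a shell runs the loop; on a release that
is not affected it should simply return ok. The linked repository remains the
authoritative reproduction.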
>
> It appears, at least to my untrained eye, that garbage collection sweeps
> can occur at the same time as code purging, and that this seems to happen
> without multithreading protection. My reason for this suspicion is that in
> my production systems I began receiving one of two segmentation faults: one
> occurring in the function check_process_code (of
> erts/emulator/beam/beam_bif_load.c) and the other in erts_garbage_collect (of
> erts/emulator/beam/erl_gc.c). Most of the time *in production*, the
> segmentation fault occurred in the check_process_code function. Only
> sometimes did it appear to be coming from erts_garbage_collect.
>
> *## Reproducing the Error ##*
>
> It took a while, but I did ultimately manage to create an app which
> reliably produces this error (insofar as I can tell). Please see the app
> here: https://github.com/fauxsoup/erlang-sigsegv
>
> There are some apparent differences from what I was observing in
> production, but this could possibly be related to differences between my
> production environment and my testing environment (which are non-trivial),
> and potentially differences between my minimal test case and the production
> service. Please see the bottom of this email for pertinent details about
> both environments.
>
> For testing, and because my production deployment of Erlang does not
> include debug symbols, I recompiled Erlang/OTP 17.4 with the flags "-g -O2"
> to produce debug symbols and prevent aggressive optimizations which may
> distort the stacktrace.
>
> The primary difference between the *results* of the error in production
> versus testing is that the segmentation fault in testing *always* comes
> from erts_garbage_collect. I have not at all been able to produce a test
> result in which the segmentation fault occurred in check_process_code using
> the minimal test case code.
>
> Another difference, which I believe to be caused by the inclusion of debug
> symbols, is that erts_garbage_collect appears earlier in the backtrace in
> testing, and that the actual segmentation fault appears to come from the
> function sweep_one_area (erl_gc.c again). My assumption is that the
> optimization and lack of debug symbols in the production system merely
> obfuscated the origin of the segmentation fault there.
>
>
>
> *## The Backtrace ##*
> Included here for your convenience (also available in test case README):
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7ffff3b3e700 (LWP 26743)]
> sweep_one_area (n_hp=0x7fffe8862028, n_htop=0x7fffe8862c48, src=src@REDACTED=0x7fffe9ec2028 "", src_size=src_size@REDACTED=600224) at beam/erl_gc.c:1816
> 1816 mb->base = binary_bytes(*origptr);
> (gdb) bt
> #0  sweep_one_area (n_hp=0x7fffe8862028, n_htop=0x7fffe8862c48, src=src@REDACTED=0x7fffe9ec2028 "", src_size=src_size@REDACTED=600224) at beam/erl_gc.c:1816
> #1  0x0000000000527ea0 in do_minor (nobj=1, objv=0x7ffff3b3dd50, new_sz=121536, p=0x7ffff5c80800) at beam/erl_gc.c:1160
> #2  minor_collection (recl=<synthetic pointer>, nobj=1, objv=0x7ffff3b3dd50, need=0, p=0x7ffff5c80800) at beam/erl_gc.c:876
> #3  erts_garbage_collect (p=0x7ffff5c80800, need=need@REDACTED=0, objv=objv@REDACTED=0x7ffff3b3dd50, nobj=nobj@REDACTED=1) at beam/erl_gc.c:450
> #4  0x000000000052877b in erts_gc_after_bif_call (p=0x7ffff5c80800, result=140736302308346, regs=<optimized out>, arity=<optimized out>) at beam/erl_gc.c:370
> #5  0x0000000000571951 in process_main () at beam/beam_emu.c:2787
> #6  0x00000000004a9a70 in sched_thread_func (vesdp=0x7ffff51cc8c0) at beam/erl_process.c:7743
> #7  0x00000000006056fb in thr_wrapper (vtwd=0x7fffffffd9a0) at pthread/ethread.c:106
> #8  0x00007ffff704d374 in start_thread () from /usr/lib/libpthread.so.0
> #9  0x00007ffff6b8327d in clone () from /usr/lib/libc.so.6
>
> *## The Systems ##*
>
>
> *PRODUCTION*
> Erlang/OTP 17.4 (also observed on Erlang R15B01)
> Amazon EC2 c3.8xlarge (32 Virtual CPUs, ~64 GB Memory)
> Debian Wheezy
> uname -a: Linux rtb0.ec2.chitika.net 3.2.0-4-amd64 #1 SMP Debian 3.2.63-2
> x86_64 GNU/Linux
>
> *TESTING*
> Erlang/OTP 17.4
> Intel Core i5 760 @ 2.80GHz (4 Logical CPUs, 2 cores IIRC), ~16GB Memory
> Arch Linux (up-to-date)
> uname -a: Linux diogenes 4.0.1-1-ARCH #1 SMP PREEMPT Wed Apr 29 12:00:26
> CEST 2015 x86_64 GNU/Linux
>
>
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://erlang.org/mailman/listinfo/erlang-bugs
>
>