[erlang-bugs] Segmentation Fault in check_process_code / erts_garbage_collect
Bob Gustafson
bobgus@REDACTED
Fri May 22 20:54:07 CEST 2015
I am running 18.rc1 on a MacBook Air and was able to reproduce your segfault
(although I haven't seen the dump). I ran it for about 20 minutes in total
and it segfaulted three times in that time frame.
While it was running, I used the Mac Activity Monitor to peek.
It uses about 380% CPU and 19 threads, and appears to be a single OS X process.
Memory jumps around a bit: roughly 360-420 MB initially, and then it
drops. I saw 127 MB, and then it was gone (crashed).
I will fish around for the crash dump.
Have fun
Bob G
On 05/21/2015 11:48 AM, Soup wrote:
> This topic, or one very similar, appears to have been discussed before
> in the erlang-patches mailing list thread titled "erlang node crashes
> in erts_gc_after_bif_call" from October, 2012
> (http://erlang.org/pipermail/erlang-patches/2012-October/003072.html).
> No clear resolution was reached on this thread, and I am currently
> dealing with it in production systems, so I have decided to address
> this mailing list.
>
> Please see the bottom of the email for system specification, as I
> believe this to be largely unrelated (except possibly for multithreading).
>
> Please feel free to request any pertinent information I may have left
> out, or to make suggestions to improve future bug reports. I don't
> often submit bug reports, and am not at all familiar with Erlang/OTP's
> particular practices in this regard.
>
> *## Scenario and Error ##*
>
> The error is a segmentation fault arising out of the
> erts_garbage_collect and check_process_code functions.
>
> The scenario is as follows:
> 1) You must be hot-loading a module (in my case, this module is
> dynamically generated) periodically.
> 2) You must have non-suspended processes active in the module you are
> hot-loading while it is being loaded (though not necessarily *in* the
> code of the module; may be using terms from the module or having
> function references to the module).
> 3) Purging of the *old* version of the module must be happening at the
> same time as garbage collection. (in my case, the garbage collection
> is explicit because of the use of large binary terms with relatively
> few reductions; that does not appear to be the case in the situation
> laid out in the previously mentioned thread).
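The three conditions above can be sketched in Erlang roughly as follows. This is a minimal sketch, not the code from the linked test app: the module names purge_gc_sketch and hot_mod, the function names, and the use of a literal tuple as the "term from the module" are all assumptions made for illustration, and it has not been verified to trigger the crash on any particular release.

```erlang
-module(purge_gc_sketch).
-export([run/1]).

%% Hypothetical sketch of the scenario: a worker keeps fetching a term
%% from a hot-loaded module and forcing garbage collection, while the
%% caller repeatedly regenerates, purges and reloads that module.
%% All names here (purge_gc_sketch, hot_mod, value/0) are illustrative.

run(Reloads) ->
    ok = load(0),                       % make sure hot_mod exists first
    {Pid, Ref} = spawn_monitor(fun() -> worker_loop(undefined) end),
    reload_loop(Reloads),
    Pid ! stop,
    receive
        {'DOWN', Ref, process, Pid, Reason} -> Reason
    end.

%% Condition 1: periodic hot-loading of a dynamically generated module.
reload_loop(0) -> ok;
reload_loop(N) ->
    ok = load(N),
    reload_loop(N - 1).

load(N) ->
    {ok, hot_mod, Beam} = compile:forms(forms(N), []),
    %% Condition 3: purge the old version while the worker may be
    %% garbage collecting.
    code:purge(hot_mod),
    {module, hot_mod} = code:load_binary(hot_mod, "hot_mod.erl", Beam),
    ok.

%% Abstract forms for: -module(hot_mod). value() -> {version, N}.
forms(N) ->
    [{attribute, 1, module, hot_mod},
     {attribute, 2, export, [{value, 0}]},
     {function, 3, value, 0,
      [{clause, 3, [], [],
        [{tuple, 3, [{atom, 3, version}, {integer, 3, N}]}]}]}].

%% Condition 2: a non-suspended process that holds a term obtained from
%% the module, without staying inside the module's code, and that
%% garbage collects explicitly.
worker_loop(_Held) ->
    receive
        stop -> ok
    after 0 ->
        Term = hot_mod:value(),
        erlang:garbage_collect(),
        worker_loop(Term)
    end.
```

Calling purge_gc_sketch:run(1000) from a shell would exercise the reload and collect paths concurrently; on a release without the fault it should simply return the worker's exit reason.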
>
> It appears, at least to my untrained eye, that garbage collection
> sweeps can occur at the same time as code purging, and that this seems
> to happen without multithreading protection. My reason for this
> suspicion is that in my production systems I began receiving one of
> two segmentation faults: one occurring in the function
> check_process_code (of erts/emulator/beam/beam_bif_load.c) and the
> other in erts_garbage_collect (of erts/emulator/beam/erl_gc.c). Most of
> the time *in production*, the segmentation fault occurred in the
> check_process_code function. Only sometimes did it appear to be coming
> from erts_garbage_collect.
>
> *## Reproducing the Error ##*
>
> It took a while, but I did ultimately manage to create an app which
> reliably produces this error (insofar as I can tell). Please see the
> app here: https://github.com/fauxsoup/erlang-sigsegv
>
> There are some apparent differences from what I was observing in
> production, but this could possibly be related to differences between
> my production environment and my testing environment (which are
> non-trivial), and potentially differences between my minimal test case
> and the production service. Please see the bottom of this email for
> pertinent details about both environments.
>
> For testing, and because my production deployment of Erlang does not
> include debug symbols, I recompiled Erlang/OTP 17.4 with the flags "-g
> -O2" to produce debug symbols and prevent aggressive optimizations
> which may distort the stacktrace.
>
> The primary difference between the *results* of the error in
> production versus testing is that the segmentation fault in testing
> *always* comes from erts_garbage_collect. I have not been able to
> produce a test result in which the segmentation fault occurred in
> check_process_code using the minimal test case code.
>
> Another difference, which I believe to be caused by the inclusion of
> debug symbols, is that erts_garbage_collect appears earlier in the
> backtrace in testing, and that the actual segmentation fault appears
> to come from the function sweep_one_area (erl_gc.c again). My
> assumption is that the optimization and lack of debug symbols in the
> production system merely obfuscated the origin of the segmentation
> fault there.
>
> *## The Backtrace ##*
>
> Included here for your convenience (also available in test case README):
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7ffff3b3e700 (LWP 26743)]
> sweep_one_area (n_hp=0x7fffe8862028, n_htop=0x7fffe8862c48,
> src=src@REDACTED=0x7fffe9ec2028 "", src_size=src_size@REDACTED=600224) at
> beam/erl_gc.c:1816
> 1816        mb->base = binary_bytes(*origptr);
> (gdb) bt
> #0 sweep_one_area (n_hp=0x7fffe8862028, n_htop=0x7fffe8862c48,
> src=src@REDACTED=0x7fffe9ec2028 "", src_size=src_size@REDACTED=600224) at
> beam/erl_gc.c:1816
> #1 0x0000000000527ea0 in do_minor (nobj=1, objv=0x7ffff3b3dd50,
> new_sz=121536, p=0x7ffff5c80800) at beam/erl_gc.c:1160
> #2 minor_collection (recl=<synthetic pointer>, nobj=1,
> objv=0x7ffff3b3dd50, need=0, p=0x7ffff5c80800) at beam/erl_gc.c:876
> #3 erts_garbage_collect (p=0x7ffff5c80800, need=need@REDACTED=0,
> objv=objv@REDACTED=0x7ffff3b3dd50, nobj=nobj@REDACTED=1) at beam/erl_gc.c:450
> #4 0x000000000052877b in erts_gc_after_bif_call (p=0x7ffff5c80800,
> result=140736302308346, regs=<optimized out>, arity=<optimized out>)
> at beam/erl_gc.c:370
> #5 0x0000000000571951 in process_main () at beam/beam_emu.c:2787
> #6 0x00000000004a9a70 in sched_thread_func (vesdp=0x7ffff51cc8c0) at
> beam/erl_process.c:7743
> #7 0x00000000006056fb in thr_wrapper (vtwd=0x7fffffffd9a0) at
> pthread/ethread.c:106
> #8 0x00007ffff704d374 in start_thread () from /usr/lib/libpthread.so.0
> #9 0x00007ffff6b8327d in clone () from /usr/lib/libc.so.6
>
> *## The Systems ##*
>
> *PRODUCTION*
> Erlang/OTP 17.4 (also observed on Erlang R15B01)
> Amazon EC2 c3.8xlarge (32 Virtual CPUs, ~64 GB Memory)
> Debian Wheezy
> uname -a: Linux rtb0.ec2.chitika.net 3.2.0-4-amd64 #1 SMP Debian
> 3.2.63-2 x86_64 GNU/Linux
>
> *TESTING*
> Erlang/OTP 17.4
> Intel Core i5 760 @ 2.80GHz (4 Logical CPUs, 2 cores IIRC), ~16GB Memory
> Arch Linux (up-to-date)
> uname -a: Linux diogenes 4.0.1-1-ARCH #1 SMP PREEMPT Wed Apr 29
> 12:00:26 CEST 2015 x86_64 GNU/Linux