[erlang-bugs] Segmentation Fault in check_process_code / erts_garbage_collect

Fri May 22 20:54:07 CEST 2015

I am running 18.rc1 on a MacBook Air. I was able to reproduce your segfault 
(although I haven't seen the dump): I ran it for about 20 minutes in total 
and it segfaulted three times in that time frame.

While running, I used the Mac Activity Monitor to peek.

It uses about 380% CPU and 19 threads, and appears to be a single OS X process.

Memory jumps around a bit - roughly 360-420 MB initially and then it 
drops down - I saw 127 MB, and then it was gone - crashed.

I will fish around for the crash dump.

Have fun

Bob G

On 05/21/2015 11:48 AM, Soup wrote:
> This topic, or one very similar, appears to have been discussed before 
> in the erlang-patches mailing list thread titled "erlang node crashes 
> in erts_gc_after_bif_call" from October, 2012 
> (http://erlang.org/pipermail/erlang-patches/2012-October/003072.html). 
> No clear resolution was reached on this thread, and I am currently 
> dealing with it in production systems, so I have decided to address 
> this mailing list.
>
> Please see the bottom of the email for system specification, as I 
> believe this to be largely unrelated (except possibly for multithreading).
>
> Please feel free to request any pertinent information I may have left 
> out, or to make suggestions to improve future bug reports. I don't 
> often submit bug reports, and am not at all familiar with Erlang/OTP's 
> particular practices in this regard.
>
> *## Scenario and Error ##*
>
> The error is a segmentation fault arising out of the 
> erts_garbage_collect and check_process_code functions.
>
> The scenario is as follows:
> 1) You must be hot-loading a module (in my case, this module is 
> dynamically generated) periodically.
> 2) You must have non-suspended processes active in the module you are 
> hot-loading while it is being loaded (though not necessarily *in* the 
> code of the module; may be using terms from the module or having 
> function references to the module).
> 3) Purging of the *old* version of the module must be happening at the 
> same time as garbage collection. (in my case, the garbage collection 
> is explicit because of the use of large binary terms with relatively 
> few reductions; that does not appear to be the case in the situation 
> laid out in the previously mentioned thread).
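
The three conditions above can be sketched as a small Erlang module. This is only an illustration of the scenario as described, not the test case from the linked repository: the module names `hotload_sketch` and `hotload_sketch_gen`, the `blob/0` function, the worker count, and the binary size are all assumptions made for the example.

```erlang
-module(hotload_sketch).
-export([run/1]).

%% Hypothetical name of the dynamically generated module.
-define(GEN, hotload_sketch_gen).

%% run(Rounds): start a few workers, then hot-load and purge the
%% generated module Rounds times while the workers keep collecting.
run(Rounds) when is_integer(Rounds), Rounds > 0 ->
    ok = load(0),
    Workers = [spawn(fun() -> worker(none) end) || _ <- lists:seq(1, 4)],
    lists:foreach(fun load/1, lists:seq(1, Rounds)),
    [W ! stop || W <- Workers],
    ok.

%% Step 1: generate version V of the module, purge the old version,
%% and load the new one.
load(V) ->
    Src = ["-module(hotload_sketch_gen).",
           "-export([blob/0]).",
           "blob() -> {" ++ integer_to_list(V) ++ ", <<0:8192>>}."],
    Forms = [begin
                 {ok, Tokens, _} = erl_scan:string(S),
                 {ok, Form} = erl_parse:parse_form(Tokens),
                 Form
             end || S <- Src],
    {ok, ?GEN, Bin} = compile:forms(Forms),
    code:purge(?GEN),
    {module, ?GEN} = code:load_binary(?GEN, "generated", Bin),
    ok.

%% Steps 2 and 3: a non-suspended process that holds a term from the
%% generated module and garbage collects explicitly on every iteration.
worker(_Previous) ->
    Term = ?GEN:blob(),
    erlang:garbage_collect(),
    receive
        stop -> ok
    after 0 ->
        worker(Term)
    end.
```

Calling `hotload_sketch:run(1000)` from a shell exercises loading, purging, and explicit collection concurrently. Whether it triggers the fault on a given release is not something this sketch can promise.
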
>
> It appears, at least to my untrained eye, that garbage collection 
> sweeps can occur at the same time as code purging, and that this seems 
> to happen without multithreading protection. My reason for this 
> suspicion is that in my production systems I began receiving one of 
> two segmentation faults: one occurring in the function 
> check_process_code (of erts/emulator/beam/beam_bif_load.c) and the 
> other in erts_garbage_collect (of erts/emulator/beam/erl_gc.c). Most 
> of the time *in production*, the segmentation fault occurred in the 
> check_process_code function. Only sometimes did it appear to be coming 
> from erts_garbage_collect.
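
For context, the old-code check that a purge depends on is also reachable from Erlang through the documented BIF `erlang:check_process_code/2`. The sketch below shows what it reports for a process that is still inside the previous version of a module. It is an assumption that this BIF exercises the same C function named above; the module names `old_code_sketch` and `old_code_sketch_gen` and the functions `wait/1` and `version/0` are hypothetical.

```erlang
-module(old_code_sketch).
-export([demo/0]).

%% Hypothetical name for the generated module.
-define(GEN, old_code_sketch_gen).

%% Compile version V of the generated module and load it, purging any
%% leftover old version first so the load cannot fail with not_purged.
load(V) ->
    Src = ["-module(old_code_sketch_gen).",
           "-export([wait/1, version/0]).",
           "wait(Parent) -> Parent ! ready, receive stop -> ok end.",
           "version() -> " ++ integer_to_list(V) ++ "."],
    Forms = [begin
                 {ok, Tokens, _} = erl_scan:string(S),
                 {ok, Form} = erl_parse:parse_form(Tokens),
                 Form
             end || S <- Src],
    {ok, ?GEN, Bin} = compile:forms(Forms),
    code:purge(?GEN),
    {module, ?GEN} = code:load_binary(?GEN, "generated", Bin),
    ok.

demo() ->
    ok = load(1),
    {Pid, Ref} = spawn_monitor(?GEN, wait, [self()]),
    receive ready -> ok end,          % Pid is now inside version 1
    ok = load(2),                     % version 1 becomes old code
    UsesOld = erlang:check_process_code(Pid, ?GEN),
    Pid ! stop,
    receive {'DOWN', Ref, process, Pid, _} -> ok end,
    Purged = code:soft_purge(?GEN),   % succeeds once nobody uses old code
    {UsesOld, Purged}.
```

`old_code_sketch:demo()` should return `{true, true}`: the waiting process is reported as using old code, and the soft purge succeeds once that process has exited.
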
>
> *## Reproducing the Error ##*
>
> It took a while, but I did ultimately manage to create an app which 
> reliably produces this error (insofar as I can tell). Please see the 
> app here: https://github.com/fauxsoup/erlang-sigsegv
>
> There are some apparent differences from what I was observing in 
> production, but this could possibly be related to differences between 
> my production environment and my testing environment (which are 
> non-trivial), and potentially differences between my minimal test case 
> and the production service. Please see the bottom of this email for 
> pertinent details about both environments.
>
> For testing, and because my production deployment of Erlang does not 
> include debug symbols, I recompiled Erlang/OTP 17.4 with the flags "-g 
> -O2" to produce debug symbols and prevent aggressive optimizations 
> which may distort the stacktrace.
>
> The primary difference between the *results* of the error in 
> production versus testing is that the segmentation fault in testing 
> *always* comes from erts_garbage_collect. I have not been able to 
> produce a test result in which the segmentation fault occurred in 
> check_process_code using the minimal test case code.
>
> Another difference, which I believe to be caused by the inclusion of 
> debug symbols, is that erts_garbage_collect appears earlier in the 
> backtrace in testing, and that the actual segmentation fault appears 
> to come from the function sweep_one_area (erl_gc.c again). My 
> assumption is that the optimization and lack of debug symbols in the 
> production system merely obfuscated the origin of the segmentation 
> fault there.
>
> *## The Backtrace ##*
>
> Included here for your convenience (also available in test case README):
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7ffff3b3e700 (LWP 26743)]
> sweep_one_area (n_hp=0x7fffe8862028, n_htop=0x7fffe8862c48, 
> src=src@REDACTED=0x7fffe9ec2028 "", src_size=src_size@REDACTED=600224) at 
> beam/erl_gc.c:1816
> 1816        mb->base = binary_bytes(*origptr);
> (gdb) bt
> #0  sweep_one_area (n_hp=0x7fffe8862028, n_htop=0x7fffe8862c48, 
> src=src@REDACTED=0x7fffe9ec2028 "", src_size=src_size@REDACTED=600224) at 
> beam/erl_gc.c:1816
> #1  0x0000000000527ea0 in do_minor (nobj=1, objv=0x7ffff3b3dd50, 
> new_sz=121536, p=0x7ffff5c80800) at beam/erl_gc.c:1160
> #2  minor_collection (recl=<synthetic pointer>, nobj=1, 
> objv=0x7ffff3b3dd50, need=0, p=0x7ffff5c80800) at beam/erl_gc.c:876
> #3  erts_garbage_collect (p=0x7ffff5c80800, need=need@REDACTED=0, 
> objv=objv@REDACTED=0x7ffff3b3dd50, nobj=nobj@REDACTED=1) at beam/erl_gc.c:450
> #4  0x000000000052877b in erts_gc_after_bif_call (p=0x7ffff5c80800, 
> result=140736302308346, regs=<optimized out>, arity=<optimized out>) 
> at beam/erl_gc.c:370
> #5  0x0000000000571951 in process_main () at beam/beam_emu.c:2787
> #6  0x00000000004a9a70 in sched_thread_func (vesdp=0x7ffff51cc8c0) at 
> beam/erl_process.c:7743
> #7  0x00000000006056fb in thr_wrapper (vtwd=0x7fffffffd9a0) at 
> pthread/ethread.c:106
> #8  0x00007ffff704d374 in start_thread () from /usr/lib/libpthread.so.0
> #9  0x00007ffff6b8327d in clone () from /usr/lib/libc.so.6
>
> *## The Systems ##*
>
> *PRODUCTION*
> Erlang/OTP 17.4 (also observed on Erlang R15B01)
> Amazon EC2 c3.8xlarge (32 Virtual CPUs, ~64 GB Memory)
> Debian Wheezy
> uname -a: Linux rtb0.ec2.chitika.net 3.2.0-4-amd64 #1 SMP Debian 
> 3.2.63-2 x86_64 GNU/Linux
>
> *TESTING*
> Erlang/OTP 17.4
> Intel Core i5 760 @ 2.80GHz (4 Logical CPUs, 2 cores IIRC), ~16GB Memory
> Arch Linux (up-to-date)
> uname -a: Linux diogenes 4.0.1-1-ARCH #1 SMP PREEMPT Wed Apr 29 
> 12:00:26 CEST 2015 x86_64 GNU/Linux