[erlang-bugs] Segmentation Fault in check_process_code / erts_garbage_collect

Fri May 22 23:31:13 CEST 2015

I am only upset that I am not able to replicate the original stacktrace I
saw in my production servers with this example.

However, what little I was able to glean from the stack traces taken
without debug symbols (no -g) showed that both the garbage collection and
check_process_code errors were happening when interacting with the "off
heap" of the process. I am not sure what lives in the off heap, but
binaries sound like they would (though I think there was also something
about function references to the old version of the module in the vicinity).
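
To make that suspicion a bit more concrete, here is a toy C model of how I
read Björn-Egil's diagnosis quoted below: literals are copied to the process
heap ahead of a purge, and a match state that caches a pointer into the old
literal area is left dangling if the copy step does not fix it up. Every name
here (toy_area, toy_match_state, toy_move_literal, toy_dangling_after_purge)
is made up for illustration and is not ERTS code. It only compares pointer
ranges and never reads freed memory, so treat it as a sketch of the failure
class and not of the actual bug.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-ins only; none of these types or functions exist in ERTS. */
typedef struct {
    unsigned char bytes[64];
    size_t used;
} toy_area;                     /* a module's literal area, or a process heap */

typedef struct {
    const unsigned char *bin;   /* the binary term itself (a GC root) */
    const unsigned char *base;  /* cached data pointer used while matching */
    size_t offset;              /* current match position, relative to base */
} toy_match_state;

/* True if p points somewhere inside area a. */
static bool toy_points_into(const toy_area *a, const unsigned char *p)
{
    uintptr_t lo = (uintptr_t)a->bytes;
    uintptr_t hi = lo + sizeof a->bytes;
    uintptr_t v = (uintptr_t)p;
    return v >= lo && v < hi;
}

/* Copy len bytes of literal data to the heap ahead of a purge.
 * The root (ms->bin) is always updated. The cached base is only updated
 * when fix_base is true; otherwise it keeps pointing at the old area. */
static void toy_move_literal(toy_area *heap, const toy_area *lit,
                             toy_match_state *ms, size_t len, bool fix_base)
{
    assert(toy_points_into(lit, ms->bin));
    assert(heap->used + len <= sizeof heap->bytes);

    unsigned char *dst = heap->bytes + heap->used;
    memcpy(dst, ms->bin, len);
    heap->used += len;

    if (fix_base)
        ms->base = dst + (ms->base - ms->bin);
    ms->bin = dst;
}

/* Returns true if, after the move, the match state still refers to the
 * literal area that the purge is about to free. */
bool toy_dangling_after_purge(bool fix_base)
{
    toy_area lit = { .used = 5 };
    toy_area heap = { .used = 0 };
    memcpy(lit.bytes, "hello", 5);

    toy_match_state ms = { .bin = lit.bytes, .base = lit.bytes, .offset = 2 };
    toy_move_literal(&heap, &lit, &ms, 5, fix_base);

    return toy_points_into(&lit, ms.bin) || toy_points_into(&lit, ms.base);
}
```

Calling toy_dangling_after_purge(false) models the broken case, where the
cached base still points into the old literal area; calling it with true
models a copy step that also relocates the cached pointer.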

Please let me know if I can provide any more information.

~Zac

On Fri, May 22, 2015 at 3:43 PM, Björn-Egil Dahlberg <
wallentin.dahlberg@REDACTED> wrote:

> I would also like to add that your testcase for provoking this was very
> helpful. The crash will almost always show a random symptom and rarely the
> actual cause. The point of corruption has long since passed. Therefore a
> testcase showing the error is the most helpful in hunting down the problem.
>
> 2015-05-22 21:32 GMT+02:00 Björn-Egil Dahlberg <
> wallentin.dahlberg@REDACTED>:
>
>> I'll just mention that we have looked at this in the VM team at OTP and
>> can confirm the results. We also have a reasonable idea of what's happening.
>> It seems that a binary match state is not handled properly in garbage
>> collect literals (gc when purging code and moving literals to the process
>> heaps).
>>
>> I will look into it more on Monday (or perhaps this weekend).
>>
>> // Björn-Egil
>>
>> 2015-05-22 20:54 GMT+02:00 Bob Gustafson <bobgus@REDACTED>:
>>
>>> I am running 18.rc1 on a MacBook Air and was able to duplicate your
>>> segv (although I haven't seen the dump). I ran it for about 20 minutes
>>> total and it segfaulted 3 times in that time frame.
>>>
>>> While running, I used the Mac Activity Monitor to peek.
>>>
>>> It is using about 380% CPU, 19 Threads and seems to be only one OSX
>>> process.
>>>
>>> Memory jumps around a bit - roughly 360-420 MB initially and then it
>>> drops down - I saw 127 MB, and then it was gone - crashed.
>>>
>>> I will fish around for the crash dump.
>>>
>>> Have fun
>>>
>>> Bob G
>>>
>>>
>>> On 05/21/2015 11:48 AM, Soup wrote:
>>>
>>> This topic, or one very similar, appears to have been discussed before
>>> in the erlang-patches mailing list thread titled "erlang node crashes in
>>> erts_gc_after_bif_call" from October, 2012 (
>>> http://erlang.org/pipermail/erlang-patches/2012-October/003072.html).
>>> No clear resolution was reached on this thread, and I am currently dealing
>>> with it in production systems, so I have decided to address this mailing
>>> list.
>>>
>>> Please see the bottom of the email for system specification, as I
>>> believe this to be largely unrelated (except possibly for multithreading).
>>>
>>> Please feel free to request any pertinent information I may have left
>>> out, or to make suggestions to improve future bug reports. I don't often
>>> submit bug reports, and am not at all familiar with Erlang/OTP's particular
>>> practices in this regard.
>>>
>>>
>>> *## Scenario and Error ##*
>>> The error is a segmentation fault arising out of the
>>> erts_garbage_collect and check_process_code functions.
>>>
>>> The scenario is as follows:
>>> 1) You must be hot-loading a module (in my case, this module is
>>> dynamically generated) periodically.
>>> 2) You must have non-suspended processes active in the module you are
>>> hot-loading while it is being loaded (though not necessarily *in* the code
>>> of the module; may be using terms from the module or having function
>>> references to the module).
>>> 3) Purging of the *old* version of the module must be happening at the
>>> same time as garbage collection. (in my case, the garbage collection is
>>> explicit because of the use of large binary terms with relatively few
>>> reductions; that does not appear to be the case in the situation laid out
>>> in the previously mentioned thread).
>>>
>>> It appears, at least to my untrained eye, that garbage collection sweeps
>>> can occur at the same time as code purging, and that this seems to happen
>>> without multithreading protection. My reason for this suspicion is that in
>>> my production systems I began receiving one of two segmentation faults: one
>>> occurring in the function check_process_code (of
>>> erts/emulator/beam/beam_bif_load.c) and the other in erts_garbage_collect
>>> (of erts/emulator/beam/erl_gc.c). Most of the time *in production*, the
>>> segmentation fault occurred in the check_process_code function. Only
>>> sometimes did it appear to be coming from erts_garbage_collect.
>>>
>>> *## Reproducing the Error ##*
>>>
>>> It took a while, but I did ultimately manage to create an app which
>>> reliably produces this error (insofar as I can tell). Please see the app
>>> here: https://github.com/fauxsoup/erlang-sigsegv
>>>
>>> There are some apparent differences from what I was observing in
>>> production, but this could possibly be related to differences between my
>>> production environment and my testing environment (which are non-trivial),
>>> and potentially differences between my minimal test case and the production
>>> service. Please see the bottom of this email for pertinent details about
>>> both environments.
>>>
>>> For testing, and because my production deployment of Erlang does not
>>> include debug symbols, I recompiled Erlang/OTP 17.4 with the flags "-g -O2"
>>> to produce debug symbols and prevent aggressive optimizations which may
>>> distort the stacktrace.
>>>
>>> The primary difference between the *results* of the error in production
>>> versus testing is that the segmentation fault in testing *always* comes
>>> from erts_garbage_collect. I have not at all been able to produce a test
>>> result in which the segmentation fault occurred in check_process_code using
>>> the minimal test case code.
>>>
>>> Another difference, which I believe to be caused by the inclusion of
>>> debug symbols, is that erts_garbage_collect appears earlier in the
>>> backtrace in testing, and that the actual segmentation fault appears to
>>> come from the function sweep_one_area (erl_gc.c again). My assumption is
>>> that the optimization and lack of debug symbols in the production system
>>> merely obfuscated the origin of the segmentation fault there.
>>>
>>>
>>>
>>> *## The Backtrace ##*
>>> Included here for your convenience (also available in test case README):
>>>
>>> Program received signal SIGSEGV, Segmentation fault.
>>> [Switching to Thread 0x7ffff3b3e700 (LWP 26743)]
>>> sweep_one_area (n_hp=0x7fffe8862028, n_htop=0x7fffe8862c48,
>>> src=src@REDACTED=0x7fffe9ec2028 "", src_size=src_size@REDACTED=600224) at
>>> beam/erl_gc.c:1816
>>> 1816 mb->base = binary_bytes(*origptr);
>>> (gdb) bt
>>> #0  sweep_one_area (n_hp=0x7fffe8862028, n_htop=0x7fffe8862c48,
>>> src=src@REDACTED=0x7fffe9ec2028 "", src_size=src_size@REDACTED=600224) at
>>> beam/erl_gc.c:1816
>>> #1  0x0000000000527ea0 in do_minor (nobj=1, objv=0x7ffff3b3dd50,
>>> new_sz=121536, p=0x7ffff5c80800) at beam/erl_gc.c:1160
>>> #2  minor_collection (recl=<synthetic pointer>, nobj=1,
>>> objv=0x7ffff3b3dd50, need=0, p=0x7ffff5c80800) at beam/erl_gc.c:876
>>> #3  erts_garbage_collect (p=0x7ffff5c80800, need=need@REDACTED=0,
>>> objv=objv@REDACTED=0x7ffff3b3dd50, nobj=nobj@REDACTED=1) at beam/erl_gc.c:450
>>> #4  0x000000000052877b in erts_gc_after_bif_call (p=0x7ffff5c80800,
>>> result=140736302308346, regs=<optimized out>, arity=<optimized out>) at
>>> beam/erl_gc.c:370
>>> #5  0x0000000000571951 in process_main () at beam/beam_emu.c:2787
>>> #6  0x00000000004a9a70 in sched_thread_func (vesdp=0x7ffff51cc8c0) at
>>> beam/erl_process.c:7743
>>> #7  0x00000000006056fb in thr_wrapper (vtwd=0x7fffffffd9a0) at
>>> pthread/ethread.c:106
>>> #8  0x00007ffff704d374 in start_thread () from /usr/lib/libpthread.so.0
>>> #9  0x00007ffff6b8327d in clone () from /usr/lib/libc.so.6
>>>
>>> *## The Systems ##*
>>>
>>>
>>> *PRODUCTION*
>>> Erlang/OTP 17.4 (also observed on Erlang R15B01)
>>> Amazon EC2 c3.8xlarge (32 Virtual CPUs, ~64 GB Memory)
>>> Debian Wheezy
>>> uname -a: Linux rtb0.ec2.chitika.net 3.2.0-4-amd64 #1 SMP Debian
>>> 3.2.63-2 x86_64 GNU/Linux
>>>
>>> *TESTING*
>>> Erlang/OTP 17.4
>>> Intel Core i5 760 @ 2.80GHz (4 Logical CPUs, 2 cores IIRC), ~16GB Memory
>>> Arch Linux (up-to-date)
>>> uname -a: Linux diogenes 4.0.1-1-ARCH #1 SMP PREEMPT Wed Apr 29 12:00:26
>>> CEST 2015 x86_64 GNU/Linux
>>>
>>>
>>> _______________________________________________
>>> erlang-bugs mailing list
>>> erlang-bugs@REDACTED
>>> http://erlang.org/mailman/listinfo/erlang-bugs
>>>
>>>
>>
>