[erlang-bugs] Segmentation Fault in check_process_code / erts_garbage_collect

Björn-Egil Dahlberg wallentin.dahlberg@REDACTED
Fri May 22 21:43:40 CEST 2015


I would also like to add that your testcase for provoking this was very
helpful. The crash will almost always show a random symptom and rarely the
actual cause. The point of corruption has long since passed. Therefore a
testcase showing the error is the most helpful in hunting down the problem.

2015-05-22 21:32 GMT+02:00 Björn-Egil Dahlberg <wallentin.dahlberg@REDACTED>:

> I'll just mention that we have looked at this in the VM team at OTP and
> can confirm the results. We also have a reasonable idea of what's happening.
> It seems that a binary match state is not handled properly in garbage
> collect literals (gc when purging code and moving literals to the process
> heaps).
>
> I will look into it more on Monday (or perhaps this weekend).
>
> // Björn-Egil
>
> 2015-05-22 20:54 GMT+02:00 Bob Gustafson <bobgus@REDACTED>:
>
>> I am running 18.rc1 on a Macbook Air - I was able to duplicate your segv
>> (although I haven't seen the dump) - I ran it for about 20 minutes total
>> and it segv 3x in that time frame.
>>
>> While running, I used the Mac Activity Monitor to peek.
>>
>> It is using about 380% CPU, 19 Threads and seems to be only one OSX
>> process.
>>
>> Memory jumps around a bit - roughly 360-420 MB initially and then it
>> drops down - I saw 127 MB, and then it was gone - crashed.
>>
>> I will fish around for the crash dump.
>>
>> Have fun
>>
>> Bob G
>>
>>
>> On 05/21/2015 11:48 AM, Soup wrote:
>>
>> This topic, or one very similar, appears to have been discussed before
>> in the erlang-patches mailing list thread titled "erlang node crashes in
>> erts_gc_after_bif_call" from October, 2012 (
>> http://erlang.org/pipermail/erlang-patches/2012-October/003072.html). No
>> clear resolution was reached on this thread, and I am currently dealing
>> with it in production systems, so I have decided to address this mailing
>> list.
>>
>> Please see the bottom of the email for system specification, as I believe
>> this to be largely unrelated (except possibly for multithreading).
>>
>> Please feel free to request any pertinent information I may have left
>> out, or to make suggestions to improve future bug reports. I don't often
>> submit bug reports, and am not at all familiar with Erlang/OTP's particular
>> practices in this regard.
>>
>>
>> *## Scenario and Error ##*
>> The error is a segmentation fault arising out of the erts_garbage_collect
>> and check_process_code functions.
>>
>> The scenario is as follows:
>> 1) You must be hot-loading a module (in my case, this module is
>> dynamically generated) periodically.
>> 2) You must have non-suspended processes active in the module you are
>> hot-loading while it is being loaded (though not necessarily *in* the code
>> of the module; may be using terms from the module or having function
>> references to the module).
>> 3) Purging of the *old* version of the module must be happening at the
>> same time as garbage collection. (in my case, the garbage collection is
>> explicit because of the use of large binary terms with relatively few
>> reductions; that does not appear to be the case in the situation laid out
>> in the previously mentioned thread).
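
The three conditions above can be sketched in a few lines of Erlang. This is
an illustrative sketch only, not the test case from the linked repository:
the module names reload_sketch and gen_mod, the payload literal, and the
10 ms pause are all assumptions made for the example. It generates a tiny
module whose data/0 returns a binary literal, keeps reloading it (purging the
old version each time), and meanwhile runs workers that hold and match on
that literal and force garbage collections.

```erlang
-module(reload_sketch).
-export([run/2]).

%% Hypothetical driver: load version 0, start the workers, then reload the
%% generated module NumReloads times while the workers keep running.
run(NumWorkers, NumReloads) ->
    ok = load_version(0),
    Pids = [spawn(fun worker/0) || _ <- lists:seq(1, NumWorkers)],
    lists:foreach(fun(N) ->
                          ok = load_version(N),
                          timer:sleep(10)   % assumed pause between reloads
                  end,
                  lists:seq(1, NumReloads)),
    [exit(P, kill) || P <- Pids],
    %% Returns the literal of the version loaded last.
    gen_mod:data().

%% Compile a fresh version of the (hypothetical) module gen_mod from abstract
%% forms. data/0 returns a binary that lives in the module's literal area.
load_version(N) ->
    Text = "payload-" ++ integer_to_list(N),
    Literal = {bin, 3, [{bin_element, 3, {string, 3, Text}, default, default}]},
    Forms = [{attribute, 1, module, gen_mod},
             {attribute, 2, export, [{data, 0}]},
             {function, 3, data, 0, [{clause, 3, [], [], [Literal]}]}],
    {ok, gen_mod, Beam} = compile:forms(Forms),
    %% Purge the old version (if any) so that the next load is accepted.
    code:purge(gen_mod),
    {module, gen_mod} = code:load_binary(gen_mod, "gen_mod.erl", Beam),
    ok.

%% A worker lives outside gen_mod but keeps terms that originate from it:
%% it fetches the literal, matches on it, and forces a garbage collection.
worker() ->
    Data = gen_mod:data(),
    <<_Prefix:8/binary, Rest/binary>> = Data,
    _ = byte_size(Rest),
    erlang:garbage_collect(),
    worker().
```

Calling reload_sketch:run(4, 100) starts four workers and performs 100
reloads. On a VM without the defect it simply returns the literal of the
last loaded version; whether it provokes the crash on an affected VM is not
something this sketch guarantees.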
>>
>> It appears, at least to my untrained eye, that garbage collection sweeps
>> can occur at the same time as code purging, and that this seems to happen
>> without multithreading protection. My reason for this suspicion is that in
>> my production systems I began receiving one of two segmentation faults: one
>> occurring in the function check_process_code (of
>> erts/emulator/beam/beam_bif_load.c) and the other in erts_garbage_collect (of
>> erts/emulator/beam/erl_gc.c). Most of the time *in production*, the
>> segmentation fault occurred in the check_process_code function. Only
>> sometimes did it appear to be coming from erts_garbage_collect.
>>
>> *## Reproducing the Error ##*
>>
>> It took a while, but I did ultimately manage to create an app which
>> reliably produces this error (insofar as I can tell). Please see the app
>> here: https://github.com/fauxsoup/erlang-sigsegv
>>
>> There are some apparent differences from what I was observing in
>> production, but this could possibly be related to differences between my
>> production environment and my testing environment (which are non-trivial),
>> and potentially differences between my minimal test case and the production
>> service. Please see the bottom of this email for pertinent details about
>> both environments.
>>
>> For testing, and because my production deployment of Erlang does not
>> include debug symbols, I recompiled Erlang/OTP 17.4 with the flags "-g -O2"
>> to produce debug symbols and prevent aggressive optimizations which may
>> distort the stacktrace.
>>
>> The primary difference between the *results* of the error in production
>> versus testing is that the segmentation fault in testing *always* comes
>> from erts_garbage_collect. I have not at all been able to produce a test
>> result in which the segmentation fault occurred in check_process_code using
>> the minimal test case code.
>>
>> Another difference, which I believe to be caused by the inclusion of
>> debug symbols, is that erts_garbage_collect appears earlier in the
>> backtrace in testing, and that the actual segmentation fault appears to
>> come from the function sweep_one_area (erl_gc.c again). My assumption is
>> that the optimization and lack of debug symbols in the production system
>> merely obfuscated the origin of the segmentation fault there.
>>
>>
>>
>> *## The Backtrace ##*
>> Included here for your convenience (also available in test case README):
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> [Switching to Thread 0x7ffff3b3e700 (LWP 26743)]
>> sweep_one_area (n_hp=0x7fffe8862028, n_htop=0x7fffe8862c48, src=src@REDACTED=0x7fffe9ec2028
>> "", src_size=src_size@REDACTED=600224) at beam/erl_gc.c:1816
>> 1816 mb->base = binary_bytes(*origptr);
>> (gdb) bt
>> #0  sweep_one_area (n_hp=0x7fffe8862028, n_htop=0x7fffe8862c48,
>> src=src@REDACTED=0x7fffe9ec2028 "", src_size=src_size@REDACTED=600224) at
>> beam/erl_gc.c:1816
>> #1  0x0000000000527ea0 in do_minor (nobj=1, objv=0x7ffff3b3dd50,
>> new_sz=121536, p=0x7ffff5c80800) at beam/erl_gc.c:1160
>> #2  minor_collection (recl=<synthetic pointer>, nobj=1,
>> objv=0x7ffff3b3dd50, need=0, p=0x7ffff5c80800) at beam/erl_gc.c:876
>> #3  erts_garbage_collect (p=0x7ffff5c80800, need=need@REDACTED=0,
>> objv=objv@REDACTED=0x7ffff3b3dd50, nobj=nobj@REDACTED=1) at beam/erl_gc.c:450
>> #4  0x000000000052877b in erts_gc_after_bif_call (p=0x7ffff5c80800,
>> result=140736302308346, regs=<optimized out>, arity=<optimized out>) at
>> beam/erl_gc.c:370
>> #5  0x0000000000571951 in process_main () at beam/beam_emu.c:2787
>> #6  0x00000000004a9a70 in sched_thread_func (vesdp=0x7ffff51cc8c0) at
>> beam/erl_process.c:7743
>> #7  0x00000000006056fb in thr_wrapper (vtwd=0x7fffffffd9a0) at
>> pthread/ethread.c:106
>> #8  0x00007ffff704d374 in start_thread () from /usr/lib/libpthread.so.0
>> #9  0x00007ffff6b8327d in clone () from /usr/lib/libc.so.6
>>
>> *## The Systems ##*
>>
>>
>> *PRODUCTION*
>> Erlang/OTP 17.4 (also observed on Erlang R15B01)
>> Amazon EC2 c3.8xlarge (32 Virtual CPUs, ~64 GB Memory)
>> Debian Wheezy
>> uname -a: Linux rtb0.ec2.chitika.net 3.2.0-4-amd64 #1 SMP Debian
>> 3.2.63-2 x86_64 GNU/Linux
>>
>> *TESTING*
>> Erlang/OTP 17.4
>> Intel Core i5 760 @ 2.80GHz (4 Logical CPUs, 2 cores IIRC), ~16GB Memory
>> Arch Linux (up-to-date)
>> uname -a: Linux diogenes 4.0.1-1-ARCH #1 SMP PREEMPT Wed Apr 29 12:00:26
>> CEST 2015 x86_64 GNU/Linux
>>
>>
>> _______________________________________________
>> erlang-bugs mailing list
>> erlang-bugs@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-bugs
>>
>>
>