[erlang-bugs] Segmentation Fault in check_process_code / erts_garbage_collect

Sat May 23 01:40:07 CEST 2015

Could you see if the following solves the problem:

git fetch https://github.com/psyeugenic/otp.git egil/fix-purge-literals

That branch is based on OTP-17.5

https://github.com/psyeugenic/otp/compare/maint...psyeugenic:egil/fix-purge-literals

Just keep in mind that this hasn't been rigorously tested; I have only run
erlang-sigsegv against it, which is still going strong.
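
For anyone trying the branch, the fetch step above can be sketched roughly as
follows. This assumes you are inside an existing OTP git clone. The local
branch name fix-purge-literals, the run helper and the DRY_RUN switch are
illustrative assumptions, not part of the fix itself.

```shell
#!/bin/sh
# Sketch only: fetch the proposed fix branch into an existing OTP clone.
REPO_URL="https://github.com/psyeugenic/otp.git"
BRANCH="egil/fix-purge-literals"
LOCAL_BRANCH="fix-purge-literals"   # hypothetical local branch name

# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Fetch the remote branch and check it out as a new local branch.
fetch_fix() {
    run git fetch "$REPO_URL" "$BRANCH" &&
    run git checkout -b "$LOCAL_BRANCH" FETCH_HEAD
}
```

Calling fetch_fix with DRY_RUN=1 only prints the two git commands. Without it
the commands are executed, after which the emulator needs to be rebuilt before
re-running erlang-sigsegv.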

// Björn-Egil

2015-05-22 23:31 GMT+02:00 Soup <zachary.hueras@REDACTED>:

> I am only upset that I am not able to replicate the original stacktrace I
> saw in my production servers with this example.
>
> However, what little I was able to glean from the stack traces without
> debug symbols (no -g) showed that both the garbage collection and
> check_process_code errors were happening when interacting with the
> "off heap" of the process. Not sure
> what would be in the off heap, but binaries sound like they would be
> (though I think there was something about function references to the old
> version of the module in the vicinity as well).
>
> Please let me know if I can provide any more information.
>
> ~Zac
>
> On Fri, May 22, 2015 at 3:43 PM, Björn-Egil Dahlberg <
> wallentin.dahlberg@REDACTED> wrote:
>
>> I would also like to add that your testcase for provoking this was very
>> helpful. The crash will almost always show a random symptom and rarely the
>> actual cause. The point of corruption has long since passed. Therefore a
>> testcase showing the error is the most helpful in hunting down the problem.
>>
>> 2015-05-22 21:32 GMT+02:00 Björn-Egil Dahlberg <
>> wallentin.dahlberg@REDACTED>:
>>
>>> I'll just mention that we have looked at this in the VM team at OTP and
>>> can confirm the results. We also have a reasonable idea of what's happening.
>>> It seems that a binary match state is not handled properly in garbage
>>> collect literals (gc when purging code and moving literals to the process
>>> heaps).
>>>
>>> I will look into it more on Monday (or perhaps this weekend).
>>>
>>> // Björn-Egil
>>>
>>> 2015-05-22 20:54 GMT+02:00 Bob Gustafson <bobgus@REDACTED>:
>>>
>>>> I am running 18.rc1 on a Macbook Air - I was able to duplicate your
>>>> segv (although I haven't seen the dump) - I ran it for about 20 minutes
>>>> total and it segfaulted 3 times in that time frame.
>>>>
>>>> While running, I used the Mac Activity Monitor to peek.
>>>>
>>>> It is using about 380% CPU, 19 Threads and seems to be only one OSX
>>>> process.
>>>>
>>>> Memory jumps around a bit - roughly 360-420 MB initially and then it
>>>> drops down - I saw 127 MB, and then it was gone - crashed.
>>>>
>>>> I will fish around for the crash dump.
>>>>
>>>> Have fun
>>>>
>>>> Bob G
>>>>
>>>>
>>>> On 05/21/2015 11:48 AM, Soup wrote:
>>>>
>>>> This topic, or one very similar, appears to have been discussed
>>>> before in the erlang-patches mailing list thread titled "erlang node
>>>> crashes in erts_gc_after_bif_call" from October, 2012 (
>>>> http://erlang.org/pipermail/erlang-patches/2012-October/003072.html).
>>>> No clear resolution was reached on this thread, and I am currently dealing
>>>> with it in production systems, so I have decided to address this mailing
>>>> list.
>>>>
>>>> Please see the bottom of the email for system specification, as I
>>>> believe this to be largely unrelated (except possibly for multithreading).
>>>>
>>>> Please feel free to request any pertinent information I may have left
>>>> out, or to make suggestions to improve future bug reports. I don't often
>>>> submit bug reports, and am not at all familiar with Erlang/OTP's particular
>>>> practices in this regard.
>>>>
>>>>
>>>> *## Scenario and Error ##*
>>>> The error is a segmentation fault arising out of the
>>>> erts_garbage_collect and check_process_code functions.
>>>>
>>>> The scenario is as follows:
>>>> 1) You must be hot-loading a module (in my case, this module is
>>>> dynamically generated) periodically.
>>>> 2) You must have non-suspended processes active in the module you are
>>>> hot-loading while it is being loaded (though not necessarily *in* the code
>>>> of the module; may be using terms from the module or having function
>>>> references to the module).
>>>> 3) Purging of the *old* version of the module must be happening at the
>>>> same time as garbage collection. (in my case, the garbage collection is
>>>> explicit because of the use of large binary terms with relatively few
>>>> reductions; that does not appear to be the case in the situation laid out
>>>> in the previously mentioned thread).
>>>>
>>>> It appears, at least to my untrained eye, that garbage collection
>>>> sweeps can occur at the same time as code purging, and that this seems to
>>>> happen without multithreading protection. My reason for this suspicion is
>>>> that in my production systems I began receiving one of two segmentation
>>>> faults: one occurring in the function check_process_code (of
>>>> erts/emulator/beam/beam_bif_load.c) and the other in erts_garbage_collect
>>>> (of erts/emulator/beam/erl_gc.c). Most of the time *in production*, the
>>>> segmentation fault occurred in the check_process_code function. Only
>>>> sometimes did it appear to be coming from erts_garbage_collect.
>>>>
>>>> *## Reproducing the Error ##*
>>>>
>>>> It took a while, but I did ultimately manage to create an app which
>>>> reliably produces this error (insofar as I can tell). Please see the app
>>>> here: https://github.com/fauxsoup/erlang-sigsegv
>>>>
>>>> There are some apparent differences from what I was observing in
>>>> production, but this could possibly be related to differences between my
>>>> production environment and my testing environment (which are non-trivial),
>>>> and potentially differences between my minimal test case and the production
>>>> service. Please see the bottom of this email for pertinent details about
>>>> both environments.
>>>>
>>>> For testing, and because my production deployment of Erlang does not
>>>> include debug symbols, I recompiled Erlang/OTP 17.4 with the flags "-g -O2"
>>>> to produce debug symbols and prevent aggressive optimizations which may
>>>> distort the stacktrace.
>>>>
>>>> The primary difference between the *results* of the error in production
>>>> versus testing is that the segmentation fault in testing *always* comes
>>>> from erts_garbage_collect. I have not at all been able to produce a test
>>>> result in which the segmentation fault occurred in check_process_code using
>>>> the minimal test case code.
>>>>
>>>> Another difference, which I believe to be caused by the inclusion of
>>>> debug symbols, is that erts_garbage_collect appears earlier in the
>>>> backtrace in testing, and that the actual segmentation fault appears to
>>>> come from the function sweep_one_area (erl_gc.c again). My assumption is
>>>> that the optimization and lack of debug symbols in the production system
>>>> merely obfuscated the origin of the segmentation fault there.
>>>>
>>>>
>>>>
>>>> *## The Backtrace ##*
>>>> Included here for your convenience (also available in test case README):
>>>>
>>>> Program received signal SIGSEGV, Segmentation fault.
>>>> [Switching to Thread 0x7ffff3b3e700 (LWP 26743)]
>>>> sweep_one_area (n_hp=0x7fffe8862028, n_htop=0x7fffe8862c48,
>>>> src=src@REDACTED=0x7fffe9ec2028 "", src_size=src_size@REDACTED=600224) at
>>>> beam/erl_gc.c:1816
>>>> 1816 mb->base = binary_bytes(*origptr);
>>>> (gdb) bt
>>>> #0  sweep_one_area (n_hp=0x7fffe8862028, n_htop=0x7fffe8862c48,
>>>> src=src@REDACTED=0x7fffe9ec2028 "", src_size=src_size@REDACTED=600224) at
>>>> beam/erl_gc.c:1816
>>>> #1  0x0000000000527ea0 in do_minor (nobj=1, objv=0x7ffff3b3dd50,
>>>> new_sz=121536, p=0x7ffff5c80800) at beam/erl_gc.c:1160
>>>> #2  minor_collection (recl=<synthetic pointer>, nobj=1,
>>>> objv=0x7ffff3b3dd50, need=0, p=0x7ffff5c80800) at beam/erl_gc.c:876
>>>> #3  erts_garbage_collect (p=0x7ffff5c80800, need=need@REDACTED=0,
>>>> objv=objv@REDACTED=0x7ffff3b3dd50, nobj=nobj@REDACTED=1) at beam/erl_gc.c:450
>>>> #4  0x000000000052877b in erts_gc_after_bif_call (p=0x7ffff5c80800,
>>>> result=140736302308346, regs=<optimized out>, arity=<optimized out>) at
>>>> beam/erl_gc.c:370
>>>> #5  0x0000000000571951 in process_main () at beam/beam_emu.c:2787
>>>> #6  0x00000000004a9a70 in sched_thread_func (vesdp=0x7ffff51cc8c0) at
>>>> beam/erl_process.c:7743
>>>> #7  0x00000000006056fb in thr_wrapper (vtwd=0x7fffffffd9a0) at
>>>> pthread/ethread.c:106
>>>> #8  0x00007ffff704d374 in start_thread () from /usr/lib/libpthread.so.0
>>>> #9  0x00007ffff6b8327d in clone () from /usr/lib/libc.so.6
>>>>
>>>> *## The Systems ##*
>>>>
>>>>
>>>> *PRODUCTION*
>>>> Erlang/OTP 17.4 (also observed on Erlang R15B01)
>>>> Amazon EC2 c3.8xlarge (32 Virtual CPUs, ~64 GB Memory)
>>>> Debian Wheezy
>>>> uname -a: Linux rtb0.ec2.chitika.net 3.2.0-4-amd64 #1 SMP Debian
>>>> 3.2.63-2 x86_64 GNU/Linux
>>>>
>>>> *TESTING*
>>>> Erlang/OTP 17.4
>>>> Intel Core i5 760 @ 2.80GHz (4 Logical CPUs, 2 cores IIRC), ~16GB Memory
>>>> Arch Linux (up-to-date)
>>>> uname -a: Linux diogenes 4.0.1-1-ARCH #1 SMP PREEMPT Wed Apr 29
>>>> 12:00:26 CEST 2015 x86_64 GNU/Linux
>>>>
>>>>
>>>> _______________________________________________
>>>> erlang-bugs mailing list
>>>> erlang-bugs@REDACTED
>>>> http://erlang.org/mailman/listinfo/erlang-bugs
>>>>
>>>>
>>>
>>
>