[erlang-bugs] beam core'ing

Thu Nov 22 16:23:40 CET 2012

Hi!

Thanks to everyone helping out trying to find this bug!

With the help of Denis, I have now a verified fix for the garbage 
collector bug which moved a "fixed" (and writable) binary in the middle 
of erts_bs_append (erts_bs_append in erl_bits.c was the "innocent 
bystander" triggering the gc bug). The bugfix will be a last minute 
contribution to R15B03, but I also attach a source patch to this mail.

Cheers,
/Patrik
On 11/21/2012 05:35 PM, Musumeci, Antonio S wrote:
>
> Something my team just noticed was that our segv occurs right after 
> reboot of the box consistantly. After which beam appears to work 
> alright. We are trying to narrow down what code is triggering it but 
> it may take some time.
>
> ------------------------------------------------------------------------
> *From:* Patrik Nyblom [mailto:pan@REDACTED]
> *Sent:* Wednesday, November 21, 2012 6:09 AM
> *To:* Denis Titoruk
> *Cc:* Musumeci, Antonio S (Enterprise Infrastructure); 
> erlang-bugs@REDACTED
> *Subject:* Re: [erlang-bugs] beam core'ing
>
> Hi again :)
>
> Another thing that would be helpful is if you could create a crash 
> dump instead of a fprintf when the binary is wrongly moved, i.e. call 
> erl_exit(ERTS_DUMP_EXIT, "erts_current_bin != (pb->bytes)"); instead 
> of the fprintf? Then you could isolate the erlang code snippet that 
> exercises the bug and I maybe could create a smaller testcase... A 
> simple testcase when diving into the GC would be really helpful :)
>
> Cheers,
> /Patrik
>
> On 11/21/2012 11:21 AM, Denis Titoruk wrote:
>>
>> 21.11.2012, в 13:44, Patrik Nyblom написал(а):
>>
>>> Hi!
>>> On 11/20/2012 10:40 PM, Denis Titoruk wrote:
>>>> Hi,
>>>>
>>>> We've got the same error on R15B01, R15B02
>>>> I've finished my investigation of this issue today & here is result:
>>>>
>>>> Let's assume we have the code:
>>>> encode_formats(Columns) ->
>>>>    encode_formats(Columns, 0, <<>>).
>>>>
>>>> encode_formats([], Count, Acc) ->
>>>>    <<Count:?int16, Acc/binary>>;
>>>>
>>>> encode_formats([#column{format = Format} | T], Count, Acc) ->
>>>>    encode_formats(T, Count + 1, <<Acc/binary, Format:?int16>>).
>>>>
>>>> So, <<Acc/binary, Format:?int16>> translates to
>>>>
>>>>  {bs_append,{f,0},{integer,16},0,7,8,{x,2},{field_flags,[]},{x,1}}.
>>>>    {bs_put_integer,{f,0},{integer,16},1,{field_flags,[signed,big]},{x,6}}.
>>>>
>>>> There is GC execution in bs_append and it can reallocate binary but 
>>>> there isn't reassigning erts_current_bin which used in bs_put_integer.
>>>>
>>>> Fix:
>>>>
>>>> erl_bits.c:
>>>> Eterm
>>>> erts_bs_append(Process* c_p, Eterm* reg, Uint live, Eterm 
>>>> build_size_term,
>>>>        Uint extra_words, Uint unit)
>>>> ...
>>>>    if (c_p->stop - c_p->htop < heap_need) {
>>>>        (void) erts_garbage_collect(c_p, heap_need, reg, live+1);
>>>>    }
>>>>    sb = (ErlSubBin *) c_p->htop;
>>>>    c_p->htop += ERL_SUB_BIN_SIZE;
>>>>    sb->thing_word = HEADER_SUB_BIN;
>>>>    sb->size = BYTE_OFFSET(used_size_in_bits);
>>>>    sb->bitsize = BIT_OFFSET(used_size_in_bits);
>>>>    sb->offs = 0;
>>>>    sb->bitoffs = 0;
>>>>    sb->is_writable = 1;
>>>>    sb->orig = reg[live];
>>>>
>>>> ///////////////////////////////////////////////////////////////////
>>>> // add this lines
>>>> ///////////////////////////////////////////////////////////////////
>>>>    pb = (ProcBin *) boxed_val(sb->orig);
>>>>    erts_current_bin = pb->bytes;
>>>>    erts_writable_bin = 1;
>>>> ///////////////////////////////////////////////////////////////////
>>>>
>>>>    return make_binary(sb);
>>>> ...
>>>>
>>> Can you reproduce the bug and verify that this fix really works? The 
>>> thing is that binaries should *only* be reallocated in the gc if 
>>> there are no active writers, which there obviously is here (    
>>> pb->flags |= PB_ACTIVE_WRITER  a few lines earlier), so the bug 
>>> would be in the detection of active writers in the gc if this code 
>>> change actually removes the crash.
>>
>> Yes, it works in my case. I haven't simple test case for reproducing 
>> this bug (actually I run few processes to send requests to pgsql)
>>
>>     pb = (ProcBin *) boxed_val(sb->orig);
>>     if (erts_current_bin != (pb->bytes)) {
>>         fprintf(stderr, "erts_current_bin != (pb->bytes)\n");
>>         fflush(stderr);
>>     }
>>     erts_current_bin = pb->bytes;
>>     erts_writable_bin = 1;
>>
>>
>> (jskit@REDACTED)1> f(F), F = fun() -> 
>> postgresql:equery('echo-customers', write, <<"some query here">>, []) 
>> end.
>> #Fun<erl_eval.20.82930912>
>> (jskit@REDACTED)2> perftest:comprehensive(1000, F).
>> Sequential 100 cycles in ~1 seconds (100 cycles/s)
>> Sequential 200 cycles in ~2 seconds (106 cycles/s)
>> Sequential 1000 cycles in ~12 seconds (85 cycles/s)
>> Parallel 2 1000 cycles in ~8 seconds (132 cycles/s)
>> Parallel 4 1000 cycles in ~8 seconds (121 cycles/s)
>> Parallel 10 1000 cycles in ~8 seconds (119 cycles/s)
>> Parallel 100 1000 cycles in ~13 seconds (74 cycles/s)
>> [85,132,121,119,74]
>> (jskit@REDACTED)3> perftest:comprehensive(1000, F).
>> Sequential 100 cycles in ~1 seconds (83 cycles/s)
>> Sequential 200 cycles in ~2 seconds (83 cycles/s)
>> Sequential 1000 cycles in ~14 seconds (71 cycles/s)
>> Parallel 2 1000 cycles in ~11 seconds (95 cycles/s)
>> Parallel 4 1000 cycles in ~10 seconds (105 cycles/s)
>> Parallel 10 1000 cycles in ~11 seconds (91 cycles/s)
>> Parallel 100 1000 cycles in ~13 seconds (76 cycles/s)
>> "G_i[L"
>> (jskit@REDACTED)4> perftest:comprehensive(1000, F).
>> Sequential 100 cycles in ~1 seconds (88 cycles/s)
>> Sequential 200 cycles in ~2 seconds (85 cycles/s)
>> Sequential 1000 cycles in ~13 seconds (74 cycles/s)
>> Parallel 2 1000 cycles in ~9 seconds (109 cycles/s)
>> Parallel 4 1000 cycles in ~10 seconds (101 cycles/s)
>> Parallel 10 1000 cycles in ~11 seconds (95 cycles/s)
>> erts_current_bin != (pb->bytes)
>> Parallel 100 1000 cycles in ~13 seconds (77 cycles/s)
>> "Jme_M"
>>
>>>
>>>>
>>>> --
>>>> Cheers,
>>>> Denis
>>> Cheers,
>>> /Patrik
>>>>
>>>> 20.11.2012, в 19:37, Musumeci, Antonio S написал(а):
>>>>
>>>>>
>>>>> I've got lots of cores... but they are all from optimized builds.
>>>>>
>>>>> Has this been seen in other versions? We are keen to solve this 
>>>>> because it's causing us pain in production. We hit another, older, 
>>>>> memory bug (the 32bit values used in 64bit build)... and now this.
>>>>>
>>>>> I'm going to be building and trying R15B01 to see if we hit it as 
>>>>> well. I'll send any additional information I can.Any suggestions 
>>>>> on debugging beam would be appreciated. Compile options, etc.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> -antonio
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>> *From:*erlang-bugs-bounces@REDACTED 
>>>>> <mailto:erlang-bugs-bounces@REDACTED>[mailto:erlang-bugs-bounces@REDACTED]*On 
>>>>> Behalf Of*Patrik Nyblom
>>>>> *Sent:*Monday, November 19, 2012 8:55 AM
>>>>> *To:*erlang-bugs@REDACTED <mailto:erlang-bugs@REDACTED>
>>>>> *Subject:*Re: [erlang-bugs] beam core'ing
>>>>>
>>>>> On 11/19/2012 02:01 PM, Musumeci, Antonio S wrote:
>>>>>>
>>>>>> I'm just starting to debug this but figured I'd send it along in 
>>>>>> case anyone has seen this before.
>>>>>>
>>>>>> 64bit RHEL 5.0.1
>>>>>>
>>>>>> built from source beam.smp R15B02
>>>>>>
>>>>>> Happens consistently when trying to start our app and then just 
>>>>>> stops after a time. Across a few boxes. Oddly we have an 
>>>>>> identical cluster (hw and sw) and it never happens.
>>>>>>
>>>>> Yes! I've seen it before and have tried for several months to get 
>>>>> areproducable example and acore i can analyze here. I've had one 
>>>>> core that wassomewhat readable but had no luck in locating the 
>>>>> beam code that triggered this. If you could try narrowing it down, 
>>>>> I would be really grateful!
>>>>>
>>>>> Please email me any findings, theories, cores dumps- anything! I 
>>>>> really want to find this! The most interesting would be to find 
>>>>> the snippet of erlang code that makes this happen (intermittently 
>>>>> probably).
>>>>>
>>>>> The problem isthatwhen the allocators crash, the error is usually 
>>>>> somewhere else.Access of freed memory, double free or something 
>>>>> else doing horrid things to memory. Obviously none of our 
>>>>> testsuites exercise this bug asneither our debug builds, nor our 
>>>>> valgrind runs find it. It happens on both SMP and non SMP and is 
>>>>> always in the context of the erts_bs_append, so I'm pretty sure 
>>>>> this has a connection to the other users seeing the crash in the 
>>>>> allocators...
>>>>>
>>>>> Cheers,
>>>>> Patrik
>>>>>>
>>>>>> #0 bf_unlink_free_block (flags=<optimized out>, block=0x6f00, 
>>>>>> allctr=<optimized out>) at beam/erl_bestfit_alloc.c:789
>>>>>> #1 bf_get_free_block (allctr=0x6824600, size=304, cand_blk=0x0, 
>>>>>> cand_size=<optimized out>, flags=0) at beam/erl_bestfit_alloc.c:869
>>>>>> #2 0x000000000045343c in mbc_alloc_block (alcu_flgsp=<optimized 
>>>>>> out>, blk_szp=<optimized out>, size=<optimized out>, 
>>>>>> allctr=<optimized out>) at beam/erl_alloc_util.c:1198
>>>>>> #3 mbc_alloc (allctr=0x6824600, size=295) at 
>>>>>> beam/erl_alloc_util.c:1345
>>>>>> #4 0x000000000045398d in do_erts_alcu_alloc (type=164, 
>>>>>> extra=0x6824600, size=295) at beam/erl_alloc_util.c:3442
>>>>>> #5 0x0000000000453a0f in erts_alcu_alloc_thr_pref (type=164, 
>>>>>> extra=<optimized out>, size=287) at beam/erl_alloc_util.c:3520
>>>>>> #6 0x0000000000511463 in erts_alloc (size=287, type=<optimized 
>>>>>> out>) at beam/erl_alloc.h:208
>>>>>> #7 erts_bin_nrml_alloc (size=<optimized out>) at 
>>>>>> beam/erl_binary.h:260
>>>>>> #8 erts_bs_append (c_p=0x69fba60, reg=<optimized out>, 
>>>>>> live=<optimized out>, build_size_term=<optimized out>, 
>>>>>> extra_words=0, unit=8)at beam/erl_bits.c:1327
>>>>>> #9 0x000000000053ffd8 in process_main () at beam/beam_emu.c:3858
>>>>>> #10 0x00000000004ae853 in sched_thread_func (vesdp=<optimized 
>>>>>> out>) at beam/erl_process.c:5184
>>>>>> #11 0x00000000005c17e9 in thr_wrapper (vtwd=<optimized out>) at 
>>>>>> pthread/ethread.c:106
>>>>>> #12 0x00002b430f39e73d in start_thread () from /lib64/libpthread.so.0
>>>>>> #13 0x00002b430f890f6d in clone () from /lib64/libc.so.6
>>>>>> #14 0x0000000000000000 in ?? ()
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> erlang-bugs mailing list
>>>>>> erlang-bugs@REDACTED
>>>>>> http://erlang.org/mailman/listinfo/erlang-bugs
>>>>>
>>>>> _______________________________________________
>>>>> erlang-bugs mailing list
>>>>> erlang-bugs@REDACTED <mailto:erlang-bugs@REDACTED>
>>>>> http://erlang.org/mailman/listinfo/erlang-bugs
>>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-bugs/attachments/20121122/7e80a8a7/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bs_append_crash.diff
Type: text/x-patch
Size: 1053 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-bugs/attachments/20121122/7e80a8a7/attachment.bin>