[erlang-questions] beam[8449]: segfault at 0 ip 0000000000437e10 sp 00007fffce250948 error 4 in beam[400000+174000]

Sat Jun 12 11:45:26 CEST 2010

Eric Liang wrote:
> On 05/27/2010 02:14 AM, Mikael Pettersson wrote:
> > Eric Liang wrote:
> >
> >> I've done a build of the source, but it just can't match the object. How
> >> do you make it? I use the command: apt-get source to get the source, so
> >> it does have the same version with the object.
> >>
> > I did:
> >
> >
> >> tar zxvf otp_src_R13B03.tar.gz
> >> cd otp_src_R13B03
> >> ./configure; make
> >>
> > The binary files of interest are bin/x86_64-unknown-linux-gnu/beam and
> > erts/emulator/obj/x86_64-unknown-linux-gnu/opt/plain/erl_goodfit_alloc.o.
> >
> Thanks Mikael, and sorry for replying so late; the segfault does not
> occur every time.
> 
> I got the debug symbols by following this:
> 
>     http://forum.nginx.org/read.php?26,93440,94735
> 
> >>> You can get a stack dump from the crash by attaching gdb to the
> >>> soon-to-crash beam process. Now instead of being terminated gdb will
> >>> get control of the process and you should be able to print a stack
> >>> trace with bt or where. (This does require that there's a sufficient
> >>> time window from the start of the application to the crash.)
> >>>
> >> I've made a core dump 4 seconds before it crashes; as mentioned above,
> >> because I don't have the right symbols, it just shows question marks:
> >>
> >>     Core was generated by `/usr/lib/erlang/erts-5.7.2/bin/beam'.
> >>     #0  0x00007f0a28ecd5a9 in ?? ()
> >>     (gdb) whe
> >>     #0  0x00007f0a28ecd5a9 in ?? ()
> >>     #1  0x0000000000000000 in ?? ()
> >>     (gdb)
> >>
> > A core dump from a time point before the crash is useless. Either get a
> > core dump from the crash itself (execute `ulimit -c unlimited' in bash
> > before running the test), or attach gdb, continue the process, and wait
> > for gdb to receive control when the crash occurs.
> >
> 
> I did set ulimit -c in /etc/profile, and after rebooting:
> 
>     sunny@REDACTED:~$ ulimit -c
>     unlimited
>     sunny@REDACTED:~/commands$ cat /proc/sys/kernel/core_pattern
>     /tmp/core.%t.%e.%p
> 
> And the test is OK:
> 
>     sunny@REDACTED:~$ kill -s SIGSEGV $$
>     Connection to dev-2 closed.
>     sunny@REDACTED:~$ ls /tmp/
>     core.1275730620.bash.12566
> 
> But still no core file is generated when the error occurs.
> 
> Anyway, I attached gdb to the running process, and here is the result:
> 
>     Program received signal SIGSEGV, Segmentation fault.
>     unlink_free_block (allctr=0x7ad480, block=0x0) at
>     beam/erl_goodfit_alloc.c:453
>     453        Uint sz = BLK_SZ(blk);
>     (gdb) whe
>     #0  unlink_free_block (allctr=0x7ad480, block=0x0) at
>     beam/erl_goodfit_alloc.c:453
>     #1  0x0000000000437fd6 in get_free_block (allctr=0x7ad480,
>     size=<value optimized out>, cand_blk=0x0, cand_size=0)
>         at beam/erl_goodfit_alloc.c:421
>     #2  0x00000000004322c6 in mbc_alloc_block (allctr=0x7ad480, size=72)
>     at beam/erl_alloc_util.c:631
>     #3  mbc_alloc (allctr=0x7ad480, size=72) at beam/erl_alloc_util.c:758
>     #4  0x00000000004b1697 in erts_alloc () at beam/erl_alloc.h:179
>     #5  exit_async () at beam/erl_async.c:132
>     #6  0x000000000043c13d in system_cleanup (exit_code=<value optimized
>     out>) at beam/erl_init.c:1306
>     #7  0x000000000043c443 in erl_exit (n=0, fmt=0x54649c "") at
>     beam/erl_init.c:1380
>     #8  0x000000000045d042 in halt_0 (A__p=<value optimized out>) at
>     beam/bif.c:3319
>     #9  0x00000000004d081f in process_main () at beam/beam_emu.c:2008
>     #10 0x000000000043d56c in erl_start (argc=34, argv=<value optimized
>     out>) at beam/erl_init.c:1233
>     #11 0x00000000004269b9 in main (argc=8049792, argv=0x0) at
>     sys/unix/erl_main.c:29
>     (gdb) f 1
>     #1  0x0000000000437fd6 in get_free_block (allctr=0x7ad480,
>     size=<value optimized out>, cand_blk=0x0, cand_size=0)
>         at beam/erl_goodfit_alloc.c:421
>     421        unlink_free_block(allctr, blk);
>     (gdb) l 421
>     416        /* We are guaranteed to find a block that fits in this
>     bucket */
>     417        blk = search_bucket(allctr, min_bi, size);
>     418        ASSERT(blk);
>     419        if (cand_blk && cand_size <= BLK_SZ(blk))
>     420        return NULL; /* cand_blk was better */
>     421        unlink_free_block(allctr, blk);
>     422        return blk;
>     423    }
>     424
>     425
>     (gdb)
> 
> As the running process uses the non-debug build of beam, I guess the
> ASSERT at line 418 is not active. So I dug in:
> 
>     (gdb) p allctr
>     $1 = (Allctr_t *) 0x7ad480
>     (gdb) p min_bi
>     $2 = <value optimized out>
>     (gdb) p size
>     $3 = <value optimized out>
>     (gdb) p *allctr
>     $4 = {name_prefix = 0x534227 "sl_", alloc_no = 3, name = {alloc = 0,
>     realloc = 0, free = 0},
>       vsn_str = 0x53602f "2.1", t = 0, ramv = 0, sbc_threshold = 524288,
>     sbc_move_threshold = 80,
>       mbc_move_threshold = 50, main_carrier_size = 131072, max_mseg_sbcs
>     = 256, max_mseg_mbcs = 5,
>       largest_mbc_size = 10485760, smallest_mbc_size = 1048576,
>     mbc_growth_stages = 10, mseg_opt = {cache = 1,
>         preserv = 1, abs_shrink_th = 4145152, rel_shrink_th = 80},
>     mbc_header_size = 32, sbc_header_size = 32,
>       min_mbc_size = 16384, min_mbc_first_free_size = 4096,
>     min_block_size = 32, mbc_list = {first = 0x7f4f93a5d010,
>         last = 0x7f4f93a5d010}, sbc_list = {first = 0x0, last = 0x0},
>     main_carrier = 0x7f4f93a5d010,
>       get_free_block = 0x437f40 <get_free_block>, link_free_block =
>     0x437d00 <link_free_block>,
>       unlink_free_block = 0x437e10 <unlink_free_block>, info_options =
>     0x438480 <info_options>,
>       get_next_mbc_size = 0x430e40 <get_next_mbc_size>, creating_mbc =
>     0x438100 <update_last_aux_mbc>,
>       destroying_mbc = 0x438100 <update_last_aux_mbc>, init_atoms =
>     0x4385c0 <init_atoms>, mutex = {mtx = {pt_mtx = {
>             __data = {__lock = 0, __count = 0, __owner = 0, __nusers =
>     0, __kind = 0, __spins = 0, __list = {
>                 __prev = 0x0, __next = 0x0}}, __size = '\000' <repeats
>     39 times>, __align = 0}, is_rec_mtx = 0,
>           prev = 0x0, next = 0x0}}, thread_safe = 0, ts_list = {prev =
>     0x0, next = 0x0}, atoms_initialized = 0,
>       stopped = 0, calls = {this_alloc = {giga_no = 0, no = 2460},
>     this_free = {giga_no = 0, no = 2458},
>         this_realloc = {giga_no = 0, no = 0}, mseg_alloc = {giga_no = 0,
>     no = 0}, mseg_dealloc = {giga_no = 0, no = 0},
>         mseg_realloc = {giga_no = 0, no = 0}, sys_alloc = {giga_no = 0,
>     no = 1}, sys_free = {giga_no = 0, no = 0},
>         sys_realloc = {giga_no = 0, no = 0}}, sbcs = {curr_mseg = {no =
>     0, size = 0}, curr_sys_alloc = {no = 0,
>           size = 0}, max = {no = 0, size = 0}, max_ever = {no = 0, size
>     = 0}, blocks = {curr = {no = 0, size = 0},
>           max = {no = 0, size = 0}, max_ever = {no = 0, size = 0}}},
>     mbcs = {curr_mseg = {no = 0, size = 0},
>         curr_sys_alloc = {no = 1, size = 131112}, max = {no = 1, size =
>     131112}, max_ever = {no = 0, size = 0},
>         blocks = {curr = {no = 4, size = 384}, max = {no = 144, size =
>     13848}, max_ever = {no = 0, size = 0}}}}
>     (gdb)
> 
> And I'm stuck here. Do you have any advice? Any other suggestions
> would also be appreciated. TIA.

This shows that in erl_goodfit_alloc.c, the ASSERT(blk) at the
end of get_free_block() is bogus and that unlink_free_block() can
be invoked with a NULL blk, which will cause a crash.
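
To illustrate the mechanism, here is a minimal sketch in C. It is not
the actual ERTS code: Block, search_bucket_stub() and
get_block_checked() are made-up names, and the ASSERT macro below is
only assumed to behave like the emulator's, i.e. active when DEBUG is
defined and a no-op otherwise. In an optimized build such an ASSERT
checks nothing, so a NULL result passes straight through unless there
is an explicit run-time test.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical stand-in for a debug-only ASSERT: it aborts when DEBUG
 * is defined and expands to nothing otherwise. */
#ifdef DEBUG
#  define ASSERT(e) \
     ((e) ? (void) 0 \
          : (fprintf(stderr, "assertion failed: %s\n", #e), abort()))
#else
#  define ASSERT(e) ((void) 0)
#endif

/* Hypothetical block type; only here to make the sketch compile. */
typedef struct { size_t sz; } Block;

/* Hypothetical search that finds nothing, like an empty bucket. */
static Block *search_bucket_stub(void)
{
    return NULL;
}

/* Returns 1 and stores the block in *out if one was found, else 0.
 * The explicit NULL test is what protects a non-debug build. */
int get_block_checked(Block **out)
{
    Block *blk = search_bucket_stub();
    ASSERT(blk);        /* checks nothing unless built with -DDEBUG */
    if (blk == NULL)    /* run-time check that is always compiled in */
        return 0;
    *out = blk;
    return 1;
}
```

Built without -DDEBUG, get_block_checked() returns 0 instead of handing
a NULL block to its caller; built with -DDEBUG, the ASSERT aborts first.
Whether returning "no block" is the right behaviour for the goodfit
allocator is a separate question for someone who knows its invariants.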

You should send this to the erlang-bugs mailing list.  It needs
either the attention of someone who is intimately familiar with
the logic of these allocators (I'm not), or for you to make a
self-contained test case available (which you might not be able
to do if it's proprietary).

/Mikael