[erlang-questions] segfault erts-5.10.4 (R16B03-1)

Tue Sep 8 12:11:38 CEST 2015

Hello,

Unfortunately that did not help; it just made some more arguments
available. I was hoping that it would give a full stack. If you do "p
allctr->name_prefix" you can find out which allocator is misbehaving.
I'm guessing that it will be "driver_", which covers any allocations
made in NIFs and linked-in drivers. Without a full stacktrace it will be
hard to figure out what is wrong.

One thing that you could do is to use the etp gdb macros distributed with
the Erlang/OTP source code. If you do "source
$ERL_TOP/erts/etc/unix/etp-commands" in gdb, you will get access to a lot
of helpful gdb macros. Replace $ERL_TOP with the path to the source of
R16B03-1 Erlang/OTP from github. Only this one file is needed, so if you
have to copy something onto a server somewhere, this file is enough.

If you then do "etp-ports", gdb will print all ports that were alive at
the moment of the crash. Look for any ports with a state that looks
different. That will most likely be the port that was executing. For
example, for me this is a currently running port:

  Pix: 2576
  Port: #Port<0.322>
  Name: tty_sl -c -e
  State: connected soft-eof
  Scheduler flags: GARBAGE
  Connected: <0.25.0>
  Pointer: (Port *) 0x7ffff54809d8

To get the name of the currently running driver, do "p ((Port
*)0x7ffff54809d8)->drv_ptr->name".
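
With many ports it can be tedious to spot the odd one by eye. As a rough
sketch only: the Python below parses saved etp-ports output, treats the
least common State string as "looks different" (a heuristic of mine, not
something the macros do), and builds the gdb expression above for each
such port. The field layout is assumed from the sample entry in this mail,
and all function names are made up for illustration.

```python
import re
from collections import Counter

# "Key: value" lines, as in the sample etp-ports entry (assumed layout).
FIELD = re.compile(r"^([A-Za-z][A-Za-z \-]*):\s*(.*)$")


def parse_ports(text):
    """Split etp-ports output into one dict per port.

    A new record is assumed to start at each "Pix" line.
    """
    ports = []
    for line in text.splitlines():
        match = FIELD.match(line.strip())
        if not match:
            continue
        key, value = match.groups()
        if key == "Pix":
            ports.append({})
        if ports:
            ports[-1][key] = value
    return ports


def unusual_ports(ports):
    """Return the ports whose State string is the least common one."""
    if not ports:
        return []
    counts = Counter(p.get("State", "") for p in ports)
    rarest = min(counts.values())
    return [p for p in ports if counts[p.get("State", "")] == rarest]


def driver_name_command(port):
    """Build the gdb expression that prints the port's driver name."""
    address = port["Pointer"].split()[-1]
    return f"p ((Port *){address})->drv_ptr->name"


if __name__ == "__main__":
    # Hypothetical file holding the text printed by etp-ports.
    with open("etp-ports.txt") as handle:
        for port in unusual_ports(parse_ports(handle.read())):
            print(port.get("Port"), port.get("State"))
            print("  " + driver_name_command(port))
```

The rarest-state rule is only a starting point; if every port has a
distinct state, all of them are returned and you still need to look at
them yourself.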

If no port is executing, it might be a NIF. In that case you can do
"etp-processes" to get a list of all processes in the system. Again, look
for any state that looks different. For example:

  Pix: 200
  Pid: <0.25.0>
  State: trapping-exit | running | active | prq-prio-normal | usr-prio-normal | act-prio-normal
  Registered name: user_drv
  I: #Cp<user_drv:io_command/1+0x520>
  Heap size: 610
  Old-heap size: 987
  Mbuf size: 0
  Msgq len: 0 (inner=0, outer=0)
  Parent: <0.24.0>
  Pointer: (Process *) 0x7ffff51c4df0

This is a currently running process (State: running). You can get a
stackdump of the process by doing: etp-stackdump ((Process*)0x7ffff51c4df0)
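
If the etp-processes output is long, the search for "running" can be
scripted. The following is a minimal sketch, assuming the output has been
saved to a text file and follows the layout of the sample above (including
a State line that may be wrapped onto a second line). The function names
are hypothetical; only the etp-stackdump command itself comes from the etp
macros.

```python
import re

# "Key: value" lines, as in the sample etp-processes entry (assumed layout).
FIELD = re.compile(r"^([A-Za-z][A-Za-z \-]*):\s*(.*)$")


def parse_processes(text):
    """Split etp-processes output into one dict per process.

    A new record is assumed to start at each "Pix" line. A line without
    a "Key:" prefix is treated as the wrapped tail of the previous field.
    """
    procs = []
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = FIELD.match(line)
        if match:
            key, value = match.groups()
            if key == "Pix":
                procs.append({})
            if procs:
                procs[-1][key] = value
                last_key = key
        elif procs and last_key:
            procs[-1][last_key] += " " + line
    return procs


def running_processes(procs):
    """Keep the processes whose State flags include "running"."""
    result = []
    for proc in procs:
        flags = {flag.strip() for flag in proc.get("State", "").split("|")}
        if "running" in flags:
            result.append(proc)
    return result


def stackdump_command(proc):
    """Build the etp-stackdump command for one process."""
    address = proc["Pointer"].split()[-1]
    return f"etp-stackdump ((Process*){address})"


if __name__ == "__main__":
    # Hypothetical file holding the text printed by etp-processes.
    with open("etp-processes.txt") as handle:
        for proc in running_processes(parse_processes(handle.read())):
            print(proc.get("Pid"), proc.get("Registered name", ""))
            print("  " + stackdump_command(proc))
```

The printed commands can then be pasted back into the gdb session.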

Note that etp-processes and etp-ports will take quite some time to
finish. They need to iterate over all possible processes/ports, and gdb is
not the fastest scripting language in the world.
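
Since these macros are slow, one option is to run them unattended and read
the output afterwards. The sketch below only builds a gdb command line
using gdb's standard -batch and -ex options; it does not run anything. The
core file name and the etp-commands path are placeholders that you would
replace, and the helper name is made up.

```python
import shlex


def gdb_batch_command(beam_path, core_path, etp_commands_path, macros):
    """Build an argv list that sources etp-commands and runs each macro.

    gdb executes -ex commands in order after loading the program and the
    core file, then exits because of -batch.
    """
    argv = ["gdb", "-batch", "-ex", f"source {etp_commands_path}"]
    for macro in macros:
        argv += ["-ex", macro]
    argv += [beam_path, core_path]
    return argv


if __name__ == "__main__":
    argv = gdb_batch_command(
        "/var/lib/ejabberd/erts-5.10.4/bin/beam.smp",
        "core",  # placeholder: path to your core file
        "/path/to/otp/erts/etc/unix/etp-commands",  # placeholder path
        ["etp-ports", "etp-processes"],
    )
    # Print a shell-quoted command line; redirect its output to a file
    # when you run it.
    print(shlex.join(argv))
```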

For some help with the etp gdb commands you can issue "etp-help".

Happy hunting!
Lukas

On Tue, Sep 8, 2015 at 11:45 AM, Ahmed Omar <spawn.think@REDACTED> wrote:

> Hi Lukas,
> Thanks for your reply. I tried with the latest version of gdb (7.10) :
>
> ###
> (gdb) bt full
> #0  0x000000000044d299 in link_free_block (allctr=0x15e32c0, block=0x128)
> at beam/erl_goodfit_alloc.c:439
>         gfallctr = 0x15e32c0
>         blk = 0x128
>         sz = 0
>         i = <optimized out>
> #1  0x00000000015e32c0 in ?? ()
> No symbol table info available.
> #2  0x0000000000442fa6 in mbc_realloc (allctr=0x7fe0848807a8, p=0x11f,
> size=<optimized out>, busy_pcrr_pp=0x8, alcu_flgs=0) at
> beam/erl_alloc_util.c:2370
>         crr = 0x128
>         new_p = <optimized out>
>         old_blk_sz = 287
>         blk = 0x117
>         new_blk = <optimized out>
>         cand_blk = <optimized out>
>         cand_blk_sz = <optimized out>
>         blk_sz = 3748409
>         nxt_blk = 0x236
>         nxt_blk_sz = 22950592
>         is_last_blk = 296
>         get_blk_sz = 140602277246336
> #3  0x0000000000000000 in ?? ()
> No symbol table info available.
> ###
>
> Best Regards,
> - Ahmed Omar
> http://about.me/spawn.think/
>
> 2015-09-08 10:51 GMT+02:00 Lukas Larsson <garazdawi@REDACTED>:
>
>> Hello,
>>
>> On Tue, Sep 8, 2015 at 10:33 AM, Ahmed Omar <spawn.think@REDACTED>
>> wrote:
>>
>>> Hi,
>>> We have been experiencing a segfault on our servers running a custom
>>> version of Ejabberd. We managed to get a core file from the last crash
>>> This is what we see running gdb on it:
>>> ######
>>> Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols
>>> found)...done.
>>> Loaded symbols for /lib64/ld-linux-x86-64.so.2
>>> Core was generated by `/var/lib/ejabberd/erts-5.10.4/bin/beam.smp -K
>>> true -A 128 -P 2500000 -Q 500000'.
>>> Program terminated with signal 11, Segmentation fault.
>>> #0  0x000000000044d299 in link_free_block (allctr=0x15e32c0,
>>> block=0x128) at beam/erl_goodfit_alloc.c:439
>>> 439 beam/erl_goodfit_alloc.c: No such file or directory.
>>> in beam/erl_goodfit_alloc.c
>>> ######
>>>
>>> If we run bt full in gdb we get:
>>> ######
>>> (gdb) bt full
>>> #0  0x000000000044d299 in link_free_block (allctr=0x15e32c0,
>>> block=0x128) at beam/erl_goodfit_alloc.c:439
>>>         gfallctr = 0x15e32c0
>>>         blk = 0x128
>>>         sz = 0
>>>         i = <value optimized out>
>>> #1  0x00000000015e32c0 in ?? ()
>>> No symbol table info available.
>>> #2  0x0000000000442fa6 in mbc_realloc (allctr=0x7fe0848807a8, p=0x11f,
>>> size=Unhandled dwarf expression opcode 0xf3
>>> ) at beam/erl_alloc_util.c:2370
>>>         crr = 0x128
>>>         new_p = <value optimized out>
>>>         old_blk_sz = 287
>>>         blk = 0x117
>>>         new_blk = <value optimized out>
>>>         cand_blk = <value optimized out>
>>>         cand_blk_sz = <value optimized out>
>>>         blk_sz = 3748409
>>>         nxt_blk = 0x236
>>>         nxt_blk_sz = 22950592
>>>         is_last_blk = 296
>>>         get_blk_sz = 140602277246336
>>> #3  0x0000000000000000 in ?? ()
>>> No symbol table info available.
>>> #######
>>>
>>> Is there a way to get more information? maybe which driver made the
>>> realloc call?
>>>
>>
>> Something is wrong/missing from this stacktrace. The gdb that you are
>> using does not seem to understand the dwarf2 extension (at least that's
>> what I guess after googling "Unhandled dwarf expression opcode 0xf3"), and
>> can only find two of the frames. Try to install a later version of gdb and
>> then do a bt full.
>>
>>
>>>
>>> Best Regards,
>>> - Ahmed Omar
>>> http://about.me/spawn.think/
>>>
>>> _______________________________________________
>>> erlang-questions mailing list
>>> erlang-questions@REDACTED
>>> http://erlang.org/mailman/listinfo/erlang-questions
>>>
>>>
>>
>