[erlang-questions] How to track down intermittent segfaults in a threaded NIF

Tue May 29 16:08:02 CEST 2018

Thanks Lukas and Peti, that's great. "erl -emu_type debug" definitely
works - I haven't made the debug build yet, but I get "erlexec: The emulator
'/usr/local/Cellar/erlang/20.3.4/lib/erlang/erts-9.3/bin/beam.debug.smp'
does not exist", which is what I want. I'll get onto the debug build and
see what I can find out.

In case anyone else wants to use that in rebar3 shell, I found
http://www.rebar3.org/v3.0/discuss/5745fb105528582000dfb47f which shows
you can either pass -emu_type directly in ERL_FLAGS, or point ERL_FLAGS
at a vm.args file and set it in there, e.g.:

> ERL_FLAGS=" -args_file config/vm.args -config config/sys.config" rebar3 shell

Cheers,
Igor

On 29/05/2018 14:09, Peti Gömöri wrote:
> Since OTP 20 the -emu_type flag might also work, e.g.:
>   erl -emu_type debug
>
> and you can put it in the vm.args file too
>
> On Tue, May 29, 2018 at 2:45 PM, Lukas Larsson <lukas@REDACTED> wrote:
>
>     I don't know how to make rebar3 run the debug emulator, but a
>     quick and dirty trick that I do when all else fails is to copy the
>     beam.debug.smp file over the beam.smp file.
>
>     You probably also have to copy the erl_child_setup.debug file;
>     that file should, however, keep its .debug suffix. So:
>
>     cp bin/`erts/autoconf/config.guess`/beam.debug.smp path/to/release/erts-v.s.n/bin/beam.smp
>     cp bin/`erts/autoconf/config.guess`/erl_child_setup.debug path/to/release/erts-v.s.n/bin/
>
>
>     On Tue, May 29, 2018 at 1:30 PM, Igor Clark <igor.clark@REDACTED> wrote:
>
>         Thanks very much Lukas, I think the debug emulator could be
>         what I'm looking for. The NIF only sometimes crashes on
>         lists:member/2 - those log lines are all from different
>         crashes (there's only one crashed thread each time), and
>         sometimes it just crashes on process_main. So I think I might
>         need the debug emulator to trace further.
>
>         However I have a lot to learn about how to integrate C tooling
>         with something so complex. When I run the debug emulator, does
>         it just show more detailed info in stack traces, or will I
>         need to attach gdb/lldb etc to find out what's going on? Is
>         there any more info on how to set this all up?
>
>         Also, not 100% sure how to run it, as I run my app with
>         "rebar3 shell" from a release layout during development, or
>         the same inside the NIF-specific app when trying to track
>         problems down there. The doc you linked says:
>
>>         To start the debug enabled runtime system execute:
>>
>>         $ $ERL_TOP/bin/cerl -debug
>
>         I realise these are more rebar3 than erlang questions, but I
>         can't find much in the rebar3 docs about them:
>
>         - How should I specify that rebar3 should run "cerl" instead
>         of "erl" ?
>
>         - Should I just add "-debug" in my "config/vm.args" or is
>         there another way to do this?
>
>         Thank you for your help!
>         i
>
>
>         On 29/05/2018 11:30, Lukas Larsson wrote:
>>         Have you tried to run your code in a debug emulator?
>>         https://github.com/erlang/otp/blob/master/HOWTO/INSTALL.md#how-to-build-a-debug-enabled-erlang-runtime-system
>>
>>
>>         Since it seems to be segfaulting in lists:member/2, I would
>>         guess that your nif somehow builds an invalid list that later
>>         is used by lists:member/2.
>>
>>         On Tue, May 29, 2018 at 11:04 AM, Igor Clark <igor.clark@REDACTED> wrote:
>>
>>             Thanks Sergej - that's where I got the thread reports I
>>             pasted in below, from e.g.
>>             'beam.smp_2018-05-28-212735_Igor-Clarks-iMac.crash'.
>>
>>             Each log says the only crashed thread was a scheduler
>>             thread, for example "8_scheduler" running "process_main"
>>             in the case of the first one below. This is how I tracked
>>             down a bunch of errors in my own code, but the only ones
>>             that still happen are in the scheduler, according to the
>>             Console crash logs.
>>
>>             The thing is, it seems really unlikely that a VM running
>>             my NIF code would just happen to be crashing in the
>>             scheduler rather than my code(!) - so that's what I'm
>>             trying to work out, how to find out what's actually going
>>             on, given that the log tells me the crashed thread is
>>             running "process_main" or 'lists_member_2'.
>>
>>             Any suggestions welcome!
>>
>>             Cheers,
>>             Igor
>>
>>
>>             On 29/05/2018 04:16, Sergej Jurečko wrote:
>>
>>                 On macOS there is a quick way to get a stack trace if
>>                 you compiled with debug symbols.
>>                 Open /Applications/Utilities/Console
>>                 Go to: User Reports
>>
>>                 You will see beam.smp in there if it crashed. Click
>>                 on it and you get a report of what every thread was
>>                 calling at the time of the crash.
>>
>>
>>                 Regards,
>>                 Sergej
>>
>>                     On 28 May 2018, at 23:46, Igor Clark
>>                     <igor.clark@REDACTED> wrote:
>>
>>                     Hi folks, hope all well,
>>
>>                     I have a NIF which very occasionally segfaults,
>>                     intermittently and apparently unpredictably,
>>                     bringing down the VM. I've spent a bunch of time
>>                     tracing allocation and dereferencing problems in
>>                     my NIF code, and I've got rid of what seems like
>>                     99%+ of the problems - but it still occasionally
>>                     happens, and I'm having trouble tracing further,
>>                     because the crash logs show the crashed threads
>>                     as doing things like these: (each one taken from
>>                     a separate log where it's the only crashed thread)
>>
>>
>>                         Thread 40 Crashed:: 8_scheduler
>>                         0   beam.smp 0x000000001c19980b process_main + 1570
>>
>>                         Thread 5 Crashed:: 3_scheduler
>>                         0   beam.smp 0x000000001c01d80b process_main + 1570
>>
>>                         Thread 7 Crashed:: 5_scheduler
>>                         0   beam.smp 0x000000001baff0b8 lists_member_2 + 63
>>
>>                         Thread 3 Crashed:: 1_scheduler
>>                         0   beam.smp 0x000000001d4b780b process_main + 1570
>>
>>                         Thread 5 Crashed:: 3_scheduler
>>                         0   beam.smp 0x000000001fcf280b process_main + 1570
>>
>>                         Thread 6 Crashed:: 4_scheduler
>>                         0   beam.smp 0x000000001ae290b8 lists_member_2 + 63
>>
>>
>>                     I'm very confident that the problems are in my
>>                     code, not in the scheduler ;-) But without more
>>                     detail, I don't know how to trace where they're
>>                     happening. When they do, there are sometimes
>>                     other threads doing things in my code (maybe 20%
>>                     of the time) - but mostly not, and on the
>>                     occasions when they are, I've been unable to see
>>                     what the problem might be on the lines referenced.
>>
>>                     It seems like it's some kind of cross-thread data
>>                     access issue, but I don't know how to track it down.
>>
>>                     Some more context about what's going on. My NIF
>>                     load() function starts a thread which passes a
>>                     callback function to a library that talks to some
>>                     hardware, which calls the callback when it has a
>>                     message. It's a separate thread because the
>>                     library only calls back to the thread that
>>                     initialized it; when I ran it directly in NIF
>>                     load(), it didn't call back, but in the
>>                     VM-managed thread, it works as expected. The
>>                     thread sits and waits for stuff to happen, and
>>                     callbacks come when they should.
>>
>>                     I use enif_thread_create/enif_thread_opts_create
>>                     to start the thread, and use enif_alloc/enif_free
>>                     everywhere. I keep a static pointer in the NIF to
>>                     a couple of members of the state struct, as that
>>                     seems the only way to reference them in the
>>                     callback function. The struct is kept in NIF
>>                     private data: I pass **priv from load() to the
>>                     thread_main function, allocate the state struct
>>                     using enif_alloc in thread_main, and set priv
>>                     pointing to the state struct, also in the thread.
>>                     Other NIF functions do access members of the
>>                     state struct, but only ever through
>>                     enif_priv_data( env ).
>>
>>                     The vast majority of the time it all works
>>                     perfectly, humming along very nicely, but every
>>                     now and then, without any real pattern I can see,
>>                     it just segfaults and the VM comes down. It's
>>                     only happened 3 times in the last 20+ hours of
>>                     working on the app, testing & running all the
>>                     while, doing VM starts, stops, code reloads, etc.
>>                     But when it happens, it's kind of a showstopper,
>>                     and I'd really like to nail it down.
>>
>>                     This is all happening in Erlang 20.3.4 on MacOS
>>                     10.12.6 / Apple LLVM version 9.0.0 (clang-900.0.38).
>>
>>                     Any ideas on how/where to look next to try to
>>                     track this down? Hope it's not something
>>                     structural in the above which just won't work.
>>
>>                     Cheers,
>>                     Igor
>>
>>
>>                     _______________________________________________
>>                     erlang-questions mailing list
>>                     erlang-questions@REDACTED
>>                     http://erlang.org/mailman/listinfo/erlang-questions
>>
>>
>>
>>
>
>
>
>
>
>
>
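
For reference, the state handling described in the original question
(state struct allocated inside the thread, priv set from the thread) may
leave a window in which other code sees an unset or half-initialised
pointer. Below is a minimal sketch of one alternative ordering, written
with plain pthreads rather than the erl_nif API so that it is
self-contained: the loading side allocates the state, starts the worker,
waits until the worker reports ready, and only then publishes the pointer.
The names nif_state, worker_main, state_start and state_stop are
hypothetical, and this is only one possible source of the crashes, not a
diagnosis. In a real NIF the enif_thread_*, enif_mutex_* and enif_cond_*
calls would take the place of the pthread calls.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical shared state; the fields are illustrative only. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  ready_cond;
    int             ready;   /* set to 1 once the worker has initialised */
} nif_state;

/* Worker thread: a real one would initialise the hardware library here
 * and then loop waiting for callbacks. This sketch only signals ready. */
static void *worker_main(void *arg)
{
    nif_state *st = arg;

    pthread_mutex_lock(&st->lock);
    st->ready = 1;
    pthread_cond_signal(&st->ready_cond);
    pthread_mutex_unlock(&st->lock);
    return NULL;
}

/* Analogue of load(): allocate the state on the calling thread, start
 * the worker, and publish the pointer only after the worker is ready.
 * Returns 0 on success, -1 on failure. */
int state_start(nif_state **out, pthread_t *tid)
{
    nif_state *st = calloc(1, sizeof *st);
    if (st == NULL)
        return -1;

    pthread_mutex_init(&st->lock, NULL);
    pthread_cond_init(&st->ready_cond, NULL);

    if (pthread_create(tid, NULL, worker_main, st) != 0) {
        pthread_cond_destroy(&st->ready_cond);
        pthread_mutex_destroy(&st->lock);
        free(st);
        return -1;
    }

    /* Block until the worker has finished its own initialisation. */
    pthread_mutex_lock(&st->lock);
    while (!st->ready)
        pthread_cond_wait(&st->ready_cond, &st->lock);
    pthread_mutex_unlock(&st->lock);

    *out = st;   /* published by the loading side, fully initialised */
    return 0;
}

/* Analogue of unload(): wait for the worker, then release the state. */
void state_stop(nif_state *st, pthread_t tid)
{
    pthread_join(tid, NULL);
    pthread_cond_destroy(&st->ready_cond);
    pthread_mutex_destroy(&st->lock);
    free(st);
}
```

With this ordering, anything that reads the published pointer never sees
it before the struct exists, and any field touched by both the callback
thread and other threads would be read and written under st->lock.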
