[erlang-questions] How to track down intermittent segfaults in a threaded NIF
Igor Clark
igor.clark@REDACTED
Tue May 29 13:30:42 CEST 2018
Thanks very much Lukas, I think the debug emulator could be what I'm
looking for. The NIF only sometimes crashes in lists:member/2 - those
log lines are all from different crashes (there's only one crashed
thread each time) - and sometimes it just crashes in process_main. So
I think I might need the debug emulator to trace further.
However, I have a lot to learn about integrating C tooling with
something this complex. When I run the debug emulator, does it just
show more detailed info in stack traces, or will I need to attach
gdb/lldb etc. to find out what's going on? Is there any more info on
how to set all this up?
Also, not 100% sure how to run it, as I run my app with "rebar3 shell"
from a release layout during development, or the same inside the
NIF-specific app when trying to track problems down there. The doc you
linked says:
> To start the debug enabled runtime system execute:
>
>     $ $ERL_TOP/bin/cerl -debug
I realise these are more rebar3 questions than Erlang questions, but I
can't find much about them in the rebar3 docs:
- How do I tell rebar3 to run "cerl" instead of "erl"?
- Should I just add "-debug" to my "config/vm.args", or is there
another way to do this?
Thank you for your help!
i
On 29/05/2018 11:30, Lukas Larsson wrote:
> Have you tried to run your code in a debug emulator?
> https://github.com/erlang/otp/blob/master/HOWTO/INSTALL.md#how-to-build-a-debug-enabled-erlang-runtime-system
>
>
> Since it seems to be segfaulting in lists:member/2, I would guess that
> your NIF somehow builds an invalid list that is later traversed by
> lists:member/2.
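>
> For example, a classic way to end up with a list like that is to let
> an ERL_NIF_TERM outlive the environment it was created in. Here's a
> minimal sketch of the safe pattern, with hypothetical names (not
> taken from your code): build the terms in a fresh process-independent
> env and send them before freeing it:
>
>     /* Sketch: sending a list to a process from a NIF-created
>      * thread. All terms live in msg_env; none of them may be
>      * stored and reused after enif_send/enif_free_env. */
>     #include <erl_nif.h>
>
>     static void send_numbers(ErlNifPid *to, int n)
>     {
>         ErlNifEnv *msg_env = enif_alloc_env();
>         ERL_NIF_TERM list = enif_make_list(msg_env, 0); /* [] */
>         for (int i = 0; i < n; i++)
>             list = enif_make_list_cell(msg_env,
>                                        enif_make_int(msg_env, i),
>                                        list);
>         enif_send(NULL, to, msg_env, list); /* NULL: no caller env */
>         enif_free_env(msg_env);             /* 'list' is dead now */
>     }
>
> If instead a term built in one env is stashed away (e.g. in a static)
> and handed to a process later, it can look fine most of the time and
> be garbage occasionally, which would match intermittent crashes in
> lists:member/2.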
>
> On Tue, May 29, 2018 at 11:04 AM, Igor Clark <igor.clark@REDACTED
> <mailto:igor.clark@REDACTED>> wrote:
>
> Thanks Sergej - that's where I got the thread reports I pasted in
> below, from e.g. 'beam.smp_2018-05-28-212735_Igor-Clarks-iMac.crash'.
>
> Each log says the only crashed thread was a scheduler thread - for
> example "8_scheduler" running "process_main" in the case of the
> first one below. These reports are how I tracked down a bunch of
> errors in my own code, but according to the Console crash logs,
> the only crashes that still happen are in the scheduler.
>
> The thing is, it seems really unlikely that a VM running my NIF
> code would just happen to be crashing in the scheduler rather than
> in my code(!) - so that's what I'm trying to work out: how to find
> out what's actually going on, given that the log tells me the
> crashed thread is running "process_main" or "lists_member_2".
>
> Any suggestions welcome!
>
> Cheers,
> Igor
>
>
> On 29/05/2018 04:16, Sergej Jurečko wrote:
>
> On macOS there is a quick way to get a stack trace if you
> compiled with debug symbols.
> Open /Applications/Utilities/Console
> Go to: User Reports
>
> You will see beam.smp in there if it crashed. Click on it and
> you get a report of what every thread was executing at the time
> of the crash.
>
>
> Regards,
> Sergej
>
> On 28 May 2018, at 23:46, Igor Clark <igor.clark@REDACTED
> <mailto:igor.clark@REDACTED>> wrote:
>
> Hi folks, hope all well,
>
> I have a NIF which very occasionally segfaults,
> intermittently and apparently unpredictably, bringing down
> the VM. I've spent a bunch of time tracing allocation and
> dereferencing problems in my NIF code, and I've got rid of
> what seems like 99%+ of the problems - but it still
> occasionally happens, and I'm having trouble tracing
> further, because the crash logs show the crashed threads
> doing things like the following (each excerpt is taken
> from a separate log in which it's the only crashed thread):
>
>
> Thread 40 Crashed:: 8_scheduler
> 0 beam.smp 0x000000001c19980b process_main + 1570
>
> Thread 5 Crashed:: 3_scheduler
> 0 beam.smp 0x000000001c01d80b process_main + 1570
>
> Thread 7 Crashed:: 5_scheduler
> 0 beam.smp 0x000000001baff0b8 lists_member_2 + 63
>
> Thread 3 Crashed:: 1_scheduler
> 0 beam.smp 0x000000001d4b780b process_main + 1570
>
> Thread 5 Crashed:: 3_scheduler
> 0 beam.smp 0x000000001fcf280b process_main + 1570
>
> Thread 6 Crashed:: 4_scheduler
> 0 beam.smp 0x000000001ae290b8 lists_member_2 + 63
>
>
> I'm very confident that the problems are in my code, not
> in the scheduler ;-) But without more detail, I don't know
> how to trace where they're happening. When the crashes
> happen, there are sometimes other threads doing things in
> my code (maybe 20% of the time) - but mostly not, and on
> the occasions when they are, I've been unable to see what
> the problem might be on the lines referenced.
>
> It seems like it's some kind of cross-thread data access
> issue, but I don't know how to track it down.
>
> Some more context about what's going on. My NIF load()
> function starts a thread, which passes a callback function
> to a library that talks to some hardware; the library
> calls the callback when it has a message. It's a separate
> thread because the library only calls back to the thread
> that initialized it: when I ran it directly in NIF load(),
> it didn't call back, but from the VM-managed thread it
> works as expected. The thread sits and waits for stuff to
> happen, and callbacks come when they should.
>
> I use enif_thread_create/enif_thread_opts_create to start
> the thread, and use enif_alloc/enif_free everywhere. I
> keep static pointers in the NIF to a couple of members of
> the state struct, as that seems to be the only way to
> reference them in the callback function. The struct is
> kept in NIF private data: I pass **priv from load() to the
> thread_main function, allocate the state struct using
> enif_alloc in thread_main, and set priv to point to the
> state struct, also from within the thread. Other NIF
> functions do access members of the state struct, but only
> ever through enif_priv_data(env).
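>
> To make that concrete, here's a stripped-down sketch of the shape
> I've described - all names are made up, the hw_* calls stand in for
> the hardware library, and like my real code it has no locking around
> the statics, which is exactly the cross-thread access I'm worried
> about:
>
>     #include <erl_nif.h>
>
>     /* hypothetical hardware-library API */
>     extern void hw_init(void (*cb)(int msg));
>     extern void hw_wait_loop(void);
>
>     typedef struct {
>         int dev_handle;
>         /* ... */
>     } state_t;
>
>     static ErlNifTid g_tid;
>     /* static pointer so the callback (which gets no user
>      * pointer from the library) can reach the state */
>     static int *g_dev_handle;
>
>     static void on_hw_message(int msg)
>     {
>         /* runs on the library's callback thread; reads
>          * *g_dev_handle with no synchronization */
>     }
>
>     static void *thread_main(void *arg)
>     {
>         void **priv = arg;
>         state_t *st = enif_alloc(sizeof(state_t));
>         st->dev_handle = 0;
>         *priv = st;                  /* set from this thread */
>         g_dev_handle = &st->dev_handle;
>         hw_init(on_hw_message);      /* must run on this thread */
>         hw_wait_loop();              /* block, wait for callbacks */
>         return NULL;
>     }
>
>     static int load(ErlNifEnv *env, void **priv, ERL_NIF_TERM info)
>     {
>         ErlNifThreadOpts *opts = enif_thread_opts_create("hw_opts");
>         /* returns 0 on success; thread_main sets *priv
>          * asynchronously, after load() has already returned */
>         return enif_thread_create("hw_thread", &g_tid, thread_main,
>                                   priv, opts);
>     }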
>
> The vast majority of the time it all works perfectly,
> humming along very nicely, but every now and then, without
> any real pattern I can see, it just segfaults and the VM
> comes down. It's only happened 3 times in the last 20+
> hours of working on the app, testing & running all the
> while, doing VM starts, stops, code reloads, etc. But when
> it happens, it's kind of a showstopper, and I'd really
> like to nail it down.
>
> This is all happening on Erlang/OTP 20.3.4 on macOS 10.12.6 /
> Apple LLVM version 9.0.0 (clang-900.0.38).
>
> Any ideas on how/where to look next to try to track this
> down? Hope it's not something structural in the above
> which just won't work.
>
> Cheers,
> Igor