[erlang-questions] How to track down intermittent segfaults in a threaded NIF
Igor Clark
igor.clark@REDACTED
Tue May 29 13:30:42 CEST 2018
Thanks very much Lukas, I think the debug emulator could be what I'm
looking for. The NIF only sometimes crashes in lists:member/2 - those
log lines are all from different crashes (there's only one crashed
thread each time) - and sometimes it just crashes in process_main. So
I think I might need the debug emulator to trace further.
However, I have a lot to learn about integrating C tooling with
something this complex. When I run the debug emulator, does it just
show more detailed info in stack traces, or will I need to attach
gdb/lldb etc. to find out what's going on? Is there any more info on
how to set all this up?
Also, not 100% sure how to run it, as I run my app with "rebar3 shell"
from a release layout during development, or the same inside the
NIF-specific app when trying to track problems down there. The doc you
linked says:
> To start the debug enabled runtime system execute:
>
>     $ $ERL_TOP/bin/cerl -debug
I realise these are more rebar3 questions than Erlang questions, but I
can't find much about them in the rebar3 docs:
- How do I tell rebar3 to run "cerl" instead of "erl"?
- Should I just add "-debug" to my "config/vm.args", or is there
another way to do this?
Thank you for your help!
i
On 29/05/2018 11:30, Lukas Larsson wrote:
> Have you tried to run your code in a debug emulator?
> https://github.com/erlang/otp/blob/master/HOWTO/INSTALL.md#how-to-build-a-debug-enabled-erlang-runtime-system
>
>
> Since it seems to be segfaulting in lists:member/2, I would guess that
> your NIF somehow builds an invalid list that is later traversed by
> lists:member/2.
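>
> For example, a classic way to end up with a list like that is to let
> an ERL_NIF_TERM outlive the environment it was created in. Here's a
> minimal sketch of the safe pattern, with hypothetical names (not
> taken from your code): build the terms in a fresh process-independent
> env and send them before freeing it:
>
>     /* Sketch: sending a list to a process from a NIF-created
>      * thread. All terms live in msg_env; none of them may be
>      * stored and reused after enif_send/enif_free_env. */
>     #include <erl_nif.h>
>
>     static void send_numbers(ErlNifPid *to, int n)
>     {
>         ErlNifEnv *msg_env = enif_alloc_env();
>         ERL_NIF_TERM list = enif_make_list(msg_env, 0); /* [] */
>         for (int i = 0; i < n; i++)
>             list = enif_make_list_cell(msg_env,
>                                        enif_make_int(msg_env, i),
>                                        list);
>         enif_send(NULL, to, msg_env, list); /* NULL: no caller env */
>         enif_free_env(msg_env);             /* 'list' is dead now */
>     }
>
> If instead a term built in one env is stashed away (e.g. in a static)
> and handed to a process later, it can look fine most of the time and
> be garbage occasionally, which would match intermittent crashes in
> lists:member/2.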
>
> On Tue, May 29, 2018 at 11:04 AM, Igor Clark <igor.clark@REDACTED
> <mailto:igor.clark@REDACTED>> wrote:
>
> Thanks Sergej - that's where I got the thread reports I pasted in
> below, from e.g. 'beam.smp_2018-05-28-212735_Igor-Clarks-iMac.crash'.
>
> Each log says the only crashed thread was a scheduler thread - for
> example "8_scheduler" running "process_main" in the case of the
> first one below. These reports are how I tracked down a bunch of
> errors in my own code, but according to the Console crash logs,
> the only crashes that still happen are in the scheduler.
>
> The thing is, it seems really unlikely that a VM running my NIF
> code would just happen to be crashing in the scheduler rather than
> in my code(!) - so that's what I'm trying to work out: how to find
> out what's actually going on, given that the log tells me the
> crashed thread is running "process_main" or "lists_member_2".
>
> Any suggestions welcome!
>
> Cheers,
> Igor
>
>
> On 29/05/2018 04:16, Sergej Jurečko wrote:
>
> On macOS there is a quick way to get a stack trace if you
> compiled with debug symbols.
> Open /Applications/Utilities/Console
> Go to: User Reports
>
> You will see beam.smp in there if it crashed. Click on it and
> you get a report of what every thread was executing at the time
> of the crash.
>
>
> Regards,
> Sergej
>
> On 28 May 2018, at 23:46, Igor Clark <igor.clark@REDACTED
> <mailto:igor.clark@REDACTED>> wrote:
>
> Hi folks, hope all well,
>
> I have a NIF which very occasionally segfaults,
> intermittently and apparently unpredictably, bringing down
> the VM. I've spent a bunch of time tracing allocation and
> dereferencing problems in my NIF code, and I've got rid of
> what seems like 99%+ of the problems - but it still
> occasionally happens, and I'm having trouble tracing
> further, because the crash logs show the crashed threads
> doing things like the following (each excerpt is taken
> from a separate log in which it's the only crashed thread):
>
>
> Thread 40 Crashed:: 8_scheduler
> 0 beam.smp 0x000000001c19980b process_main + 1570
>
> Thread 5 Crashed:: 3_scheduler
> 0 beam.smp 0x000000001c01d80b process_main + 1570
>
> Thread 7 Crashed:: 5_scheduler
> 0 beam.smp 0x000000001baff0b8 lists_member_2 + 63
>
> Thread 3 Crashed:: 1_scheduler
> 0 beam.smp 0x000000001d4b780b process_main + 1570
>
> Thread 5 Crashed:: 3_scheduler
> 0 beam.smp 0x000000001fcf280b process_main + 1570
>
> Thread 6 Crashed:: 4_scheduler
> 0 beam.smp 0x000000001ae290b8 lists_member_2 + 63
>
>
> I'm very confident that the problems are in my code, not
> in the scheduler ;-) But without more detail, I don't know
> how to trace where they're happening. When the crashes
> happen, there are sometimes other threads doing things in
> my code (maybe 20% of the time) - but mostly not, and on
> the occasions when they are, I've been unable to see what
> the problem might be on the lines referenced.
>
> It seems like it's some kind of cross-thread data access
> issue, but I don't know how to track it down.
>
> Some more context about what's going on. My NIF load()
> function starts a thread, which passes a callback function
> to a library that talks to some hardware; the library
> calls the callback when it has a message. It's a separate
> thread because the library only calls back to the thread
> that initialized it: when I ran it directly in NIF load(),
> it didn't call back, but from the VM-managed thread it
> works as expected. The thread sits and waits for stuff to
> happen, and callbacks come when they should.
>
> I use enif_thread_create/enif_thread_opts_create to start
> the thread, and use enif_alloc/enif_free everywhere. I
> keep static pointers in the NIF to a couple of members of
> the state struct, as that seems to be the only way to
> reference them in the callback function. The struct is
> kept in NIF private data: I pass **priv from load() to the
> thread_main function, allocate the state struct using
> enif_alloc in thread_main, and set priv to point to the
> state struct, also from within the thread. Other NIF
> functions do access members of the state struct, but only
> ever through enif_priv_data(env).
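>
> To make that concrete, here's a stripped-down sketch of the shape
> I've described - all names are made up, the hw_* calls stand in for
> the hardware library, and like my real code it has no locking around
> the statics, which is exactly the cross-thread access I'm worried
> about:
>
>     #include <erl_nif.h>
>
>     /* hypothetical hardware-library API */
>     extern void hw_init(void (*cb)(int msg));
>     extern void hw_wait_loop(void);
>
>     typedef struct {
>         int dev_handle;
>         /* ... */
>     } state_t;
>
>     static ErlNifTid g_tid;
>     /* static pointer so the callback (which gets no user
>      * pointer from the library) can reach the state */
>     static int *g_dev_handle;
>
>     static void on_hw_message(int msg)
>     {
>         /* runs on the library's callback thread; reads
>          * *g_dev_handle with no synchronization */
>     }
>
>     static void *thread_main(void *arg)
>     {
>         void **priv = arg;
>         state_t *st = enif_alloc(sizeof(state_t));
>         st->dev_handle = 0;
>         *priv = st;                  /* set from this thread */
>         g_dev_handle = &st->dev_handle;
>         hw_init(on_hw_message);      /* must run on this thread */
>         hw_wait_loop();              /* block, wait for callbacks */
>         return NULL;
>     }
>
>     static int load(ErlNifEnv *env, void **priv, ERL_NIF_TERM info)
>     {
>         ErlNifThreadOpts *opts = enif_thread_opts_create("hw_opts");
>         /* returns 0 on success; thread_main sets *priv
>          * asynchronously, after load() has already returned */
>         return enif_thread_create("hw_thread", &g_tid, thread_main,
>                                   priv, opts);
>     }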
>
> The vast majority of the time it all works perfectly,
> humming along very nicely, but every now and then, without
> any real pattern I can see, it just segfaults and the VM
> comes down. It's only happened 3 times in the last 20+
> hours of working on the app, testing & running all the
> while, doing VM starts, stops, code reloads, etc. But when
> it happens, it's kind of a showstopper, and I'd really
> like to nail it down.
>
> This is all happening on Erlang/OTP 20.3.4 on macOS 10.12.6 /
> Apple LLVM version 9.0.0 (clang-900.0.38).
>
> Any ideas on how/where to look next to try to track this
> down? Hope it's not something structural in the above
> which just won't work.
>
> Cheers,
> Igor