[erlang-questions] How to track down intermittent segfaults in a threaded NIF
Igor Clark
igor.clark@REDACTED
Mon May 28 23:46:56 CEST 2018
Hi folks, hope all well,
I have a NIF which very occasionally segfaults, intermittently and
apparently unpredictably, bringing down the VM. I've spent a bunch of
time tracing allocation and dereferencing problems in my NIF code, and
I've got rid of what seems like 99%+ of the problems - but it still
occasionally happens, and I'm having trouble tracing further, because
the crash logs show the crashed threads as doing things like these:
(each one taken from a separate log where it's the only crashed thread)
> Thread 40 Crashed:: 8_scheduler
> 0 beam.smp 0x000000001c19980b process_main + 1570
>
> Thread 5 Crashed:: 3_scheduler
> 0 beam.smp 0x000000001c01d80b process_main + 1570
>
> Thread 7 Crashed:: 5_scheduler
> 0 beam.smp 0x000000001baff0b8 lists_member_2 + 63
>
> Thread 3 Crashed:: 1_scheduler
> 0 beam.smp 0x000000001d4b780b process_main + 1570
>
> Thread 5 Crashed:: 3_scheduler
> 0 beam.smp 0x000000001fcf280b process_main + 1570
>
> Thread 6 Crashed:: 4_scheduler
> 0 beam.smp 0x000000001ae290b8 lists_member_2 + 63
I'm very confident that the problems are in my code, not in the
scheduler ;-) But without more detail, I don't know how to trace where
they're happening. When they do, there are sometimes other threads doing
things in my code (maybe 20% of the time) - but mostly not, and on the
occasions when they are, I've been unable to see what the problem might
be on the lines referenced.
It seems like it's some kind of cross-thread data access issue, but I
don't know how to track it down.
Some more context about what's going on. My NIF load() function starts a
thread, which passes a callback function to a library that talks to some
hardware; the library calls the callback whenever it has a message. It's
a separate thread because the library only calls back on the thread that
initialized it: when I initialized it directly in NIF load(), the
callback was never called, but in the VM-managed thread it works as
expected. The thread sits and waits for stuff to happen, and callbacks
come when they should.
I use enif_thread_create/enif_thread_opts_create to start the thread,
and use enif_alloc/enif_free everywhere. I keep a static pointer in the
NIF to a couple of members of the state struct, as that seems the only
way to reference them in the callback function. The struct is kept in
NIF private data: I pass the **priv argument from load() to the
thread_main function, allocate the state struct using enif_alloc in
thread_main, and set *priv to point to the state struct, also in the
thread. Other NIF functions do access members of the state struct, but
only ever through enif_priv_data( env ).
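A rough sketch of the shape of the arrangement described above may make it easier to follow. It is not the actual NIF code: POSIX threads and malloc stand in for enif_thread_create and enif_alloc so that it compiles without erl_nif.h, and every name in it (state_t, g_msg_count, on_message, load_sketch, read_msg_count) is hypothetical.

```c
/* Sketch only: pthreads and malloc stand in for enif_thread_create and
 * enif_alloc. All names here are hypothetical, not taken from the real NIF. */
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

typedef struct {
    int device_handle;  /* hypothetical member */
    int msg_count;      /* hypothetical member touched by the callback */
} state_t;

/* Static pointer to a state member, so the library callback can reach it. */
static int *g_msg_count = NULL;

/* Handle of the worker thread, kept so it can be joined later. */
pthread_t g_tid;

/* Callback the hardware library would invoke on the thread that set it up. */
static void on_message(int msg)
{
    (void)msg;
    if (g_msg_count != NULL)
        (*g_msg_count)++;
}

static void *thread_main(void *arg)
{
    void **priv = arg;                 /* the priv slot handed over by load() */
    state_t *st = malloc(sizeof *st);  /* enif_alloc in the real NIF */
    if (st == NULL)
        return NULL;
    st->device_handle = 0;
    st->msg_count = 0;
    g_msg_count = &st->msg_count;
    *priv = st;     /* published from the thread, not from load() itself */
    on_message(1);  /* stands in for the library delivering one message */
    return NULL;
}

/* Mirrors load(): start the thread and return without waiting for *priv. */
int load_sketch(void **priv)
{
    *priv = NULL;
    return pthread_create(&g_tid, NULL, thread_main, priv);
}

/* Mirrors the other NIF functions, which read state via enif_priv_data(env). */
int read_msg_count(void *priv)
{
    assert(priv != NULL);  /* only valid once thread_main has published it */
    return ((state_t *)priv)->msg_count;
}
```

In this shape the private data pointer is written from the thread rather than from load_sketch(), so a reader that runs before the thread has reached that line sees NULL.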
The vast majority of the time it all works perfectly, humming along very
nicely, but every now and then, without any real pattern I can see, it
just segfaults and the VM comes down. It's only happened 3 times in the
last 20+ hours of working on the app, testing & running all the while,
doing VM starts, stops, code reloads, etc. But when it happens, it's
kind of a showstopper, and I'd really like to nail it down.
This is all happening in Erlang/OTP 20.3.4 on macOS 10.12.6 / Apple LLVM
version 9.0.0 (clang-900.0.38).
Any ideas on how/where to look next to try to track this down? Hope it's
not something structural in the above which just won't work.
Cheers,
Igor