[erlang-questions] How to track down intermittent segfaults in a threaded NIF

Igor Clark igor.clark@REDACTED
Mon May 28 23:46:56 CEST 2018


Hi folks, hope all well,

I have a NIF which very occasionally segfaults, intermittently and 
apparently unpredictably, bringing down the VM. I've spent a bunch of 
time tracing allocation and dereferencing problems in my NIF code, and 
I've got rid of what seems like 99%+ of the problems - but it still 
occasionally happens, and I'm having trouble tracing further, because 
the crash logs show the crashed threads doing things like the following 
(each entry is taken from a separate log in which it's the only crashed 
thread):


> Thread 40 Crashed:: 8_scheduler
> 0   beam.smp                          0x000000001c19980b process_main + 1570
>
> Thread 5 Crashed:: 3_scheduler
> 0   beam.smp                          0x000000001c01d80b process_main + 1570
>
> Thread 7 Crashed:: 5_scheduler
> 0   beam.smp                          0x000000001baff0b8 lists_member_2 + 63
>
> Thread 3 Crashed:: 1_scheduler
> 0   beam.smp                          0x000000001d4b780b process_main + 1570
>
> Thread 5 Crashed:: 3_scheduler
> 0   beam.smp                          0x000000001fcf280b process_main + 1570
>
> Thread 6 Crashed:: 4_scheduler
> 0   beam.smp                          0x000000001ae290b8 lists_member_2 + 63


I'm very confident that the problems are in my code, not in the 
scheduler ;-) But without more detail, I don't know how to trace where 
they're happening. When they do, there are sometimes other threads doing 
things in my code (maybe 20% of the time) - but mostly not, and on the 
occasions when they are, I've been unable to see what the problem might 
be on the lines referenced.

It seems like it's some kind of cross-thread data access issue, but I 
don't know how to track it down.

Some more context about what's going on. My NIF load() function starts a 
thread which passes a callback function to a library that talks to some 
hardware, which calls the callback when it has a message. It's a 
separate thread because the library only calls back to the thread that 
initialized it; when I ran it directly in NIF load(), it didn't call 
back, but in the VM-managed thread, it works as expected. The thread 
sits and waits for stuff to happen, and callbacks come when they should.
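
To make the threading arrangement concrete, here is a minimal sketch of the 
pattern, not the actual NIF code. It assumes plain pthreads in place of 
enif_thread_create, and a fake library (hw_run, hypothetical) that delivers 
its callbacks on whichever thread initialized it. The names hw_callback, 
hw_run, on_message and start_listener are all made up for illustration.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical callback type; the real library's signature will differ. */
typedef void (*hw_callback)(int message);

static pthread_t init_thread;          /* thread that initialized the fake library */
static int callbacks_on_init_thread;   /* callbacks seen on that same thread */

/* Callback: counts the calls that arrive on the initializing thread. */
static void on_message(int message) {
    (void)message;
    if (pthread_equal(pthread_self(), init_thread))
        callbacks_on_init_thread++;
}

/* Fake library (assumption): remembers the calling thread and delivers
 * n messages to the callback on that thread. */
static void hw_run(hw_callback cb, int n) {
    init_thread = pthread_self();
    for (int i = 0; i < n; i++)
        cb(i);
}

/* Thread body: initializes the library, so callbacks land here. */
static void *thread_main(void *arg) {
    hw_run(on_message, *(int *)arg);
    return NULL;
}

/* Stand-in for what load() does: start the thread and return at once. */
static int start_listener(pthread_t *tid, int *n) {
    return pthread_create(tid, NULL, thread_main, n);
}
```

A caller would run start_listener(&tid, &n) and later pthread_join(tid, NULL); 
in the real NIF the thread instead stays alive for as long as the library is 
loaded.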

I use enif_thread_create/enif_thread_opts_create to start the thread, 
and use enif_alloc/enif_free everywhere. I keep a static pointer in the 
NIF to a couple of members of the state struct, as that seems the only 
way to reference them in the callback function. The struct is kept in 
NIF private data: I pass **priv from load() to the thread_main function, 
allocate the state struct using enif_alloc in thread_main, and set priv 
pointing to the state struct, also in the thread. Other NIF functions do 
access members of the state struct, but only ever through 
enif_priv_data( env ).
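
The state handling described above could be sketched roughly as follows. 
This is a simplified model under stated assumptions, not the real code: 
malloc and pthread_create stand in for enif_alloc and enif_thread_create, 
and nif_state, its members, fake_load and the simulated callback are 
hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical state struct; the real one has different members. */
typedef struct {
    int message_count;
    int last_message;
} nif_state;

/* Static pointers to a couple of members of the state struct, so the
 * callback (which gets no user-data argument) can reach them. */
static int *cb_message_count;
static int *cb_last_message;

static void on_message(int message) {
    (*cb_message_count)++;
    *cb_last_message = message;
}

/* Thread body: allocates the state, publishes it through *priv, and
 * sets up the static pointers, all inside the thread. */
static void *thread_main(void *arg) {
    void **priv = arg;
    nif_state *state = malloc(sizeof *state);   /* enif_alloc in the NIF */
    if (state == NULL)
        return NULL;
    state->message_count = 0;
    state->last_message = -1;
    cb_message_count = &state->message_count;
    cb_last_message = &state->last_message;
    *priv = state;
    on_message(42);   /* one simulated callback from the library */
    return NULL;
}

/* Stand-in for load(): hands priv to the thread and returns at once. */
static int fake_load(void **priv, pthread_t *tid) {
    *priv = NULL;
    return pthread_create(tid, NULL, thread_main, priv);
}
```

In this sketch, *priv stays NULL until thread_main has stored the pointer, so 
a reader has to join the thread (or otherwise wait) before using it; the real 
NIF does not join, since its thread keeps running.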

The vast majority of the time it all works perfectly, humming along very 
nicely, but every now and then, without any real pattern I can see, it 
just segfaults and the VM comes down. It's only happened 3 times in the 
last 20+ hours of working on the app, testing & running all the while, 
doing VM starts, stops, code reloads, etc. But when it happens, it's 
kind of a showstopper, and I'd really like to nail it down.

This is all happening in Erlang/OTP 20.3.4 on macOS 10.12.6 / Apple LLVM 
version 9.0.0 (clang-900.0.38).

Any ideas on how/where to look next to try to track this down? Hope it's 
not something structural in the above which just won't work.

Cheers,
Igor




