[erlang-questions] How to track down intermittent segfaults in a threaded NIF
Igor Clark
igor.clark@REDACTED
Tue May 29 11:04:22 CEST 2018
Thanks Sergej - that's where I got the thread reports I pasted in below,
from e.g. 'beam.smp_2018-05-28-212735_Igor-Clarks-iMac.crash'.
Each log says the only crashed thread was a scheduler thread - for
example "8_scheduler" running "process_main" in the case of the first
one below. That's how I tracked down a bunch of errors in my own code,
but according to the Console crash logs, the only crashes that still
happen are in the scheduler.
The thing is, it seems really unlikely that a VM running my NIF code
would just happen to be crashing in the scheduler rather than in my
code(!) - so what I'm trying to work out is how to find out what's
actually going on, given that the log only tells me the crashed thread
was running "process_main" or "lists_member_2".
Any suggestions welcome!
Cheers,
Igor
On 29/05/2018 04:16, Sergej Jurečko wrote:
> On macOS there is a quick way to get a stack trace if you compiled with debug symbols.
> Open /Applications/Utilities/Console
> Go to: User Reports
>
> You will see beam.smp in there if it crashed. Click on it and you get a report of what every thread was calling at the time of the crash.
>
>
> Regards,
> Sergej
>
>> On 28 May 2018, at 23:46, Igor Clark <igor.clark@REDACTED> wrote:
>>
>> Hi folks, hope all well,
>>
>> I have a NIF which very occasionally segfaults, intermittently and apparently unpredictably, bringing down the VM. I've spent a bunch of time tracing allocation and dereferencing problems in my NIF code, and I've got rid of what seems like 99%+ of the problems - but it still occasionally happens, and I'm having trouble tracing any further, because the crash logs show the crashed threads doing things like the following (each excerpt is taken from a separate log in which it's the only crashed thread):
>>
>>
>>> Thread 40 Crashed:: 8_scheduler
>>> 0 beam.smp 0x000000001c19980b process_main + 1570
>>>
>>> Thread 5 Crashed:: 3_scheduler
>>> 0 beam.smp 0x000000001c01d80b process_main + 1570
>>>
>>> Thread 7 Crashed:: 5_scheduler
>>> 0 beam.smp 0x000000001baff0b8 lists_member_2 + 63
>>>
>>> Thread 3 Crashed:: 1_scheduler
>>> 0 beam.smp 0x000000001d4b780b process_main + 1570
>>>
>>> Thread 5 Crashed:: 3_scheduler
>>> 0 beam.smp 0x000000001fcf280b process_main + 1570
>>>
>>> Thread 6 Crashed:: 4_scheduler
>>> 0 beam.smp 0x000000001ae290b8 lists_member_2 + 63
>>
>> I'm very confident that the problems are in my code, not in the scheduler ;-) But without more detail, I don't know how to trace where they're happening. When a crash happens, other threads are sometimes (maybe 20% of the time) running my code - but mostly they aren't, and even when they are, I can't see what the problem might be on the lines referenced.
>>
>> It seems like it's some kind of cross-thread data access issue, but I don't know how to track it down.
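>>
>> If it is that, one blunt way I can think of to test the theory would be to serialize every touch of the shared state behind an ErlNifMutex - a rough sketch of the idea, not my actual code:
>>
>>     #include "erl_nif.h"
>>
>>     static ErlNifMutex *state_mtx;
>>
>>     static int load(ErlNifEnv *env, void **priv, ERL_NIF_TERM info)
>>     {
>>         state_mtx = enif_mutex_create("my_nif_state_mtx");
>>         /* ... start the callback thread as before ... */
>>         return 0;
>>     }
>>
>>     /* every reader/writer of the shared state - the hardware
>>        callback and the ordinary NIF functions alike - would
>>        bracket the access like this: */
>>     static void touch_state(void)
>>     {
>>         enif_mutex_lock(state_mtx);
>>         /* ... read or write the state struct here ... */
>>         enif_mutex_unlock(state_mtx);
>>     }
>>
>> If the segfaults stopped with everything serialized like that, it would at least confirm it's a data race rather than a plain bad pointer.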
>>
>> Some more context about what's going on. My NIF load() function starts a thread, which registers a callback function with a library that talks to some hardware; the library calls the callback whenever it has a message. It has to be a separate thread because the library only calls back to the thread that initialized it: when I initialized it directly in NIF load(), the callback never fired, but from the VM-managed thread it works as expected. The thread just sits and waits, and the callbacks come when they should.
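>>
>> Simplified, the shape is something like this - the hw_lib_* names are stand-ins for the real library's API, not its actual function names:
>>
>>     #include "erl_nif.h"
>>
>>     /* stand-ins for the hardware library's API */
>>     extern void hw_lib_init(void);
>>     extern void hw_lib_set_callback(void (*cb)(const char *msg));
>>     extern void hw_lib_run(void);   /* blocks; delivers callbacks */
>>
>>     /* called by the library whenever the hardware has a message */
>>     static void on_hw_message(const char *msg)
>>     {
>>         /* ... handle the message ... */
>>     }
>>
>>     static void *thread_main(void *arg)
>>     {
>>         /* the library only calls back to the thread that initialized
>>            it, so both init and callback registration happen here */
>>         hw_lib_init();
>>         hw_lib_set_callback(on_hw_message);
>>         hw_lib_run();
>>         return NULL;
>>     }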
>>
>> I use enif_thread_create/enif_thread_opts_create to start the thread, and enif_alloc/enif_free everywhere. I keep static pointers in the NIF to a couple of members of the state struct, as that seems to be the only way to reference them from the callback function. The struct itself lives in the NIF private data: I pass **priv from load() into the thread_main function, allocate the state struct with enif_alloc in thread_main, and set priv to point at the state struct, also from inside the thread. Other NIF functions do access members of the state struct, but only ever through enif_priv_data(env).
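>>
>> Sketched out, with the struct members simplified to placeholders, it looks roughly like this:
>>
>>     #include "erl_nif.h"
>>
>>     typedef struct {
>>         int device_id;       /* placeholder members, not my real ones */
>>         void *msg_queue;
>>     } nif_state;
>>
>>     /* static aliases so the callback (which gets no env, hence no
>>        enif_priv_data) can reach these two members */
>>     static int *g_device_id;
>>     static void **g_msg_queue;
>>
>>     static ErlNifTid g_tid;
>>
>>     static void *thread_main(void *arg)
>>     {
>>         void **priv = (void **)arg;   /* the **priv passed to load() */
>>         nif_state *st = enif_alloc(sizeof(nif_state));
>>         /* ... initialize st and the hardware library ... */
>>         *priv = st;                   /* NIFs read it via enif_priv_data(env) */
>>         g_device_id = &st->device_id;
>>         g_msg_queue = &st->msg_queue;
>>         /* ... wait for callbacks ... */
>>         return NULL;
>>     }
>>
>>     static int load(ErlNifEnv *env, void **priv, ERL_NIF_TERM info)
>>     {
>>         ErlNifThreadOpts *opts = enif_thread_opts_create("hw_thread_opts");
>>         int rc = enif_thread_create("hw_thread", &g_tid, thread_main,
>>                                     priv, opts);
>>         enif_thread_opts_destroy(opts);
>>         return rc;  /* 0 on success */
>>     }
>>
>> (One thing I notice writing it out: *priv gets set from the spawned thread, possibly after load() has already returned - I don't know whether that matters.)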
>>
>> The vast majority of the time it all works perfectly, humming along very nicely, but every now and then, without any real pattern I can see, it just segfaults and the VM comes down. It's only happened 3 times in the last 20+ hours of working on the app, testing & running all the while, doing VM starts, stops, code reloads, etc. But when it happens, it's kind of a showstopper, and I'd really like to nail it down.
>>
>> This is all happening in Erlang 20.3.4 on macOS 10.12.6 / Apple LLVM version 9.0.0 (clang-900.0.38).
>>
>> Any ideas on how/where to look next to track this down? I hope it's not something structural in the above that just won't work.
>>
>> Cheers,
>> Igor
>>
>>
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions