[erlang-questions] How to track down intermittent segfaults in a threaded NIF

Lukas Larsson lukas@REDACTED
Tue May 29 12:30:16 CEST 2018


Have you tried to run your code in a debug emulator?
https://github.com/erlang/otp/blob/master/HOWTO/INSTALL.md#how-to-build-a-debug-enabled-erlang-runtime-system

Since it seems to be segfaulting in lists:member/2, I would guess that your
NIF somehow builds an invalid list that is later used by lists:member/2.
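
For example, one common way for that to happen (function names below are made
up purely for illustration) is keeping a term beyond the lifetime of the
environment it was created in, and then consing it into a list later. Roughly:

    #include "erl_nif.h"

    /* Hypothetical sketch: argv[] terms are only valid for the duration of
     * the call, so stashing one in a static and reusing it later yields a
     * dangling term. */
    static ERL_NIF_TERM saved_elem;

    static ERL_NIF_TERM remember(ErlNifEnv* env, int argc,
                                 const ERL_NIF_TERM argv[])
    {
        saved_elem = argv[0];              /* BUG: term outlives its env */
        return enif_make_atom(env, "ok");
    }

    static ERL_NIF_TERM build_list(ErlNifEnv* env, int argc,
                                   const ERL_NIF_TERM argv[])
    {
        /* The cons cell is new, but its head may point at reused heap,
         * so lists:member/2 walking the result can segfault. */
        return enif_make_list_cell(env, saved_elem, enif_make_list(env, 0));
    }

If you do need to keep a term across calls, or hand it to another thread,
copy it into a process-independent environment first (enif_alloc_env +
enif_make_copy).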

On Tue, May 29, 2018 at 11:04 AM, Igor Clark <igor.clark@REDACTED> wrote:

> Thanks Sergej - that's where I got the thread reports I pasted in below,
> from e.g. 'beam.smp_2018-05-28-212735_Igor-Clarks-iMac.crash'.
>
> Each log says the only crashed thread was a scheduler thread, for example
> "8_scheduler" running "process_main" in the case of the first one below.
> This is how I tracked down a bunch of errors in my own code, but the only
> crashes that still happen are in scheduler threads, according to the Console
> crash logs.
>
> The thing is, it seems really unlikely that a VM running my NIF code would
> just happen to be crashing in the scheduler rather than in my code(!) - so
> that's what I'm trying to work out: how to find out what's actually going
> on, given that the log tells me the crashed thread is running
> "process_main" or "lists_member_2".
>
> Any suggestions welcome!
>
> Cheers,
> Igor
>
>
> On 29/05/2018 04:16, Sergej Jurečko wrote:
>
>> On macOS there is a quick way to get a stack trace if you compiled with
>> debug symbols.
>> Open /Applications/Utilities/Console
>> Go to: User Reports
>>
>> You will see beam.smp in there if it crashed. Click on it and you get a
>> report of what every thread was calling at the time of the crash.
>>
>>
>> Regards,
>> Sergej
>>
>> On 28 May 2018, at 23:46, Igor Clark <igor.clark@REDACTED> wrote:
>>>
>>> Hi folks, hope all well,
>>>
>>> I have a NIF which very occasionally segfaults, intermittently and
>>> apparently unpredictably, bringing down the VM. I've spent a bunch of time
>>> tracing allocation and dereferencing problems in my NIF code, and I've got
>>> rid of what seems like 99%+ of the problems - but it still occasionally
>>> happens, and I'm having trouble tracing further, because the crash logs
>>> show the crashed threads doing things like these (each one taken from a
>>> separate log where it's the only crashed thread):
>>>
>>>
>>>> Thread 40 Crashed:: 8_scheduler
>>>> 0   beam.smp                          0x000000001c19980b process_main + 1570
>>>>
>>>> Thread 5 Crashed:: 3_scheduler
>>>> 0   beam.smp                          0x000000001c01d80b process_main + 1570
>>>>
>>>> Thread 7 Crashed:: 5_scheduler
>>>> 0   beam.smp                          0x000000001baff0b8 lists_member_2 + 63
>>>>
>>>> Thread 3 Crashed:: 1_scheduler
>>>> 0   beam.smp                          0x000000001d4b780b process_main + 1570
>>>>
>>>> Thread 5 Crashed:: 3_scheduler
>>>> 0   beam.smp                          0x000000001fcf280b process_main + 1570
>>>>
>>>> Thread 6 Crashed:: 4_scheduler
>>>> 0   beam.smp                          0x000000001ae290b8 lists_member_2 + 63
>>>>
>>>
>>> I'm very confident that the problems are in my code, not in the
>>> scheduler ;-) But without more detail, I don't know how to trace where
>>> they're happening. When they do happen, there are sometimes other threads
>>> doing things in my code (maybe 20% of the time) - but mostly there aren't,
>>> and on the occasions when there are, I've been unable to see what the
>>> problem might be on the lines referenced.
>>>
>>> It seems like it's some kind of cross-thread data access issue, but I
>>> don't know how to track it down.
>>>
>>> Some more context about what's going on. My NIF load() function starts a
>>> thread which passes a callback function to a library that talks to some
>>> hardware, which calls the callback when it has a message. It's a separate
>>> thread because the library only calls back to the thread that initialized
>>> it; when I ran it directly in NIF load(), it didn't call back, but in the
>>> VM-managed thread, it works as expected. The thread sits and waits for
>>> stuff to happen, and callbacks come when they should.
>>>
>>> I use enif_thread_create/enif_thread_opts_create to start the thread,
>>> and use enif_alloc/enif_free everywhere. I keep a static pointer in the NIF
>>> to a couple of members of the state struct, as that seems the only way to
>>> reference them in the callback function. The struct is kept in NIF private
>>> data: I pass **priv from load() to the thread_main function, allocate the
>>> state struct with enif_alloc in thread_main, and set *priv to point at the
>>> state struct, also in the thread. Other NIF functions do access members of
>>> the state struct, but only ever through enif_priv_data(env).
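>>>
>>> Stripped right down (names and details changed/hypothetical, error
>>> handling omitted), the structure looks roughly like this:
>>>
>>>     #include "erl_nif.h"
>>>
>>>     typedef struct { void* hw_handle; int last_msg; } state_t;
>>>
>>>     /* static pointers so the hardware callback can reach the state */
>>>     static int*   cb_last_msg;
>>>     static void** cb_hw_handle;
>>>     static ErlNifTid cb_tid;
>>>
>>>     static void on_hw_message(int msg)     /* invoked by the hw library */
>>>     {
>>>         *cb_last_msg = msg;                /* unsynchronized write */
>>>     }
>>>
>>>     static void* thread_main(void* arg)
>>>     {
>>>         void** priv = (void**) arg;
>>>         state_t* st = enif_alloc(sizeof(state_t));
>>>         cb_last_msg  = &st->last_msg;
>>>         cb_hw_handle = &st->hw_handle;
>>>         *priv = st;                        /* published from this thread */
>>>         /* hw_lib_init(on_hw_message); hw_lib_run();  <- blocks here */
>>>         return NULL;
>>>     }
>>>
>>>     static int load(ErlNifEnv* env, void** priv, ERL_NIF_TERM info)
>>>     {
>>>         return enif_thread_create("hw_cb", &cb_tid, thread_main, priv, NULL);
>>>     }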
>>>
>>> The vast majority of the time it all works perfectly, humming along very
>>> nicely, but every now and then, without any real pattern I can see, it just
>>> segfaults and the VM comes down. It's only happened 3 times in the last 20+
>>> hours of working on the app, testing & running all the while, doing VM
>>> starts, stops, code reloads, etc. But when it happens, it's kind of a
>>> showstopper, and I'd really like to nail it down.
>>>
>>> This is all happening in Erlang 20.3.4 on macOS 10.12.6 / Apple LLVM
>>> version 9.0.0 (clang-900.0.38).
>>>
>>> Any ideas on how/where to look next to try to track this down? Hope it's
>>> not something structural in the above which just won't work.
>>>
>>> Cheers,
>>> Igor
>>>
>>>
>>