[erlang-questions] How to track down intermittent segfaults in a threaded NIF

Tue May 29 14:45:46 CEST 2018

I don't know how to make rebar3 run the debug emulator, but a quick and
dirty trick that I do when all else fails is to copy the beam.debug.smp
file over the beam.smp file.

You probably also have to copy the erl_child_setup.debug file. Unlike the
emulator, that file should keep its .debug suffix. So:

cp bin/`erts/autoconf/config.guess`/beam.debug.smp \
   path/to/release/erts-v.s.n/bin/beam.smp
cp bin/`erts/autoconf/config.guess`/erl_child_setup.debug \
   path/to/release/erts-v.s.n/bin/
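
If you want to be able to undo the copy, here is a minimal sketch of the
same two steps as a shell function that first backs up the original
beam.smp. The function name, the .orig backup suffix and the two directory
arguments are my own assumptions for illustration, not something rebar3 or
OTP provides:

```shell
#!/bin/sh
# Sketch: swap the debug emulator into a release, keeping a backup of the
# original beam.smp so the change can be undone.
#
# Assumptions (illustrative only): the function name, the .orig backup
# suffix, and the meaning of the two arguments.
#   $1 = directory holding the debug build's binaries,
#        e.g. bin/`erts/autoconf/config.guess` inside the OTP source tree
#   $2 = the release's erts bin directory,
#        e.g. path/to/release/erts-v.s.n/bin

install_debug_emulator() {
    src=$1
    dst=$2

    # Refuse to touch the release unless both debug files exist.
    for f in beam.debug.smp erl_child_setup.debug; do
        if [ ! -f "$src/$f" ]; then
            echo "missing $src/$f" >&2
            return 1
        fi
    done

    # Back up the normal emulator once; never overwrite an existing backup,
    # otherwise a second run would save the debug emulator as the "original".
    if [ -f "$dst/beam.smp" ] && [ ! -f "$dst/beam.smp.orig" ]; then
        cp "$dst/beam.smp" "$dst/beam.smp.orig" || return 1
    fi

    # The debug emulator takes over the beam.smp name ...
    cp "$src/beam.debug.smp" "$dst/beam.smp" || return 1
    # ... while erl_child_setup.debug keeps its .debug suffix.
    cp "$src/erl_child_setup.debug" "$dst/" || return 1
}
```

Call it with the same two paths as in the cp commands above. To go back to
the normal emulator, copy beam.smp.orig over beam.smp again.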

On Tue, May 29, 2018 at 1:30 PM, Igor Clark <igor.clark@REDACTED> wrote:

> Thanks very much Lukas, I think the debug emulator could be what I'm
> looking for. The NIF only sometimes crashes on lists:member/2 - those log
> lines are all from different crashes (there's only one crashed thread each
> time), and sometimes it just crashes on process_main. So I think I might
> need the debug emulator to trace further.
>
> However I have a lot to learn about how to integrate C tooling with
> something so complex. When I run the debug emulator, does it just show more
> detailed info in stack traces, or will I need to attach gdb/lldb etc to
> find out what's going on? Is there any more info on how to set this all up?
>
> Also, not 100% sure how to run it, as I run my app with "rebar3 shell"
> from a release layout during development, or the same inside the
> NIF-specific app when trying to track problems down there. The doc you
> linked says:
>
> To start the debug enabled runtime system execute:
>
> $ $ERL_TOP/bin/cerl -debug
>
>
> I realise these are more rebar3 than erlang questions, but I can't find
> much in the rebar3 docs about them:
>
> - How should I specify that rebar3 should run "cerl" instead of "erl" ?
>
> - Should I just add "-debug" in my "config/vm.args" or is there another
> way to do this?
>
> Thank you for your help!
> i
>
>
> On 29/05/2018 11:30, Lukas Larsson wrote:
>
> Have you tried to run your code in a debug emulator?
> https://github.com/erlang/otp/blob/master/HOWTO/INSTALL.md#how-to-build-a-debug-enabled-erlang-runtime-system
>
> Since it seems to be segfaulting in lists:member/2, I would guess that
> your nif somehow builds an invalid list that later is used by
> lists:member/2.
>
> On Tue, May 29, 2018 at 11:04 AM, Igor Clark <igor.clark@REDACTED> wrote:
>
>> Thanks Sergej - that's where I got the thread reports I pasted in below,
>> from e.g. 'beam.smp_2018-05-28-212735_Igor-Clarks-iMac.crash'.
>>
>> Each log says the only crashed thread was a scheduler thread, for example
>> "8_scheduler" running "process_main" in the case of the first one below.
>> This is how I tracked down a bunch of errors in my own code, but the only
>> ones that still happen are in the scheduler, according to the Console crash
>> logs.
>>
>> The thing is, it seems really unlikely that a VM running my NIF code
>> would just happen to be crashing in the scheduler rather than my code(!) -
>> so that's what I'm trying to work out, how to find out what's actually
>> going on, given that the log tells me the crashed thread is running
>> "process_main" or 'lists_member_2'.
>>
>> Any suggestions welcome!
>>
>> Cheers,
>> Igor
>>
>>
>> On 29/05/2018 04:16, Sergej Jurečko wrote:
>>
>>> On macOS there is a quick way to get a stack trace if you compiled with
>>> debug symbols.
>>> Open /Applications/Utilities/Console
>>> Go to: User Reports
>>>
>>> You will see beam.smp in there if it crashed. Click on it and you get a
>>> report of what every thread was calling at the time of the crash.
>>>
>>>
>>> Regards,
>>> Sergej
>>>
>>> On 28 May 2018, at 23:46, Igor Clark <igor.clark@REDACTED> wrote:
>>>>
>>>> Hi folks, hope all well,
>>>>
>>>> I have a NIF which very occasionally segfaults, intermittently and
>>>> apparently unpredictably, bringing down the VM. I've spent a bunch of time
>>>> tracing allocation and dereferencing problems in my NIF code, and I've got
>>>> rid of what seems like 99%+ of the problems - but it still occasionally
>>>> happens, and I'm having trouble tracing further, because the crash logs
>>>> show the crashed threads as doing things like these: (each one taken from a
>>>> separate log where it's the only crashed thread)
>>>>
>>>>
>>>> Thread 40 Crashed:: 8_scheduler
>>>> 0   beam.smp    0x000000001c19980b process_main + 1570
>>>>
>>>> Thread 5 Crashed:: 3_scheduler
>>>> 0   beam.smp    0x000000001c01d80b process_main + 1570
>>>>
>>>> Thread 7 Crashed:: 5_scheduler
>>>> 0   beam.smp    0x000000001baff0b8 lists_member_2 + 63
>>>>
>>>> Thread 3 Crashed:: 1_scheduler
>>>> 0   beam.smp    0x000000001d4b780b process_main + 1570
>>>>
>>>> Thread 5 Crashed:: 3_scheduler
>>>> 0   beam.smp    0x000000001fcf280b process_main + 1570
>>>>
>>>> Thread 6 Crashed:: 4_scheduler
>>>> 0   beam.smp    0x000000001ae290b8 lists_member_2 + 63
>>>>
>>>>
>>>> I'm very confident that the problems are in my code, not in the
>>>> scheduler ;-) But without more detail, I don't know how to trace where
>>>> they're happening. When they do, there are sometimes other threads doing
>>>> things in my code (maybe 20% of the time) - but mostly not, and on the
>>>> occasions when they are, I've been unable to see what the problem might be
>>>> on the lines referenced.
>>>>
>>>> It seems like it's some kind of cross-thread data access issue, but I
>>>> don't know how to track it down.
>>>>
>>>> Some more context about what's going on. My NIF load() function starts
>>>> a thread which passes a callback function to a library that talks to some
>>>> hardware, which calls the callback when it has a message. It's a separate
>>>> thread because the library only calls back to the thread that initialized
>>>> it; when I ran it directly in NIF load(), it didn't call back, but in the
>>>> VM-managed thread, it works as expected. The thread sits and waits for
>>>> stuff to happen, and callbacks come when they should.
>>>>
>>>> I use enif_thread_create/enif_thread_opts_create to start the thread,
>>>> and use enif_alloc/enif_free everywhere. I keep a static pointer in the NIF
>>>> to a couple of members of the state struct, as that seems the only way to
>>>> reference them in the callback function. The struct is kept in NIF private
>>>> data: I pass **priv from load() to the thread_main function, allocate the
>>>> state struct using enif_alloc in thread_main, and set priv pointing to the
>>>> state struct, also in the thread. Other NIF functions do access members of
>>>> the state struct, but only ever through enif_priv_data( env ).
>>>>
>>>> The vast majority of the time it all works perfectly, humming along
>>>> very nicely, but every now and then, without any real pattern I can see, it
>>>> just segfaults and the VM comes down. It's only happened 3 times in the
>>>> last 20+ hours of working on the app, testing & running all the while,
>>>> doing VM starts, stops, code reloads, etc. But when it happens, it's kind
>>>> of a showstopper, and I'd really like to nail it down.
>>>>
>>>> This is all happening in Erlang 20.3.4 on MacOS 10.12.6 / Apple LLVM
>>>> version 9.0.0 (clang-900.0.38).
>>>>
>>>> Any ideas on how/where to look next to try to track this down? Hope
>>>> it's not something structural in the above which just won't work.
>>>>
>>>> Cheers,
>>>> Igor
>>>>
>>>>
>>>> _______________________________________________
>>>> erlang-questions mailing list
>>>> erlang-questions@REDACTED
>>>> http://erlang.org/mailman/listinfo/erlang-questions
>>>>