[erlang-questions] How to track down intermittent segfaults in a threaded NIF

Peti Gömöri gomoripeti@REDACTED
Tue May 29 15:09:04 CEST 2018


Since OTP 20, the *-emu_type* flag might also work, e.g.:
  erl -emu_type debug

You can put it in the vm.args file too.
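
A minimal sketch of the vm.args route, in case it helps. The function name add_debug_emu_type is made up for illustration, and the default path config/vm.args is only an assumption based on the layout mentioned further down in the thread. It appends the flag once and leaves the file alone if the flag is already there:

```shell
#!/bin/sh
# Hypothetical helper: make sure a vm.args file contains "-emu_type debug".
# The default path (config/vm.args) is an assumption, adjust it to your release.
add_debug_emu_type() {
    vm_args="${1:-config/vm.args}"

    # Nothing to do if some -emu_type flag is already set.
    if grep -q '^-emu_type[[:space:]]' "$vm_args" 2>/dev/null; then
        return 0
    fi

    # If the file exists and does not end with a newline, add one first
    # so the new flag does not get glued onto the last existing line.
    if [ -s "$vm_args" ] && [ -n "$(tail -c 1 "$vm_args")" ]; then
        printf '\n' >> "$vm_args"
    fi

    printf '%s\n' '-emu_type debug' >> "$vm_args"
}
```

Calling add_debug_emu_type with no argument edits config/vm.args relative to the current directory. As far as I know the flag only helps if a debug emulator has actually been built for the Erlang installation in use, as described in the INSTALL.md section linked below.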

On Tue, May 29, 2018 at 2:45 PM, Lukas Larsson <lukas@REDACTED> wrote:

> I don't know how to make rebar3 run the debug emulator, but a quick and
> dirty trick that I do when all else fails is to copy the beam.debug.smp
> file over the beam.smp file.
>
> You probably also have to copy the erl_child_setup.debug file. That file,
> however, should keep its .debug suffix. So:
>
> cp bin/`erts/autoconf/config.guess`/beam.debug.smp \
>    path/to/release/erts-v.s.n/bin/beam.smp
> cp bin/`erts/autoconf/config.guess`/erl_child_setup.debug \
>    path/to/release/erts-v.s.n/bin/
>
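
Since the trick above overwrites beam.smp in the release, here is a rough sketch that keeps a backup first so the swap can be undone. The function name install_debug_emulator and the .orig backup suffix are my own inventions, and both directories are passed in as arguments rather than guessed:

```shell
#!/bin/sh
# Hypothetical wrapper around the copy trick quoted above.
#   $1: directory holding the debug build (beam.debug.smp, erl_child_setup.debug)
#   $2: the release's erts bin directory (the one containing beam.smp)
install_debug_emulator() {
    src="$1"
    dst="$2"

    # Refuse to touch the release if either debug binary is missing.
    for f in beam.debug.smp erl_child_setup.debug; do
        if [ ! -f "$src/$f" ]; then
            echo "missing $src/$f" >&2
            return 1
        fi
    done

    # Keep the original emulator, but never overwrite an existing backup
    # (on a second run beam.smp is already the debug build).
    if [ -f "$dst/beam.smp" ] && [ ! -e "$dst/beam.smp.orig" ]; then
        cp -p "$dst/beam.smp" "$dst/beam.smp.orig" || return 1
    fi

    # The debug emulator takes the place of beam.smp ...
    cp "$src/beam.debug.smp" "$dst/beam.smp" || return 1
    # ... while erl_child_setup.debug keeps its .debug suffix.
    cp "$src/erl_child_setup.debug" "$dst/" || return 1
}
```

To go back to the normal emulator, copy beam.smp.orig over beam.smp again.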
>
> On Tue, May 29, 2018 at 1:30 PM, Igor Clark <igor.clark@REDACTED> wrote:
>
>> Thanks very much Lukas, I think the debug emulator could be what I'm
>> looking for. The NIF only sometimes crashes on lists:member/2 - those log
>> lines are all from different crashes (there's only one crashed thread each
>> time), and sometimes it just crashes on process_main. So I think I might
>> need the debug emulator to trace further.
>>
>> However I have a lot to learn about how to integrate C tooling with
>> something so complex. When I run the debug emulator, does it just show more
>> detailed info in stack traces, or will I need to attach gdb/lldb etc to
>> find out what's going on? Is there any more info on how to set this all up?
>>
>> Also, not 100% sure how to run it, as I run my app with "rebar3 shell"
>> from a release layout during development, or the same inside the
>> NIF-specific app when trying to track problems down there. The doc you
>> linked says:
>>
>> To start the debug enabled runtime system execute:
>>
>> $ $ERL_TOP/bin/cerl -debug
>>
>>
>> I realise these are more rebar3 than erlang questions, but I can't find
>> much in the rebar3 docs about them:
>>
>> - How should I specify that rebar3 should run "cerl" instead of "erl" ?
>>
>> - Should I just add "-debug" in my "config/vm.args" or is there another
>> way to do this?
>>
>> Thank you for your help!
>> i
>>
>>
>> On 29/05/2018 11:30, Lukas Larsson wrote:
>>
>> Have you tried to run your code in a debug emulator?
>> https://github.com/erlang/otp/blob/master/HOWTO/INSTALL.md#how-to-build-a-debug-enabled-erlang-runtime-system
>>
>> Since it seems to be segfaulting in lists:member/2, I would guess that
>> your nif somehow builds an invalid list that later is used by
>> lists:member/2.
>>
>> On Tue, May 29, 2018 at 11:04 AM, Igor Clark <igor.clark@REDACTED>
>> wrote:
>>
>>> Thanks Sergej - that's where I got the thread reports I pasted in below,
>>> from e.g. 'beam.smp_2018-05-28-212735_Igor-Clarks-iMac.crash'.
>>>
>>> Each log says the only crashed thread was a scheduler thread, for
>>> example "8_scheduler" running "process_main" in the case of the first one
>>> below. This is how I tracked down a bunch of errors in my own code, but the
>>> only ones that still happen are in the scheduler, according to the Console
>>> crash logs.
>>>
>>> The thing is, it seems really unlikely that a VM running my NIF code
>>> would just happen to be crashing in the scheduler rather than my code(!) -
>>> so that's what I'm trying to work out, how to find out what's actually
>>> going on, given that the log tells me the crashed thread is running
>>> "process_main" or 'lists_member_2'.
>>>
>>> Any suggestions welcome!
>>>
>>> Cheers,
>>> Igor
>>>
>>>
>>> On 29/05/2018 04:16, Sergej Jurečko wrote:
>>>
>>>> On macOS there is a quick way to get a stack trace if you compiled with
>>>> debug symbols.
>>>> Open /Applications/Utilities/Console
>>>> Go to: User Reports
>>>>
>>>> You will see beam.smp in there if it crashed. Click on it and you get a
>>>> report of what every thread was calling at the time of the crash.
>>>>
>>>>
>>>> Regards,
>>>> Sergej
>>>>
>>>> On 28 May 2018, at 23:46, Igor Clark <igor.clark@REDACTED> wrote:
>>>>>
>>>>> Hi folks, hope all well,
>>>>>
>>>>> I have a NIF which very occasionally segfaults, intermittently and
>>>>> apparently unpredictably, bringing down the VM. I've spent a bunch of time
>>>>> tracing allocation and dereferencing problems in my NIF code, and I've got
>>>>> rid of what seems like 99%+ of the problems - but it still occasionally
>>>>> happens, and I'm having trouble tracing further, because the crash logs
>>>>> show the crashed threads as doing things like these: (each one taken from a
>>>>> separate log where it's the only crashed thread)
>>>>>
>>>>>
>>>>>> Thread 40 Crashed:: 8_scheduler
>>>>>> 0   beam.smp                          0x000000001c19980b process_main + 1570
>>>>>>
>>>>>> Thread 5 Crashed:: 3_scheduler
>>>>>> 0   beam.smp                          0x000000001c01d80b process_main + 1570
>>>>>>
>>>>>> Thread 7 Crashed:: 5_scheduler
>>>>>> 0   beam.smp                          0x000000001baff0b8 lists_member_2 + 63
>>>>>>
>>>>>> Thread 3 Crashed:: 1_scheduler
>>>>>> 0   beam.smp                          0x000000001d4b780b process_main + 1570
>>>>>>
>>>>>> Thread 5 Crashed:: 3_scheduler
>>>>>> 0   beam.smp                          0x000000001fcf280b process_main + 1570
>>>>>>
>>>>>> Thread 6 Crashed:: 4_scheduler
>>>>>> 0   beam.smp                          0x000000001ae290b8 lists_member_2 + 63
>>>>>>
>>>>>
>>>>> I'm very confident that the problems are in my code, not in the
>>>>> scheduler ;-) But without more detail, I don't know how to trace where
>>>>> they're happening. When they do, there are sometimes other threads doing
>>>>> things in my code (maybe 20% of the time) - but mostly not, and on the
>>>>> occasions when they are, I've been unable to see what the problem might be
>>>>> on the lines referenced.
>>>>>
>>>>> It seems like it's some kind of cross-thread data access issue, but I
>>>>> don't know how to track it down.
>>>>>
>>>>> Some more context about what's going on. My NIF load() function starts
>>>>> a thread which passes a callback function to a library that talks to some
>>>>> hardware, which calls the callback when it has a message. It's a separate
>>>>> thread because the library only calls back to the thread that initialized
>>>>> it; when I ran it directly in NIF load(), it didn't call back, but in the
>>>>> VM-managed thread, it works as expected. The thread sits and waits for
>>>>> stuff to happen, and callbacks come when they should.
>>>>>
>>>>> I use enif_thread_create/enif_thread_opts_create to start the thread,
>>>>> and use enif_alloc/enif_free everywhere. I keep a static pointer in the NIF
>>>>> to a couple of members of the state struct, as that seems the only way to
>>>>> reference them in the callback function. The struct is kept in NIF private
>>>>> data: I pass **priv from load() to the thread_main function, allocate the
>>>>> state struct using enif_alloc in thread_main, and set priv pointing to the
>>>>> state struct, also in the thread. Other NIF functions do access members of
>>>>> the state struct, but only ever through enif_priv_data( env ).
>>>>>
>>>>> The vast majority of the time it all works perfectly, humming along
>>>>> very nicely, but every now and then, without any real pattern I can see, it
>>>>> just segfaults and the VM comes down. It's only happened 3 times in the
>>>>> last 20+ hours of working on the app, testing & running all the while,
>>>>> doing VM starts, stops, code reloads, etc. But when it happens, it's kind
>>>>> of a showstopper, and I'd really like to nail it down.
>>>>>
>>>>> This is all happening in Erlang 20.3.4 on MacOS 10.12.6 / Apple LLVM
>>>>> version 9.0.0 (clang-900.0.38).
>>>>>
>>>>> Any ideas on how/where to look next to try to track this down? Hope
>>>>> it's not something structural in the above which just won't work.
>>>>>
>>>>> Cheers,
>>>>> Igor
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> erlang-questions mailing list
>>>>> erlang-questions@REDACTED
>>>>> http://erlang.org/mailman/listinfo/erlang-questions
>>>>>
>>>>
>>
>>
>>
>