[erlang-questions] How to track down intermittent segfaults in a threaded NIF
Igor Clark
igor.clark@REDACTED
Tue May 29 16:08:02 CEST 2018
Thanks Lukas and Peti, that's great. "erl -emu_type debug" definitely
works. I haven't made the debug build yet, but I get:
erlexec: The emulator '/usr/local/Cellar/erlang/20.3.4/lib/erlang/erts-9.3/bin/beam.debug.smp' does not exist
That is what I'd expect at this stage, since it shows erl is now looking
for the debug emulator. I'll get onto the debug build and see what I can
find out.
In case anyone else wants to use that in rebar3 shell, I found
http://www.rebar3.org/v3.0/discuss/5745fb105528582000dfb47f which shows
that you can either pass -emu_type directly in ERL_FLAGS, or point
ERL_FLAGS at a vm.args file and set the flag in there, e.g.:
ERL_FLAGS=" -args_file config/vm.args -config config/sys.config" rebar3 shell
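For the direct form, a minimal sketch might look like the following. It assumes a debug emulator (beam.debug.smp) has already been built and installed for the Erlang that rebar3 picks up, and that rebar3 is on the PATH; the EMU_FLAGS variable name is just for illustration.

```shell
#!/bin/sh
# Sketch only: assumes beam.debug.smp exists for the active Erlang install.
# EMU_FLAGS is an illustrative name, not something rebar3 or erl looks for.
EMU_FLAGS="-emu_type debug"

# erl reads extra command-line flags from the ERL_FLAGS environment variable.
export ERL_FLAGS="$EMU_FLAGS"
echo "ERL_FLAGS=$ERL_FLAGS"

# Launch the shell as usual once the debug emulator is in place:
# rebar3 shell
```

If the vm.args route is preferred instead, the same flag would presumably go on a line of its own in config/vm.args, as Peti suggests below.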
Cheers,
Igor
On 29/05/2018 14:09, Peti Gömöri wrote:
> Since OTP 20 the -emu_type flag might also work, e.g.:
> erl -emu_type debug
>
> and you can put it in the vm.args file too
>
> On Tue, May 29, 2018 at 2:45 PM, Lukas Larsson <lukas@REDACTED> wrote:
>
> I don't know how to make rebar3 run the debug emulator, but a
> quick and dirty trick that I do when all else fails is to copy the
> beam.debug.smp file over the beam.smp file.
>
> You probably also have to copy the erl_child_setup.debug file,
> that file should however have the .debug suffix remaining. So:
>
> cp bin/`erts/autoconf/config.guess`/beam.debug.smp \
>    path/to/release/erts-v.s.n/bin/beam.smp
> cp bin/`erts/autoconf/config.guess`/erl_child_setup.debug \
>    path/to/release/erts-v.s.n/bin/
>
>
> On Tue, May 29, 2018 at 1:30 PM, Igor Clark <igor.clark@REDACTED> wrote:
>
> Thanks very much Lukas, I think the debug emulator could be
> what I'm looking for. The NIF only sometimes crashes on
> lists:member/2 - those log lines are all from different
> crashes (there's only one crashed thread each time), and
> sometimes it just crashes on process_main. So I think I might
> need the debug emulator to trace further.
>
> However I have a lot to learn about how to integrate C tooling
> with something so complex. When I run the debug emulator, does
> it just show more detailed info in stack traces, or will I
> need to attach gdb/lldb etc to find out what's going on? Is
> there any more info on how to set this all up?
>
> Also, not 100% sure how to run it, as I run my app with
> "rebar3 shell" from a release layout during development, or
> the same inside the NIF-specific app when trying to track
> problems down there. The doc you linked says:
>
>> To start the debug enabled runtime system execute:
>>
>> $ $ERL_TOP/bin/cerl -debug
>
> I realise these are more rebar3 than erlang questions, but I
> can't find much in the rebar3 docs about them:
>
> - How should I specify that rebar3 should run "cerl" instead
> of "erl" ?
>
> - Should I just add "-debug" in my "config/vm.args" or is
> there another way to do this?
>
> Thank you for your help!
> i
>
>
> On 29/05/2018 11:30, Lukas Larsson wrote:
>> Have you tried to run your code in a debug emulator?
>> https://github.com/erlang/otp/blob/master/HOWTO/INSTALL.md#how-to-build-a-debug-enabled-erlang-runtime-system
>>
>>
>> Since it seems to be segfaulting in lists:member/2, I would
>> guess that your nif somehow builds an invalid list that later
>> is used by lists:member/2.
>>
>> On Tue, May 29, 2018 at 11:04 AM, Igor Clark <igor.clark@REDACTED> wrote:
>>
>> Thanks Sergej - that's where I got the thread reports I
>> pasted in below, from e.g.
>> 'beam.smp_2018-05-28-212735_Igor-Clarks-iMac.crash'.
>>
>> Each log says the only crashed thread was a scheduler
>> thread, for example "8_scheduler" running "process_main"
>> in the case of the first one below. This is how I tracked
>> down a bunch of errors in my own code, but the only ones
>> that still happen are in the scheduler, according to the
>> Console crash logs.
>>
>> The thing is, it seems really unlikely that a VM running
>> my NIF code would just happen to be crashing in the
>> scheduler rather than my code(!) - so that's what I'm
>> trying to work out, how to find out what's actually going
>> on, given that the log tells me the crashed thread is
>> running "process_main" or 'lists_member_2'.
>>
>> Any suggestions welcome!
>>
>> Cheers,
>> Igor
>>
>>
>> On 29/05/2018 04:16, Sergej Jurečko wrote:
>>
>> On macOS there is a quick way to get a stack trace if
>> you compiled with debug symbols.
>> Open /Applications/Utilities/Console
>> Go to: User Reports
>>
>> You will see beam.smp in there if it crashed. Click
>> on it and you get a report of what every thread was
>> calling at the time of the crash.
>>
>>
>> Regards,
>> Sergej
>>
>> On 28 May 2018, at 23:46, Igor Clark <igor.clark@REDACTED> wrote:
>>
>> Hi folks, hope all well,
>>
>> I have a NIF which very occasionally segfaults,
>> intermittently and apparently unpredictably,
>> bringing down the VM. I've spent a bunch of time
>> tracing allocation and dereferencing problems in
>> my NIF code, and I've got rid of what seems like
>> 99%+ of the problems - but it still occasionally
>> happens, and I'm having trouble tracing further,
>> because the crash logs show the crashed threads
>> as doing things like these: (each one taken from
>> a separate log where it's the only crashed thread)
>>
>>
>> Thread 40 Crashed:: 8_scheduler
>> 0 beam.smp 0x000000001c19980b process_main + 1570
>>
>> Thread 5 Crashed:: 3_scheduler
>> 0 beam.smp 0x000000001c01d80b process_main + 1570
>>
>> Thread 7 Crashed:: 5_scheduler
>> 0 beam.smp 0x000000001baff0b8 lists_member_2 + 63
>>
>> Thread 3 Crashed:: 1_scheduler
>> 0 beam.smp 0x000000001d4b780b process_main + 1570
>>
>> Thread 5 Crashed:: 3_scheduler
>> 0 beam.smp 0x000000001fcf280b process_main + 1570
>>
>> Thread 6 Crashed:: 4_scheduler
>> 0 beam.smp 0x000000001ae290b8 lists_member_2 + 63
>>
>>
>> I'm very confident that the problems are in my
>> code, not in the scheduler ;-) But without more
>> detail, I don't know how to trace where they're
>> happening. When they do, there are sometimes
>> other threads doing things in my code (maybe 20%
>> of the time) - but mostly not, and on the
>> occasions when they are, I've been unable to see
>> what the problem might be on the lines referenced.
>>
>> It seems like it's some kind of cross-thread data
>> access issue, but I don't know how to track it down.
>>
>> Some more context about what's going on. My NIF
>> load() function starts a thread which passes a
>> callback function to a library that talks to some
>> hardware, which calls the callback when it has a
>> message. It's a separate thread because the
>> library only calls back to the thread that
>> initialized it; when I ran it directly in NIF
>> load(), it didn't call back, but in the
>> VM-managed thread, it works as expected. The
>> thread sits and waits for stuff to happen, and
>> callbacks come when they should.
>>
>> I use enif_thread_create/enif_thread_opts_create
>> to start the thread, and use enif_alloc/enif_free
>> everywhere. I keep a static pointer in the NIF to
>> a couple of members of the state struct, as that
>> seems the only way to reference them in the
>> callback function. The struct is kept in NIF
>> private data: I pass **priv from load() to the
>> thread_main function, allocate the state struct
>> using enif_alloc in thread_main, and set priv
>> pointing to the state struct, also in the thread.
>> Other NIF functions do access members of the
>> state struct, but only ever through
>> enif_priv_data( env ).
>>
>> The vast majority of the time it all works
>> perfectly, humming along very nicely, but every
>> now and then, without any real pattern I can see,
>> it just segfaults and the VM comes down. It's
>> only happened 3 times in the last 20+ hours of
>> working on the app, testing & running all the
>> while, doing VM starts, stops, code reloads, etc.
>> But when it happens, it's kind of a showstopper,
>> and I'd really like to nail it down.
>>
>> This is all happening in Erlang 20.3.4 on MacOS
>> 10.12.6 / Apple LLVM version 9.0.0 (clang-900.0.38).
>>
>> Any ideas on how/where to look next to try to
>> track this down? Hope it's not something
>> structural in the above which just won't work.
>>
>> Cheers,
>> Igor
>>
>>
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions