[erlang-questions] How to track down intermittent segfaults in a threaded NIF

Igor Clark igor.clark@REDACTED
Wed May 30 11:19:06 CEST 2018


Thanks Dominic - I don't want to count my chickens before they've 
hatched, but it looks like guard malloc has pointed me to at least some 
bugs even without that VM option. Even though I wasn't getting a line 
number in the stack trace, guard malloc already seemed to make the NIF
crash immediately and consistently, so I was able to use a ton of debug
print statements to track down two problems that I hadn't been able to
see before. (One was an enif_alloc() in the wrong place, and the other seems 
to have been accessing a pointer from a function in a shared object 
file, oops.) No way would I have seen them without guard malloc showing 
me the way, it's a powerful tool :-)

So I fixed those two, and right now the app is running as expected 
without crashes under guard malloc. I'm pretty sure that I'll come up 
against more illegal-access bugs over time, so I'm adding "+Mea min" to 
the list of options to use when I find the next one. Thank you.

Thanks very much also to everyone who replied, particularly Scott for 
the guard malloc suggestion & help, and Fred & Tristan for the rebar3 
tips so I could add the necessary CLI options and track down what was 
going on. I'm very glad to have been able to ask such experienced folks 
for advice, and to have learned about some *extremely* useful new stuff.

Cheers,
Igor


On 29/05/2018 23:58, Dominic Morneau wrote:
> Can you give it a try with "+Mea min" in erl options? This should make 
> Erlang fall back to malloc for all allocators, hopefully making guard 
> malloc more effective.
>
> Dominic
>
> On Wed, 30 May 2018 at 5:15, Igor Clark <igor.clark@REDACTED> wrote:
>
>     OK. Thanks very much Scott. I've got all this working using both
>     those extra options, and it does seem to make the NIF crash a lot
>     sooner than previously, which is great. But I'm still only seeing
>     "process_main" in the crashed thread, so I'm not much closer to
>     knowing where the illegal access is. I wonder if it's in lots of
>     places because of what I'm doing with the callback and the thread.
>     I hope not.
>
>     I'll do some more digging, and tomorrow I'll try out a debug emulator
>     build as well.
>
>     Thanks very much for helping me get this far!
>
>     On 29/05/2018 16:31, Scott Ribe wrote:
>     >> On May 29, 2018, at 9:16 AM, Igor Clark <igor.clark@REDACTED> wrote:
>     >>
>     >> So, do I have this right: the point of the Guard Malloc is to
>     >> make the crash happen at the time of allocation, rather than
>     >> delayed until something trying to access it triggers the
>     >> segfault; so if I get a crash while running like this, I should
>     >> be able to just check in the Console debug log, and the stack
>     >> trace should show where the bug actually is?
>     >
>     > At the time of the illegal access, not the allocation. Yes,
>     > that's the point, you get a stack trace showing you illegal access.
>     >
>     > However, the BEAM allocator will reduce its effectiveness. When
>     > you malloc in your C code, you get a block set up such that
>     > accessing just past it (or potentially before it) will cause an
>     > immediate crash. When you free it, it's then set up such that
>     > accessing will cause an immediate crash. But if you use Erlang's
>     > allocation routines, Erlang may malloc a bigger block with those
>     > protections, then hand out multiple suballocations, and access
>     > beyond the end of one of those can simply corrupt the next one
>     > without crashing at that point.
>     >
>     > You should also be using MallocScribble & MallocPreScribble.
