[erlang-questions] How to track down intermittent segfaults in a threaded NIF

Igor Clark igor.clark@REDACTED
Fri Jun 1 01:31:51 CEST 2018


Hey again - just closing the loop here. I had another even less frequent 
bug that was still triggering erts_exit(ERTS_ABORT_EXIT, ...), only 
happening once in a blue moon, so I fired up guard malloc with the "+Mea 
min" option, and this time not only did it give an immediate crash, but 
it also gave me the exact line number :-))

I can't tell you how useful this is - I've had these intermittent 
problems only showing up every now and then for a good while, with no 
real way to track them down, and so they've been grumbling at the back 
of my mind all that time. So the combination of the erl option and 
libgmalloc is just an amazing tool for me to hunt down this kind of 
issue. Thanks so much for telling me about it!

Very best,
Igor

On 30/05/2018 10:19, Igor Clark wrote:
> Thanks Dominic - I don't want to count my chickens before they've 
> hatched, but it looks like guard malloc has pointed me to at least 
> some bugs even without that VM option. Even though I wasn't getting a 
> line number in the stack trace, it was already seeming to make the NIF 
> crash immediately and consistently, so I was able to use a ton of 
> debug print statements to track down two problems that I hadn't been 
> able to see before. (One was an enif_alloc() in the wrong place, and 
> another seems to have been accessing a pointer from a function in a 
> shared object file, oops.) No way would I have seen them without guard 
> malloc showing me the way, it's a powerful tool :-)
>
> So I fixed those two, and right now the app is running as expected 
> without crashes under guard malloc. I'm pretty sure that I'll come up 
> against more illegal-access bugs over time, so I'm adding "+Mea min" 
> to the list of options to use when I find the next one. Thank you.
>
> Thanks very much also to everyone who replied, particularly Scott for 
> the guard malloc suggestion & help, and Fred & Tristan for the rebar3 
> tips so I could add the necessary CLI options and track down what was 
> going on. I'm very glad to have been able to ask such experienced 
> folks for advice, and to have learned about some *extremely* useful 
> new stuff.
>
> Cheers,
> Igor
>
>
> On 29/05/2018 23:58, Dominic Morneau wrote:
>> Can you give it a try with "+Mea min" in erl options? This should 
>> make Erlang fall back to malloc for all allocators, hopefully making 
>> guard malloc more effective.
>>
>> Dominic
>>
>> On Wed, 30 May 2018 at 5:15, Igor Clark <igor.clark@REDACTED> wrote:
>>
>>     OK. Thanks very much Scott. I've got all this working using both
>>     those extra options, and it does seem to make the NIF crash a lot
>>     sooner than previously, which is great. But I'm still only seeing
>>     "process_main" in the crashed thread, so I'm not much closer to
>>     knowing where the illegal access is. I wonder if it's in lots of
>>     places because of what I'm doing with the callback and the thread.
>>     I hope not.
>>
>>     I'll do some more digging, and tomorrow I'll try out a debug
>>     emulator build as well.
>>
>>     Thanks very much for helping me get this far!
>>
>>     On 29/05/2018 16:31, Scott Ribe wrote:
>>     >> On May 29, 2018, at 9:16 AM, Igor Clark <igor.clark@REDACTED> wrote:
>>     >>
>>     >> So, do I have this right: the point of the Guard Malloc is to
>>     >> make the crash happen at the time of allocation, rather than
>>     >> delayed until something trying to access it triggers the
>>     >> segfault; so if I get a crash while running like this, I should
>>     >> be able to just check in the Console debug log, and the stack
>>     >> trace should show where the bug actually is?
>>     > At the time of the illegal access, not the allocation. Yes,
>>     > that's the point, you get a stack trace showing you illegal access.
>>     >
>>     > However, the BEAM allocator will reduce its effectiveness. When
>>     > you malloc in your C code, you get a block set up such that
>>     > accessing just past it (or potentially before it) will cause an
>>     > immediate crash. When you free it, it's then set up such that
>>     > accessing will cause an immediate crash. But if you use Erlang's
>>     > allocation routines, Erlang may malloc a bigger block with those
>>     > protections, then hand out multiple suballocations, and access
>>     > beyond the end of one of those can simply corrupt the next one
>>     > without crashing at that point.
>>     >
>>     > You should also be using MallocScribble & MallocPreScribble.
>>     >
>>     >
>>     >
>>
>>
>
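
To make Scott's point about suballocation concrete, here is a minimal C
sketch. It is not how erts_alloc works internally; arena_t, arena_alloc
and the other names are made up for illustration, and the "allocator" is
just a toy bump allocator over a single malloc'd block. The idea it shows
is the one from the thread: a guarding malloc can only protect the edges
of the big block, so an overrun from one suballocation into the next
stays inside memory that malloc considers valid and nothing faults.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical bump allocator standing in for a suballocating runtime:
 * one malloc'd carrier, many small blocks carved out of it. */
typedef struct {
    unsigned char *base; /* start of the single malloc'd carrier */
    size_t size;         /* total carrier size in bytes */
    size_t used;         /* bytes already handed out */
} arena_t;

static int arena_init(arena_t *a, size_t size) {
    a->base = malloc(size); /* the only block a guarding malloc can see */
    a->size = size;
    a->used = 0;
    return a->base != NULL;
}

static void *arena_alloc(arena_t *a, size_t n) {
    if (n > a->size - a->used) {
        return NULL; /* carrier exhausted */
    }
    void *p = a->base + a->used;
    a->used += n;
    return p;
}

static void arena_destroy(arena_t *a) {
    free(a->base);
    a->base = NULL;
    a->size = 0;
    a->used = 0;
}

/* Returns 1 if overrunning the first block silently changed the second,
 * 0 if it did not, and -1 if the carrier could not be allocated. */
static int overflow_corrupts_neighbour(void) {
    arena_t a;
    if (!arena_init(&a, 64)) {
        return -1;
    }
    unsigned char *first = arena_alloc(&a, 16);
    unsigned char *second = arena_alloc(&a, 16);
    memset(second, 0xAA, 16);

    /* The bug: 17 bytes written into a 16-byte suballocation. The extra
     * byte is still inside the 64-byte carrier, so as far as malloc is
     * concerned this is a legal write and no guard page is touched. */
    memset(first, 0x00, 17);

    int corrupted = (second[0] != 0xAA);
    arena_destroy(&a);
    return corrupted;
}
```

Calling overflow_corrupts_neighbour() returns 1: the neighbouring block
was overwritten and nothing crashed. If each block came from its own
malloc call instead, which is roughly what falling back to malloc with
"+Mea min" is meant to achieve, a guarding malloc would have a chance to
fault on the stray write itself.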
