[erlang-questions] dirty scheduler segfault

Steve Vinoski vinoski@REDACTED
Tue Nov 4 17:01:56 CET 2014


On Tue, Nov 4, 2014 at 9:46 AM, Sverker Eriksson <
sverker.eriksson@REDACTED> wrote:

>
> On 10/31/2014 10:05 PM, Steve Vinoski wrote:
>
>> On Fri, Oct 31, 2014 at 4:33 PM, Daniel Goertzen <
>> daniel.goertzen@REDACTED>
>> wrote:
>>
>>> I am seeing a segfault that seems to be related to dirty schedulers. I've
>>> reduced the fault to the erlang and C nif module below which executes the
>>> same nif with either the io dirty scheduler, the cpu dirty scheduler, or
>>> the normal erlang scheduler.
>>>
>>>
>>> When I start the emulator and run either dirty nif, I get a segfault
>>> (see https://gist.github.com/goertzenator/6237e0200a5f7bf22976).
>>>
>> I found it hard to make sense of what's in that gist due to the formatting,
>> so I took your code and built it myself. When I ran it, it failed in your
>> NIF load function, but it failed in a way that didn't make sense because
>> all your function does is return 0. Then I realized none of your C
>> functions were declared static, which means they are global, and I
>> suspected your load() function was clashing with some other function of the
>> same name. I made all your C functions static, rebuilt, and then ran
>> everything and it seems like it worked:
>>
>> 1> c(dlibusb).
>> Reading symbols for shared libraries . done
>> {ok,dlibusb}
>> 2> dlibusb:mytest_cpu().
>> [ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok]
>> 3> dlibusb:mytest_io().
>> [ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok]
>> 4> dlibusb:mytest_none().
>> [ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok]
>>
>> --steve
>>
>>
>>
> Run on a debug VM and increase 'cnt' in the NIF mytest to something bigger
> (like 1000) and this will segfault every time.
>
> The problem arises when a 0-arity dirty NIF like mytest triggers a GC. The
> return value from the NIF is not included in the rootset of the GC (as it
> should be), and the calling Erlang code crashes when it later tries to read
> deallocated garbage.
>
> I did the following fix in init_nif_sched_data() which seems to work.
>
>      ep->fp = indirect_fp;
>      proc->freason = TRAP;
> +    proc->arity = argc;
>      return THE_NON_VALUE;
>  }
>
>
> Not sure if that is always the right thing to do.
> What do you think, Steve?
>

Thanks Sverker, glad you were able to reproduce the problem -- I've tried
and tried but have never gotten it to fail. Increasing the array size also
makes it reliably crash for me. I'll investigate your proposed fix and will
probably add a new test for this.

--steve
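
For readers following the thread, here is a rough toy model in C of the
failure Sverker describes: a collector that treats only the first 'arity'
registers as roots will free a value sitting in a register beyond that count.
Nothing below is ERTS code. ToyProc, toy_gc and the other names are
hypothetical, and the assumption that the NIF result sits in the first
register of a trap taking one argument is made purely for illustration; the
real emulator may lay things out differently.

```c
#include <assert.h>
#include <stdbool.h>

#define HEAP_CELLS 8
#define NUM_XREGS  4
#define NO_TERM    (-1)

/* Hypothetical toy process, NOT the ERTS Process struct. */
typedef struct {
    bool allocated[HEAP_CELLS]; /* which heap cells are currently in use */
    int  xreg[NUM_XREGS];       /* each register holds a cell index or NO_TERM */
    int  arity;                 /* number of registers the collector scans */
} ToyProc;

void toy_init(ToyProc *p) {
    for (int i = 0; i < HEAP_CELLS; i++) p->allocated[i] = false;
    for (int i = 0; i < NUM_XREGS; i++)  p->xreg[i] = NO_TERM;
    p->arity = 0;
}

/* Allocate the first free cell; returns NO_TERM if the toy heap is full. */
int toy_alloc(ToyProc *p) {
    for (int i = 0; i < HEAP_CELLS; i++) {
        if (!p->allocated[i]) {
            p->allocated[i] = true;
            return i;
        }
    }
    return NO_TERM;
}

/* Free every cell that is not referenced from xreg[0 .. arity-1]. */
void toy_gc(ToyProc *p) {
    bool live[HEAP_CELLS] = { false };
    assert(p->arity >= 0 && p->arity <= NUM_XREGS);
    for (int i = 0; i < p->arity; i++) {
        if (p->xreg[i] != NO_TERM) live[p->xreg[i]] = true;
    }
    for (int c = 0; c < HEAP_CELLS; c++) {
        if (!live[c]) p->allocated[c] = false;
    }
}

/* Assumed scenario: a 0-arity call leaves arity at 0, the result is then
 * placed in xreg[0], and a GC runs. set_arity is the toy analogue of the
 * one-line "proc->arity = argc" patch. Returns true if the result cell is
 * still allocated after the collection. */
bool toy_result_survives_gc(bool set_arity) {
    ToyProc p;
    toy_init(&p);
    int result = toy_alloc(&p);
    p.xreg[0] = result;          /* result handed over in the first register */
    if (set_arity) p.arity = 1;  /* tell the collector that xreg[0] is live */
    toy_gc(&p);
    return p.allocated[result];
}
```

In this model toy_result_survives_gc(false) reports the result cell as freed,
while toy_result_survives_gc(true) reports it as kept. That is only the shape
of the problem, not a statement about what the actual fix should be.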