[erlang-questions] dirty scheduler segfault

Steve Vinoski vinoski@REDACTED
Tue Nov 4 17:55:35 CET 2014


On Tue, Nov 4, 2014 at 11:01 AM, Steve Vinoski <vinoski@REDACTED> wrote:

>
>
> On Tue, Nov 4, 2014 at 9:46 AM, Sverker Eriksson <
> sverker.eriksson@REDACTED> wrote:
>
>>
>> On 10/31/2014 10:05 PM, Steve Vinoski wrote:
>>
>>> On Fri, Oct 31, 2014 at 4:33 PM, Daniel Goertzen <
>>> daniel.goertzen@REDACTED>
>>> wrote:
>>>
>>>> I am seeing a segfault that seems to be related to dirty schedulers.
>>>> I've reduced the fault to the Erlang and C NIF module below, which
>>>> executes the same NIF with either the I/O dirty scheduler, the CPU
>>>> dirty scheduler, or the normal Erlang scheduler.
>>>>
>>>> When I start the emulator and run either dirty NIF, I get a segfault
>>>> (see https://gist.github.com/goertzenator/6237e0200a5f7bf22976).
>>>>
>>> I found it hard to make sense of what's in that gist due to the
>>> formatting, so I took your code and built it myself. When I ran it, it
>>> failed in your NIF load function, but it failed in a way that didn't
>>> make sense because all your function does is return 0. Then I realized
>>> none of your C functions were declared static, which means they are
>>> global, and I suspected your load() function was clashing with some
>>> other function of the same name. I made all your C functions static,
>>> rebuilt, and then ran everything and it seems like it worked:
>>>
>>> 1> c(dlibusb).
>>> Reading symbols for shared libraries . done
>>> {ok,dlibusb}
>>> 2> dlibusb:mytest_cpu().
>>> [ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok]
>>> 3> dlibusb:mytest_io().
>>> [ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok]
>>> 4> dlibusb:mytest_none().
>>> [ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok]
>>>
>>> --steve
>>>
>>>
>>>
>> Run on debug VM and increase 'cnt' in the NIF mytest to something bigger
>> (like 1000) and this will segfault every time.
>>
>> The problem arises when a 0-arity dirty NIF like mytest triggers a GC. The
>> return value from the NIF is not included in the rootset of the GC (as it
>> should be), and the calling Erlang code crashes when it later tries to
>> read deallocated garbage.
>>
>> I did the following fix in init_nif_sched_data() which seems to work.
>>
>>      ep->fp = indirect_fp;
>>      proc->freason = TRAP;
>> +    proc->arity = argc;
>>      return THE_NON_VALUE;
>>  }
>>
>>
>> Not sure if that is always the right thing to do.
>> What do you think, Steve?
>>
>
> Thanks Sverker, glad you were able to reproduce the problem -- I had tried
> and tried but never got it to fail. Increasing the array size makes it
> crash reliably for me too. I'll investigate your proposed fix and will
> probably add a new test for this.
>

Thanks again Sverker, this is definitely the right fix. I've submitted a PR
for this:

https://github.com/erlang/otp/pull/531
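
For anyone following along, here is a rough toy model of why the arity
matters. This is not BEAM source: the names toy_proc, toy_alloc, toy_gc and
result_survives_gc are all made up for illustration, and the only assumption
it encodes is that the collector treats the first `arity` argument registers
as roots. A term that sits in register 0 while arity is 0 is therefore
invisible to the collector and gets freed.

```c
#include <stdbool.h>
#include <stddef.h>

#define HEAP_SIZE 8
#define MAX_REGS  4

/* Toy process (hypothetical, not the real Process struct): a tiny heap
 * of cells plus argument registers that hold heap indices. */
typedef struct {
    bool live[HEAP_SIZE];  /* true while the cell is allocated */
    int  regs[MAX_REGS];   /* heap index held by each register, -1 if empty */
    int  arity;            /* how many registers the collector scans as roots */
} toy_proc;

static void toy_init(toy_proc *p)
{
    for (int i = 0; i < HEAP_SIZE; i++) p->live[i] = false;
    for (int i = 0; i < MAX_REGS; i++)  p->regs[i] = -1;
    p->arity = 0;
}

/* Allocate the first free cell; returns its index, or -1 if the heap is full. */
static int toy_alloc(toy_proc *p)
{
    for (int i = 0; i < HEAP_SIZE; i++) {
        if (!p->live[i]) {
            p->live[i] = true;
            return i;
        }
    }
    return -1;
}

/* Free every cell that is not referenced from regs[0 .. arity-1]. */
static void toy_gc(toy_proc *p)
{
    bool marked[HEAP_SIZE] = { false };

    for (int i = 0; i < p->arity && i < MAX_REGS; i++) {
        if (p->regs[i] >= 0) marked[p->regs[i]] = true;
    }
    for (int i = 0; i < HEAP_SIZE; i++) {
        if (!marked[i]) p->live[i] = false;
    }
}

/* Put a freshly allocated "result" in register 0, collect with the given
 * arity, and report whether the result is still allocated afterwards. */
static bool result_survives_gc(int arity)
{
    toy_proc p;
    toy_init(&p);
    p.regs[0] = toy_alloc(&p);
    p.arity = arity;
    toy_gc(&p);
    return p.live[p.regs[0]];
}
```

In this model, result_survives_gc(0) reports that the result was freed, and
result_survives_gc(1) reports that it was kept, which is the effect of
keeping the arity in step with the number of live argument registers.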

And Daniel, thanks for finding and reporting this. Sorry I couldn't
reproduce it sooner.

--steve