[erlang-bugs] NIF .so reload issues

Sverker Eriksson sverker.eriksson@REDACTED
Fri Dec 18 18:55:27 CET 2015



On 12/18/2015 06:07 PM, David Buckley wrote:
> On Fri, Dec 18, 2015 at 05:41:24PM +0100, Sverker Eriksson wrote:
>> Hi David,
>>
>> Yes, this is a dlopen restriction and also an ambiguity as I've heard
>> different behaviour reported depending on OS.
>>
>> My Linux man page for dlopen says "If the same library is loaded again with
>> dlopen(),
>> the same file handle is returned". But it does not specify what "the same"
>> actually means.
>>
>> The Erlang VM has to keep the old .so file loaded until the module is safely
>> purged [*]
>> as there may exist Erlang processes still lingering in the old code. Trying
>> to execute
>> unloaded native code does not behave well.
>>
>> When you call load_nif with the same library name (as the
>> not yet purged one), dlopen thinks it's "the same" library
>> and just returns the same handle again.
>>
>> What to do?
>>
>> Rename the .so library, give it a version number. Or maybe
>> putting it in a different directory will work (?).
>>
>> Add something about this problem to the erl_nif docs. Yes that would be
>> nice.
>>
>> I'm hesitant to recommend purging in on_load. The on_load feature
>> is still experimental and we have some known problems with bad
>> behaviour, especially in the error cases when on_load fails.
>> To fix that we may have to limit what you are allowed
>> to do in on_load and code purging might be such a limitation.
>>
>>
>> [*] Purging may actually not be enough. If the NIF library has created
>> resource objects with a destructor callback, it will not be unloaded until
>> the last resource object has been garbage collected.
> Hmmm, I was going to create resources!
>
> I guess for development I'll add a hack that just creates a link to the
> file with a temporary name before loading it, so that a new handle to it
> is created each time.
I'm not sure dlopen is fooled by a link. You may need to make a real copy.

> There /is/ a secret RTLD_PRIVATE flag for dlopen
> -- that is not apparently supported on any OS mentioned on the first
> page of google -- to get a private instance.
>
> For production, versioning the library code ought to be fine. Most
> system libraries already contain version numbers in the filename, and I
> suppose this is part of why. It's only reloading for rapid development
> that is causing pain here!
>
> Is the old dlopen bound to the old (Erlang) code?
Yes. A NIF library is like an extension of the Erlang code
that loaded it.

> That is, if I
> instigate this hack, and leak resources somehow while reloading often,
> will I have problems reloading the module, cause processes to be
> violently uprooted as with purge, or simply leak dlopen handles until I
> clean up?
If you leak resources then you will also leak the loaded libraries
that contain the destructor functions of those resources.

However, there is a way for your upgraded NIF library to take over
ownership of existing resources by passing the ERL_NIF_RT_TAKEOVER
flag to enif_open_resource_type(). By doing that, the destructor in your
new library will be called instead and the old library can be unloaded
when the module is purged. Your new library versions must of course
be data compatible and know how to handle old resources.
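
A rough sketch of what that could look like in the load and upgrade callbacks.
The module name my_nif, the resource name "my_res", the my_res_t struct and
the noop function are placeholders invented for the example. It builds as a
shared library against erl_nif.h and is not a standalone program.

```c
#include <erl_nif.h>

/* Placeholder payload; its layout must stay compatible between versions. */
typedef struct {
    int fd;
} my_res_t;

static ErlNifResourceType *my_res_type;

/* Destructor for "my_res" objects. After a takeover, this version is the one
 * that runs, including for objects created by the old library. */
static void my_res_dtor(ErlNifEnv *env, void *obj)
{
    my_res_t *res = obj;
    (void)env;
    (void)res; /* release whatever res owns here */
}

/* CREATE covers the first load, TAKEOVER covers an upgrade where the type
 * already exists. `tried` reports which of the two was attempted. */
static int open_types(ErlNifEnv *env)
{
    ErlNifResourceFlags tried;

    my_res_type = enif_open_resource_type(
        env, NULL, "my_res", my_res_dtor,
        ERL_NIF_RT_CREATE | ERL_NIF_RT_TAKEOVER, &tried);
    return my_res_type == NULL ? -1 : 0;
}

static int load(ErlNifEnv *env, void **priv_data, ERL_NIF_TERM load_info)
{
    (void)priv_data;
    (void)load_info;
    return open_types(env);
}

static int upgrade(ErlNifEnv *env, void **priv_data, void **old_priv_data,
                   ERL_NIF_TERM load_info)
{
    (void)priv_data;
    (void)old_priv_data;
    (void)load_info;
    return open_types(env);
}

static ERL_NIF_TERM noop(ErlNifEnv *env, int argc, const ERL_NIF_TERM argv[])
{
    (void)argc;
    (void)argv;
    return enif_make_atom(env, "ok");
}

static ErlNifFunc nif_funcs[] = {
    {"noop", 0, noop}
};

ERL_NIF_INIT(my_nif, nif_funcs, load, NULL, upgrade, NULL)
```

Opening the type with both flags in one helper keeps load and upgrade
identical, so the same library works whether or not an older version is
already loaded.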

>
> Is there any chance of purge/soft_purge being extended to cover nif
> resources?
>
Oh, that's a good question. Why don't we do that already?
I have to think about that.


/Sverker, Erlang/OTP




