BUG: fatal interaction between application:ensure_all_started(A) and permit(B, false)

Ulf Wiger ulf@REDACTED
Tue Mar 23 13:45:07 CET 2021


Permissions are a bit tricky. :)

The issue I encountered was the use of `application:ensure_all_started()`
in `rebar3 shell`, which I think is perfectly appropriate use of the
function.
But if that use is supposed to mimic the starting of applications via a
boot script, then it should either return `ok` (counter-intuitive), or hang
until the permission flag(s) turn to true.

The `application:close()` function should definitely remove the
corresponding entries in `start_p_false`, and if there are such entries,
not complain that the app hasn't been started.

Basically, once an application has been started, the permission flag should
toggle it on or off - if you call `permit(A, false)` on running application
A, it should be stopped. A subsequent call to `permit(A, true)` should
start it again.

This is why I think it might be better for `ensure_all_started()` to hang
rather than returning an error tuple. Alternatively, that a new function,
`await_all_started()` does this, if that's what's desired.
The question then becomes which one 'rebar3 shell` should use...

BR,
Ulf W

On Tue, Mar 23, 2021 at 1:13 PM Fred Hebert <mononcqc@REDACTED> wrote:

> I initially implemented `application:ensure_all_started/1,2' because it
> made for an easier get-started mechanism than getting a full blown release,
> and avoided having people do the annoying "try until it works" garbage
> routine by hand (which the function is now doing), or having to just line
> up all the start calls at the beginning of a test run.
>
> I would be in favour of just letting it fail and return back {error, {b,
> {permit, false}}} -- there are more variations we could use, but this term
> allows to:
> - specifically mention it's b that isn't allowed to start
> - uses the 'permit' keyword that matches the function name, such that
> googling something like "erlang permit false" will yield things such as the
> doc page for application:permit/2 (or this thread, given how google
> indexing works)
>
> That being said, it does not line up with the current API usage, which
> would probably need some fixing of some sort? Look at this session:
>
> 1> application:load(ssl).
> ok
> 2> application:permit(ssl, false).
> ok
> 3> application:ensure_all_started(crypto).
> {ok,[crypto]}
> 4> application:ensure_all_started(public_key).
> {ok,[asn1,public_key]}
> 5> application:start(ssl).
> ok
> 6> application:stop(ssl).
> {error,{not_started,ssl}}
>
> Starting the application "works" even if it does not with permissions; the
> call silently fails while returning it succeeded. Since the current
> documentation for ensure_all_started states the following:
>
>> The function reports {error, {AppName,Reason}} for errors, where Reason
>> is any possible reason returned by start/1,2
>> <http://erlang.org/doc/apps/kernel/application.html#start-1> when
>> starting a specific dependency.
>>
> We are in a situation where ensure_all_started can't or won't know that
> permissions are at play (I didn't when I implemented it), and it appears
> that we would need to do an explicit permission check before each run to
> properly return the error explaining the issue without breaking the
> compatibility of start/1,2. The gotcha here is that there's apparently no
> way to access this status aside from application_controller internals that
> would need some extra visibility, which start/1,2 calls on its own as well.
>
> Regards,
> Fred.
>
> On Mon, Mar 22, 2021 at 1:06 PM Ulf Wiger <ulf@REDACTED> wrote:
>
>> Hmm, trying some more with OTP 24, it addresses the problem with the
>> memory growth, but still isn't permission-aware.
>>
>> Consider test apps a and b, where a depends on b.
>>
>> 15> application:permit(b,false).
>> ok
>> 16> application:ensure_all_started(a).
>> {error,{a,{not_started,b}}}
>> 17> application:which_applications().
>> [{stdlib,"ERTS  CXC 138 10","3.15"},
>>  {kernel,"ERTS  CXC 138 10","8.0"}]
>> 18> application:permit(b,true).
>> ok
>> 19> application:which_applications().
>> [{b,"test app","0.1"},
>>  {stdlib,"ERTS  CXC 138 10","3.15"},
>>  {kernel,"ERTS  CXC 138 10","8.0"}]
>>
>> The call to application:ensure_all_started(a) fails, and supposedly all
>> child apps that were started will have been stopped again, and it does look
>> that way.
>>
>> But if we later permit b to run, it turns out that the start request
>> wasn't actually removed, and b pops up.
>>
>> This is for sure a much less serious problem than the previous one.
>>
>> However, I'm not sure if returning error is actually the right thing to
>> do there. The call SHOULD probably hang.
>>
>> Comments?
>>
>> BR,
>> Ulf W
>>
>> On Mon, Mar 22, 2021 at 1:23 PM Ulf Wiger <ulf@REDACTED> wrote:
>>
>>> When I started looking closer into this, it would appear as if there is
>>> a long-standing bug in the application_controller regarding permissions.
>>>
>>> And with "long-standing" I mean that it was there even when Kostis did
>>> some Tidier-based cleanup 11 years ago. Kostis didn't introduce it, though.
>>>
>>> When servicing a start request, the application_controller, if
>>> permission(App) == false, adds a new entry to the `start_p_false` list,
>>> i.e. a new entry for each request.
>>>
>>> https://github.com/erlang/otp/blob/master/lib/kernel/src/application_controller.erl#L689-L690
>>>
>>> ... but when servicing a subsequent {permit_application, App, true}, it
>>> uses lists:keydelete/3 to remove the App from the `start_p_false` list.
>>>
>>> https://github.com/erlang/otp/blob/master/lib/kernel/src/application_controller.erl#L759-L761
>>>
>>> lists:keydelete/3 obviously only removes the first matching entry.
>>>
>>> Earlier in that function, it also only locates the first pending request
>>> (or rather, chronologically the last), and uses the `From` in
>>> `spawn_starter()'.
>>>
>>> The rest of the pending requests should be handled somewhere - likely in
>>> `handle_application_started/3`, but aren't.
>>>
>>> BR,
>>> Ulf W
>>>
>>> On Sat, Mar 20, 2021 at 1:45 PM Mikael Pettersson <mikpelinux@REDACTED>
>>> wrote:
>>>
>>>> On Sat, Mar 20, 2021 at 8:46 AM Ulf Wiger <ulf@REDACTED> wrote:
>>>> >
>>>> > I had the brilliant idea of using application permissions for a
>>>> particular use case. This seemed to work perfectly, until I ran `rebar3
>>>> shell`, and spotted some disturbing behavior.
>>>> >
>>>> > The bug, apparently, lies in that `application:ensure_all_started(A)`
>>>> ends up busy-looping if A depends on B, and permission(B) -> false. What's
>>>> worse, for each call to start(B), the application controller notices the
>>>> permission flag, returns `ok` and inserts an entry in its internal
>>>> `start_p_false` list. This amounts to a memory leak.
>>>> >
>>>> > I commented it in a tweet, then decided to try to find the source,
>>>> esp. since I suspected `application:ensure_all_started/1`.
>>>> >
>>>> > https://twitter.com/uwiger/status/1372944356781531136
>>>> >
>>>> > In short, if permission(B) -> false, what happens is:
>>>> > start(A) -> {error, {not_started, B}}
>>>> > start(B) -> ok
>>>> > start(A) -> {error,  {not_started, B}}
>>>> > ... [repeat endlessly]
>>>> >
>>>> > Now, it could be fixed by adding a permission check in the looping
>>>> function, but this raises the question of what should happen in the above
>>>> case. Three alternatives:
>>>> >
>>>> > 1. ensure_all_started(A) returns {error, {not_permitted, B}}, or
>>>> something
>>>> > 2. the call hangs until the flag(s) change, but start(B) is only
>>>> called once.
>>>> > 3. Warn against the use of permissions in the docs, and deprecate
>>>> them.
>>>> >
>>>> > I'm assuming that most of you may not even know about permissions.
>>>> They were introduced back in 1996-97 (I believe), when I and Martin
>>>> Björklund were going back and forth on how to support distributed
>>>> applications and cluster control. Eventually, this led to dist_ac and the
>>>> protocol being defined, so that users could write a controller app taking
>>>> control of an application and giving instructions on where it should run.
>>>> In the AXD301, this was done by the RCM application. I believe I talked
>>>> about it at the EUC 1997, but it's hard to find information about that on
>>>> the web. :)
>>>> >
>>>> > Anyway, permissions were left in the API, and ARE documented.
>>>> >
>>>> > Thoughts?
>>>>
>>>> I know we've used the permissions mechanism occasionally during
>>>> maintenance or live upgrades. Off-hand I don't know if we'd want
>>>> alternative 1 or 2 (my colleague Daniel Szoboszlay might know more
>>>> about this).
>>>>
>>>> /Mikael
>>>>
>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20210323/192a1c20/attachment.htm>


More information about the erlang-questions mailing list