BUG: fatal interaction between application:ensure_all_started(A) and permit(B, false)

Tue Mar 23 13:13:04 CET 2021

I initially implemented `application:ensure_all_started/1,2' because it
made for an easier get-started mechanism than getting a full blown release,
and avoided having people do the annoying "try until it works" garbage
routine by hand (which the function is now doing), or having to just line
up all the start calls at the beginning of a test run.

I would be in favour of just letting it fail and return back {error, {b,
{permit, false}}} -- there are more variations we could use, but this term
allows to:
- specifically mention it's b that isn't allowed to start
- uses the 'permit' keyword that matches the function name, such that
googling something like "erlang permit false" will yield things such as the
doc page for application:permit/2 (or this thread, given how google
indexing works)

That being said, it does not line up with the current API usage, which
would probably need some fixing of some sort? Look at this session:

1> application:load(ssl).
ok
2> application:permit(ssl, false).
ok
3> application:ensure_all_started(crypto).
{ok,[crypto]}
4> application:ensure_all_started(public_key).
{ok,[asn1,public_key]}
5> application:start(ssl).
ok
6> application:stop(ssl).
{error,{not_started,ssl}}

Starting the application "works" even if it does not with permissions; the
call silently fails while returning it succeeded. Since the current
documentation for ensure_all_started states the following:

> The function reports {error, {AppName,Reason}} for errors, where Reason
> is any possible reason returned by start/1,2
> <http://erlang.org/doc/apps/kernel/application.html#start-1> when
> starting a specific dependency.
>
We are in a situation where ensure_all_started can't or won't know that
permissions are at play (I didn't when I implemented it), and it appears
that we would need to do an explicit permission check before each run to
properly return the error explaining the issue without breaking the
compatibility of start/1,2. The gotcha here is that there's apparently no
way to access this status aside from application_controller internals that
would need some extra visibility, which start/1,2 calls on its own as well.

Regards,
Fred.

On Mon, Mar 22, 2021 at 1:06 PM Ulf Wiger <ulf@REDACTED> wrote:

> Hmm, trying some more with OTP 24, it addresses the problem with the
> memory growth, but still isn't permission-aware.
>
> Consider test apps a and b, where a depends on b.
>
> 15> application:permit(b,false).
> ok
> 16> application:ensure_all_started(a).
> {error,{a,{not_started,b}}}
> 17> application:which_applications().
> [{stdlib,"ERTS  CXC 138 10","3.15"},
>  {kernel,"ERTS  CXC 138 10","8.0"}]
> 18> application:permit(b,true).
> ok
> 19> application:which_applications().
> [{b,"test app","0.1"},
>  {stdlib,"ERTS  CXC 138 10","3.15"},
>  {kernel,"ERTS  CXC 138 10","8.0"}]
>
> The call to application:ensure_all_started(a) fails, and supposedly all
> child apps that were started will have been stopped again, and it does look
> that way.
>
> But if we later permit b to run, it turns out that the start request
> wasn't actually removed, and b pops up.
>
> This is for sure a much less serious problem than the previous one.
>
> However, I'm not sure if returning error is actually the right thing to do
> there. The call SHOULD probably hang.
>
> Comments?
>
> BR,
> Ulf W
>
> On Mon, Mar 22, 2021 at 1:23 PM Ulf Wiger <ulf@REDACTED> wrote:
>
>> When I started looking closer into this, it would appear as if there is a
>> long-standing bug in the application_controller regarding permissions.
>>
>> And with "long-standing" I mean that it was there even when Kostis did
>> some Tidier-based cleanup 11 years ago. Kostis didn't introduce it, though.
>>
>> When servicing a start request, the application_controller, if
>> permission(App) == false, adds a new entry to the `start_p_false` list,
>> i.e. a new entry for each request.
>>
>> https://github.com/erlang/otp/blob/master/lib/kernel/src/application_controller.erl#L689-L690
>>
>> ... but when servicing a subsequent {permit_application, App, true}, it
>> uses lists:keydelete/3 to remove the App from the `start_p_false` list.
>>
>> https://github.com/erlang/otp/blob/master/lib/kernel/src/application_controller.erl#L759-L761
>>
>> lists:keydelete/3 obviously only removes the first matching entry.
>>
>> Earlier in that function, it also only locates the first pending request
>> (or rather, chronologically the last), and uses the `From` in
>> `spawn_starter()'.
>>
>> The rest of the pending requests should be handled somewhere - likely in
>> `handle_application_started/3`, but aren't.
>>
>> BR,
>> Ulf W
>>
>> On Sat, Mar 20, 2021 at 1:45 PM Mikael Pettersson <mikpelinux@REDACTED>
>> wrote:
>>
>>> On Sat, Mar 20, 2021 at 8:46 AM Ulf Wiger <ulf@REDACTED> wrote:
>>> >
>>> > I had the brilliant idea of using application permissions for a
>>> particular use case. This seemed to work perfectly, until I ran `rebar3
>>> shell`, and spotted some disturbing behavior.
>>> >
>>> > The bug, apparently, lies in that `application:ensure_all_started(A)`
>>> ends up busy-looping if A depends on B, and permission(B) -> false. What's
>>> worse, for each call to start(B), the application controller notices the
>>> permission flag, returns `ok` and inserts an entry in its internal
>>> `start_p_false` list. This amounts to a memory leak.
>>> >
>>> > I commented it in a tweet, then decided to try to find the source,
>>> esp. since I suspected `application:ensure_all_started/1`.
>>> >
>>> > https://twitter.com/uwiger/status/1372944356781531136
>>> >
>>> > In short, if permission(B) -> false, what happens is:
>>> > start(A) -> {error, {not_started, B}}
>>> > start(B) -> ok
>>> > start(A) -> {error,  {not_started, B}}
>>> > ... [repeat endlessly]
>>> >
>>> > Now, it could be fixed by adding a permission check in the looping
>>> function, but this raises the question of what should happen in the above
>>> case. Three alternatives:
>>> >
>>> > 1. ensure_all_started(A) returns {error, {not_permitted, B}}, or
>>> something
>>> > 2. the call hangs until the flag(s) change, but start(B) is only
>>> called once.
>>> > 3. Warn against the use of permissions in the docs, and deprecate them.
>>> >
>>> > I'm assuming that most of you may not even know about permissions.
>>> They were introduced back in 1996-97 (I believe), when I and Martin
>>> Björklund were going back and forth on how to support distributed
>>> applications and cluster control. Eventually, this led to dist_ac and the
>>> protocol being defined, so that users could write a controller app taking
>>> control of an application and giving instructions on where it should run.
>>> In the AXD301, this was done by the RCM application. I believe I talked
>>> about it at the EUC 1997, but it's hard to find information about that on
>>> the web. :)
>>> >
>>> > Anyway, permissions were left in the API, and ARE documented.
>>> >
>>> > Thoughts?
>>>
>>> I know we've used the permissions mechanism occasionally during
>>> maintenance or live upgrades. Off-hand I don't know if we'd want
>>> alternative 1 or 2 (my colleague Daniel Szoboszlay might know more
>>> about this).
>>>
>>> /Mikael
>>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20210323/5a68a3a3/attachment.htm>