Supervisor got noproc (looks like a bug)
Loïc Hoguin
lhoguin@REDACTED
Thu Sep 9 14:20:35 CEST 2021
We are getting a variant of this sometimes when shutting down RabbitMQ:
2021-09-08 14:14:08.886 [error] <0.4805.15> Supervisor {<0.4805.15>,rabbit_channel_sup_sup} had child channel_sup started with rabbit_channel_sup:start_link() at undefined exit with reason shutdown in context shutdown_error
2021-09-08 14:14:08.887 [info] <0.29845.15> supervisor: {<0.29845.15>,rabbit_channel_sup_sup}, errorContext: shutdown_error, reason: shutdown, offender: [{nb_children,1},{id,channel_sup},{mfargs,{rabbit_channel_sup,start_link,[]}},{restart_type,temporary},{shutdown,infinity},{child_type,supervisor}]
This happens because channels are closed automatically when their connection goes away. Since we are shutting down, all connections go away and the channels shut themselves down with reason shutdown, while at the same time the application is stopping and eventually tries to shut down a channel that is already in the process of shutting down.
The supervisor monitors, unlinks, then receives the shutdown 'DOWN' message, just before it would send its own exit(Pid, shutdown).
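Paraphrasing the relevant supervisor code from memory (simplified, not the exact stdlib source), the sequence is roughly:

%% Paraphrased, simplified sketch of supervisor:monitor_child/1 (not the
%% exact stdlib source). The monitor is set up before the unlink so a
%% child that dies in between still produces a 'DOWN'.
monitor_child(Pid) ->
    erlang:monitor(process, Pid),
    unlink(Pid),
    receive
        {'EXIT', Pid, Reason} ->
            %% The child already exited and its exit signal is in our
            %% mailbox: consume the matching 'DOWN' and return the real
            %% reason, e.g. {error, shutdown} in the case above.
            receive
                {'DOWN', _, process, Pid, _} ->
                    {error, Reason}
            end
    after 0 ->
        %% No 'EXIT' yet; the caller will send exit(Pid, shutdown) and
        %% wait for the 'DOWN' instead.
        ok
    end.

When the channel's 'EXIT' with reason shutdown is already in the mailbox by the time unlink runs, this sketch returns {error, shutdown}, the exit(Pid, shutdown) is never sent, and the reason ends up reported as a shutdown_error.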
Perhaps there should be another clause here to exclude the shutdown/noproc reasons? It doesn't accomplish much other than polluting logs.
https://github.com/erlang/otp/blob/master/lib/stdlib/src/supervisor.erl#L952-L953
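For illustration, the kind of clause I mean would look roughly like this (a simplified sketch with placeholder names, not the actual supervisor.erl code):

%% Sketch only: placeholder names, heavily simplified.
case shutdown(Child) of
    ok ->
        ok;
    {error, Reason} when Reason =:= shutdown; Reason =:= noproc ->
        %% Proposed: the child was already shutting down or already gone,
        %% so a shutdown_error report only pollutes the logs.
        ok;
    {error, OtherReason} ->
        report_error(shutdown_error, OtherReason, Child, SupName)
end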
Cheers,
Loïc Hoguin
On 07/09/2021 19:21, Alexander Petrovsky <askjuise@REDACTED> wrote:
Hello!
Recently, during a performance test, I noticed that when the supervisor
tries to terminate children it sometimes gets 'noproc'. My Erlang
version is OTP-23.2.5.
Under certain circumstances, the child (a
'gen_server:start_link/3' process) decides to stop with '{stop,
normal, State}'. At the same time, another process decides to stop
this child with 'supervisor:terminate_child/2'.
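For illustration, here is a minimal sketch of the situation (the module, names and timings are mine, just to keep it self-contained; whether the race triggers depends on timing):

%% One module doubling as supervisor callback and gen_server callback,
%% only to keep the sketch self-contained.
-module(noproc_race).
-export([run/0, start_worker/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

init(sup) ->
    SupFlags = #{strategy => simple_one_for_one},
    ChildSpec = #{id => worker,
                  start => {?MODULE, start_worker, []},
                  restart => temporary,
                  shutdown => 5000},
    {ok, {SupFlags, [ChildSpec]}};
init(worker) ->
    %% The worker decides on its own to stop normally almost immediately.
    erlang:send_after(1, self(), stop_now),
    {ok, no_state}.

start_worker() ->
    gen_server:start_link(?MODULE, worker, []).

handle_call(_Req, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.
handle_info(stop_now, State) -> {stop, normal, State}.

run() ->
    {ok, Sup} = supervisor:start_link(?MODULE, sup),
    {ok, Pid} = supervisor:start_child(Sup, []),
    timer:sleep(1),
    %% Meanwhile another process asks the supervisor to terminate the same
    %% child; when this lands right as the worker exits normally, the
    %% supervisor can end up reporting noproc.
    supervisor:terminate_child(Sup, Pid).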
In some (rare, but reproducible) cases I've got in logs:
supervisor: {local,my_sup}
errorContext: shutdown_error
reason: noproc
offender: [{pid,<0.4890.1539>},
{id,my},
{mfargs,{my,start_link,[]}},
{restart_type,temporary},
{shutdown,5000},
{child_type,worker}]
(supervisor:do_terminate/2)
According to
https://github.com/erlang/otp/blob/OTP-23.2.5/lib/stdlib/src/supervisor.erl#L873-L919
this could happen when the child terminates with the 'noproc' reason, or
when the "child did unlink and the child dies before monitor". But the
termination reason in all cases is '{stop, normal, State}' and the
children never call 'erlang:unlink/1' at all!
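For reference, my paraphrase of the shutdown path from the lines linked above (simplified, not the exact OTP-23.2.5 source):

%% Paraphrased, simplified sketch of supervisor:shutdown/2 (not the exact
%% stdlib source).
shutdown(Pid, Time) ->
    case monitor_child(Pid) of
        ok ->
            exit(Pid, shutdown),
            receive
                {'DOWN', _Ref, process, Pid, shutdown} ->
                    ok;
                {'DOWN', _Ref, process, Pid, OtherReason} ->
                    %% e.g. noproc when the child was already gone before
                    %% the monitor was set up
                    {error, OtherReason}
            after Time ->
                exit(Pid, kill),
                receive
                    {'DOWN', _Ref, process, Pid, OtherReason} ->
                        {error, OtherReason}
                end
            end;
        {error, Reason} ->
            {error, Reason}
    end.

If the child is already dead when the monitor is set up, the 'DOWN' carries noproc and falls through to the {error, OtherReason} clause, which is what then gets reported, even though the child only ever stopped with '{stop, normal, State}'.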
I've gathered the trace with:
dbg:tracer(process, {fun(Msg,_) -> syslog:info_msg("Trace ~0p~n",
[Msg]), 0 end, 0}).
dbg:p(whereis(my_sup), [m, c]).
dbg:tp(erlang,exit,[{'_',[],[{return_trace}]}]).
dbg:tp(erlang,unlink,[{'_',[],[{return_trace}]}]).
dbg:tp(erlang,monitor,[{'_',[],[{return_trace}]}]).
dbg:p(all, c).
And I see different flows! When the supervisor gets 'noproc', the
flow looks like:
2021-09-07T15:53:56.431343Z <0.5314.0> my: started
2021-09-07T15:53:56.433941Z Trace
{trace,<0.1089.0>,'receive',{ack,<0.5314.0>,{ok,<0.5314.0>}}}
2021-09-07T15:53:56.434081Z Trace
{trace,<0.1089.0>,send,{#Ref<0.1919367765.3881566209.208162>,{ok,<0.5314.0>}},<0.5284.0>}
2021-09-07T15:53:56.442202Z <0.5314.0> my: finished
2021-09-07T15:53:56.447844Z Trace
{trace,<0.1089.0>,'receive',{'$gen_call',{<0.5284.0>,#Ref<0.1919367765.3881566210.214548>},{terminate_child,<0.5314.0>}}}
2021-09-07T15:53:56.543227Z Trace
{trace,<0.1089.0>,call,{erlang,monitor,[process,<0.5314.0>]}}
2021-09-07T15:53:56.543352Z Trace
{trace,<0.1089.0>,return_from,{erlang,monitor,2},#Ref<0.1919367765.3881566212.211028>}
2021-09-07T15:53:56.543467Z Trace
{trace,<0.1089.0>,call,{erlang,unlink,[<0.5314.0>]}}
2021-09-07T15:53:56.543676Z Trace
{trace,<0.1089.0>,'receive',{'DOWN',#Ref<0.1919367765.3881566212.211028>,process,<0.5314.0>,noproc}}
2021-09-07T15:53:56.543854Z Trace
{trace,<0.1089.0>,call,{erlang,exit,[<0.5314.0>,shutdown]}}
In all other (normal) cases the flow looks like:
2021-09-07T15:54:24.489280Z <0.6838.0> my: started
2021-09-07T15:54:24.491185Z Trace
{trace,<0.1089.0>,'receive',{ack,<0.6838.0>,{ok,<0.6838.0>}}}
2021-09-07T15:54:24.491344Z Trace
{trace,<0.1089.0>,send,{#Ref<0.1919367765.3881566210.231347>,{ok,<0.6838.0>}},<0.6809.0>}
2021-09-07T15:54:24.514841Z <0.6838.0> my: finished
2021-09-07T15:54:24.529580Z Trace
{trace,<0.1089.0>,'receive',{'$gen_call',{<0.6809.0>,#Ref<0.1919367765.3881566209.224355>},{terminate_child,<0.6838.0>}}}
2021-09-07T15:54:24.616316Z Trace
{trace,<0.1089.0>,'receive',{'EXIT',<0.6838.0>,normal}}
2021-09-07T15:54:24.618596Z Trace
{trace,<0.1089.0>,call,{erlang,monitor,[process,<0.6838.0>]}}
2021-09-07T15:54:24.618760Z Trace
{trace,<0.1089.0>,call,{erlang,unlink,[<0.6838.0>]}}
2021-09-07T15:54:24.619175Z Trace
{trace,<0.1089.0>,'receive',{'DOWN',#Ref<0.1919367765.3881566212.227125>,process,<0.6838.0>,noproc}}
2021-09-07T15:54:24.619275Z Trace
{trace,<0.1089.0>,send,{#Ref<0.1919367765.3881566209.224355>,ok},<0.6809.0>}
Does this look like a very tricky race, or did Erlang lose EXIT messages?
--
Alexander Petrovsky