Supervisor got noproc (looks like a bug)
Loïc Hoguin
lhoguin@REDACTED
Thu Sep 9 14:20:35 CEST 2021
We are getting a variant of this sometimes when shutting down RabbitMQ:
2021-09-08 14:14:08.886 [error] <0.4805.15> Supervisor {<0.4805.15>,rabbit_channel_sup_sup} had child channel_sup started with rabbit_channel_sup:start_link() at undefined exit with reason shutdown in context shutdown_error
2021-09-08 14:14:08.887 [info] <0.29845.15> supervisor: {<0.29845.15>,rabbit_channel_sup_sup}, errorContext: shutdown_error, reason: shutdown, offender: [{nb_children,1},{id,channel_sup},{mfargs,{rabbit_channel_sup,start_link,[]}},{restart_type,temporary},{shutdown,infinity},{child_type,supervisor}]
This happens because channels are closed automatically when their connection goes away. Since we are shutting down, all connections go away and the channels shut themselves down with reason shutdown, while at the same time the application is stopping and eventually tries to shut down a channel that is already in the process of shutting down.
The supervisor monitors, unlinks, then receives the shutdown 'DOWN' message, just before it would send its own exit(Pid, shutdown).
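Paraphrasing the relevant supervisor code from memory (simplified, not the exact stdlib source), the sequence is roughly:

%% Paraphrased, simplified sketch of supervisor:monitor_child/1 (not the
%% exact stdlib source). The monitor is set up before the unlink so a
%% child that dies in between still produces a 'DOWN'.
monitor_child(Pid) ->
    erlang:monitor(process, Pid),
    unlink(Pid),
    receive
        {'EXIT', Pid, Reason} ->
            %% The child already exited and its exit signal is in our
            %% mailbox: consume the matching 'DOWN' and return the real
            %% reason, e.g. {error, shutdown} in the case above.
            receive
                {'DOWN', _, process, Pid, _} ->
                    {error, Reason}
            end
    after 0 ->
        %% No 'EXIT' yet; the caller will send exit(Pid, shutdown) and
        %% wait for the 'DOWN' instead.
        ok
    end.

When the channel's 'EXIT' with reason shutdown is already in the mailbox by the time unlink runs, this sketch returns {error, shutdown}, the exit(Pid, shutdown) is never sent, and the reason ends up reported as a shutdown_error.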
Perhaps there should be another clause here to exclude the shutdown/noproc reasons? It doesn't accomplish much other than polluting logs.
https://github.com/erlang/otp/blob/master/lib/stdlib/src/supervisor.erl#L952-L953
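For illustration, the kind of clause I mean would look roughly like this (a simplified sketch with placeholder names, not the actual supervisor.erl code):

%% Sketch only: placeholder names, heavily simplified.
case shutdown(Child) of
    ok ->
        ok;
    {error, Reason} when Reason =:= shutdown; Reason =:= noproc ->
        %% Proposed: the child was already shutting down or already gone,
        %% so a shutdown_error report only pollutes the logs.
        ok;
    {error, OtherReason} ->
        report_error(shutdown_error, OtherReason, Child, SupName)
end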
Cheers,
Loïc Hoguin
On 07/09/2021 19:21, Alexander Petrovsky <askjuise@REDACTED> wrote:
Hello!
Recently, during a performance test, I noticed that when the supervisor
tries to terminate children it sometimes gets 'noproc'. My Erlang
version is OTP-23.2.5.
Under certain circumstances, the child (a
'gen_server:start_link/3' process) decides to stop with '{stop,
normal, State}'. At the same time, another process decides to stop
this child with 'supervisor:terminate_child/2'.
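For illustration, here is a minimal sketch of the situation (the module, names and timings are mine, just to keep it self-contained; whether the race triggers depends on timing):

%% One module doubling as supervisor callback and gen_server callback,
%% only to keep the sketch self-contained.
-module(noproc_race).
-export([run/0, start_worker/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

init(sup) ->
    SupFlags = #{strategy => simple_one_for_one},
    ChildSpec = #{id => worker,
                  start => {?MODULE, start_worker, []},
                  restart => temporary,
                  shutdown => 5000},
    {ok, {SupFlags, [ChildSpec]}};
init(worker) ->
    %% The worker decides on its own to stop normally almost immediately.
    erlang:send_after(1, self(), stop_now),
    {ok, no_state}.

start_worker() ->
    gen_server:start_link(?MODULE, worker, []).

handle_call(_Req, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.
handle_info(stop_now, State) -> {stop, normal, State}.

run() ->
    {ok, Sup} = supervisor:start_link(?MODULE, sup),
    {ok, Pid} = supervisor:start_child(Sup, []),
    timer:sleep(1),
    %% Meanwhile another process asks the supervisor to terminate the same
    %% child; when this lands right as the worker exits normally, the
    %% supervisor can end up reporting noproc.
    supervisor:terminate_child(Sup, Pid).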
In some (rare, but reproducible) cases I've got in logs:
supervisor: {local,my_sup}
errorContext: shutdown_error
reason: noproc
offender: [{pid,<0.4890.1539>},
{id,my},
{mfargs,{my,start_link,[]}},
{restart_type,temporary},
{shutdown,5000},
{child_type,worker}]
(supervisor:do_terminate/2)
According to
https://github.com/erlang/otp/blob/OTP-23.2.5/lib/stdlib/src/supervisor.erl#L873-L919
this could happen when the child terminates with the 'noproc' reason, or
when the "child did unlink and the child dies before monitor". But the
termination reason in all cases is '{stop, normal, State}' and the
children never call 'erlang:unlink/1' at all!
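For reference, my paraphrase of the shutdown path from the lines linked above (simplified, not the exact OTP-23.2.5 source):

%% Paraphrased, simplified sketch of supervisor:shutdown/2 (not the exact
%% stdlib source).
shutdown(Pid, Time) ->
    case monitor_child(Pid) of
        ok ->
            exit(Pid, shutdown),
            receive
                {'DOWN', _Ref, process, Pid, shutdown} ->
                    ok;
                {'DOWN', _Ref, process, Pid, OtherReason} ->
                    %% e.g. noproc when the child was already gone before
                    %% the monitor was set up
                    {error, OtherReason}
            after Time ->
                exit(Pid, kill),
                receive
                    {'DOWN', _Ref, process, Pid, OtherReason} ->
                        {error, OtherReason}
                end
            end;
        {error, Reason} ->
            {error, Reason}
    end.

If the child is already dead when the monitor is set up, the 'DOWN' carries noproc and falls through to the {error, OtherReason} clause, which is what then gets reported, even though the child only ever stopped with '{stop, normal, State}'.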
I've gathered the trace with:
dbg:tracer(process, {fun(Msg,_) -> syslog:info_msg("Trace ~0p~n",
[Msg]), 0 end, 0}).
dbg:p(whereis(my_sup), [m, c]).
dbg:tp(erlang,exit,[{'_',[],[{return_trace}]}]).
dbg:tp(erlang,unlink,[{'_',[],[{return_trace}]}]).
dbg:tp(erlang,monitor,[{'_',[],[{return_trace}]}]).
dbg:p(all, c).
And I see different flows! When the supervisor gets 'noproc', the
flow looks like:
2021-09-07T15:53:56.431343Z <0.5314.0> my: started
2021-09-07T15:53:56.433941Z Trace
{trace,<0.1089.0>,'receive',{ack,<0.5314.0>,{ok,<0.5314.0>}}}
2021-09-07T15:53:56.434081Z Trace
{trace,<0.1089.0>,send,{#Ref<0.1919367765.3881566209.208162>,{ok,<0.5314.0>}},<0.5284.0>}
2021-09-07T15:53:56.442202Z <0.5314.0> my: finished
2021-09-07T15:53:56.447844Z Trace
{trace,<0.1089.0>,'receive',{'$gen_call',{<0.5284.0>,#Ref<0.1919367765.3881566210.214548>},{terminate_child,<0.5314.0>}}}
2021-09-07T15:53:56.543227Z Trace
{trace,<0.1089.0>,call,{erlang,monitor,[process,<0.5314.0>]}}
2021-09-07T15:53:56.543352Z Trace
{trace,<0.1089.0>,return_from,{erlang,monitor,2},#Ref<0.1919367765.3881566212.211028>}
2021-09-07T15:53:56.543467Z Trace
{trace,<0.1089.0>,call,{erlang,unlink,[<0.5314.0>]}}
2021-09-07T15:53:56.543676Z Trace
{trace,<0.1089.0>,'receive',{'DOWN',#Ref<0.1919367765.3881566212.211028>,process,<0.5314.0>,noproc}}
2021-09-07T15:53:56.543854Z Trace
{trace,<0.1089.0>,call,{erlang,exit,[<0.5314.0>,shutdown]}}
In all other (normal) cases the flow looks like:
2021-09-07T15:54:24.489280Z <0.6838.0> my: started
2021-09-07T15:54:24.491185Z Trace
{trace,<0.1089.0>,'receive',{ack,<0.6838.0>,{ok,<0.6838.0>}}}
2021-09-07T15:54:24.491344Z Trace
{trace,<0.1089.0>,send,{#Ref<0.1919367765.3881566210.231347>,{ok,<0.6838.0>}},<0.6809.0>}
2021-09-07T15:54:24.514841Z <0.6838.0> my: finished
2021-09-07T15:54:24.529580Z Trace
{trace,<0.1089.0>,'receive',{'$gen_call',{<0.6809.0>,#Ref<0.1919367765.3881566209.224355>},{terminate_child,<0.6838.0>}}}
2021-09-07T15:54:24.616316Z Trace
{trace,<0.1089.0>,'receive',{'EXIT',<0.6838.0>,normal}}
2021-09-07T15:54:24.618596Z Trace
{trace,<0.1089.0>,call,{erlang,monitor,[process,<0.6838.0>]}}
2021-09-07T15:54:24.618760Z Trace
{trace,<0.1089.0>,call,{erlang,unlink,[<0.6838.0>]}}
2021-09-07T15:54:24.619175Z Trace
{trace,<0.1089.0>,'receive',{'DOWN',#Ref<0.1919367765.3881566212.227125>,process,<0.6838.0>,noproc}}
2021-09-07T15:54:24.619275Z Trace
{trace,<0.1089.0>,send,{#Ref<0.1919367765.3881566209.224355>,ok},<0.6809.0>}
Does this look like a very tricky race, or did Erlang lose EXIT messages?
--
Alexander Petrovsky