<div dir="ltr"><div>We've noticed interesting things with RabbitMQ locking up while trying to log data (with Lager) after massive disconnect storms where thousands of connections suddenly vanish.</div><div>The issue is however not related to RabbitMQ itself.</div><div><br></div><div>The set-up for the logging is essentially two gen_events talking to each other:</div><div><ul><li>There's a process called <span style="font-family:monospace">rabbit_log_connection_lager_event</span> that just receives log messages and forwards them to <span style="font-family:monospace">lager_event</span>; this forward can either be sync or async, but has no control flow going on (<a href="https://github.com/rabbitmq/rabbitmq-common/blob/v3.8.5/src/lager_forwarder_backend.erl#L109">https://github.com/rabbitmq/rabbitmq-common/blob/v3.8.5/src/lager_forwarder_backend.erl#L109</a>) <br></li><li>The sync or async setting is actually handled by its target process (<span style="font-family:monospace">lager_event</span>) dynamically setting that value based on overload <a href="https://github.com/erlang-lager/lager/blob/06f676aad8ed823b7838a70288cd4d13b5a25ede/src/lager_backend_throttle.erl#L68">https://github.com/erlang-lager/lager/blob/06f676aad8ed823b7838a70288cd4d13b5a25ede/src/lager_backend_throttle.erl#L68</a>)</li><li>The rabbitmq's <span style="font-family:monospace">gen_event</span> only has the one handler above in it (<span style="font-family:monospace">lager_forwarder_backend</span>)</li><li>The lager <span style="font-family:monospace">gen_event</span> has two handlers in it, the <span style="font-family:monospace">lager_console_backend</span> (<a href="https://github.com/erlang-lager/lager/blob/3.8.0/src/lager_console_backend.erl">https://github.com/erlang-lager/lager/blob/3.8.0/src/lager_console_backend.erl</a>) and the <span style="font-family:monospace">lager_backend_throttle</span> (<a href="https://github.com/erlang-lager/lager/blob/3.8.0/src/lager_backend_throttle.erl">https://github.com/erlang-lager/lager/blob/3.8.0/src/lager_backend_throttle.erl</a>)</li><li>The sync_notify/async implementation in <span style="font-family:monospace">gen_event</span> is done at a lower level that relies on whether all the log handlers' calls have returned or not before sending an acknowledgement: <a href="https://github.com/erlang/otp/blob/OTP-22.0/lib/stdlib/src/gen_event.erl#L560">https://github.com/erlang/otp/blob/OTP-22.0/lib/stdlib/src/gen_event.erl#L560</a> called by <a href="https://github.com/erlang/otp/blob/OTP-22.0/lib/stdlib/src/gen_event.erl#L302-L305">https://github.com/erlang/otp/blob/OTP-22.0/lib/stdlib/src/gen_event.erl#L302-L305</a>. Any failure is going to be caught at the <span style="font-family:monospace">gen_event</span> level. Do note that the synchronous call in <span style="font-family:monospace">gen_event:sync_notify()</span> calls <span style="font-family:monospace">gen_event:rpc()</span>, which in turns calls <span style="font-family:monospace">gen:call()</span> with an <span style="font-family:monospace">infinity</span> timeout. There is no way to configure this value<br></li><li>There's not any place in there that seems like it could be snooping in the mailbox and dropping messages it shouldn't<br></li></ul></div><div>So here are the observations made on the system:</div><div><ul><li>There are 18,000 messages stuck in the rabbitmq's gen_event:<br><span style="font-family:monospace">(rabbit@rabbitmq-1-0)2> recon:proc_count(message_queue_len, 10).<br>[{<0.127.0>,18859,<br>  [rabbit_log_connection_lager_event,<br>   {current_function,{gen,do_call,4}},<br>   {initial_call,{proc_lib,init_p,5}}]},<br> ...</span></li><li>The <span style="font-family:monospace">gen_event</span> is stuck in <span style="font-family:monospace">gen:do_call/4</span>, which has an infinite timeout:<br><span style="font-family:monospace"> {location,[{initial_call,{proc_lib,init_p,5}},<br>            {current_stacktrace,[{gen,do_call,4,<br>                                      [{file,"gen.erl"},{line,167}]},<br>                                 {gen_event,rpc,2,[{file,"gen_event.erl"},{line,239}]},<br>                                 {lager_forwarder_backend,handle_event,2,<br>                                                          [{file,"src/lager_forwarder_backend.erl"},{line,109}]},<br>                                 {gen_event,server_update,4,<br>                                            [{file,"gen_event.erl"},{line,577}]},<br>                                 {gen_event,server_notify,4,<br>                                            [{file,"gen_event.erl"},{line,559}]},<br>                                 {gen_event,handle_msg,6,[{file,"gen_event.erl"},{line,303}]},<br>                                 {proc_lib,init_p_do_apply,3,<br>                                           [{file,"proc_lib.erl"},{line,249}]}]}]},</span></li><li>By looking at the processes monitored by the RabbitMQ <span style="font-family:monospace">gen_event</span>, I find that it's blocked on <span style="font-family:monospace">lager_event</span> (within <span style="font-family:monospace">lager</span>), which has an empty mailbox and is just patiently waiting for a message:<br> <span style="font-family:monospace">{location,[{initial_call,{proc_lib,init_p,5}},<br>            {current_stacktrace,[{gen_event,fetch_msg,6,<br>                                            [{file,"gen_event.erl"},{line,280}]},<br>                                 {proc_lib,init_p_do_apply,3,<br>                                           [{file,"proc_lib.erl"},{line,249}]}]}]},<br>...</span><br></li></ul><div><br></div><div>So clearly, the <span style="font-family:monospace">sync_notify</span> call was made (hence we see the <span style="font-family:monospace">gen_event:rpc()</span> call in the stacktrace), is not active on the receiving end (which is waiting to receive something else). The infinity timeout leaves us in a locked pattern with an ever-growing queue.</div><div><br></div><div>We have managed to reproduce this twice after hours of work, but currently don't have living instances suffering the issue.<br></div><div><br></div></div><div>The two things I have as a theory for now is either:</div><div><ol><li>the reply was received but not processed (the bug at <a href="https://bugs.erlang.org/browse/ERL-1076">https://bugs.erlang.org/browse/ERL-1076</a> could be at play but the bug report's optimization requires a conditional that doesn't match the format of <span style="font-family:monospace">gen:call()</span> here)</li><li>the reply was "sent" but never actually enqueued in the destination process</li></ol><div>I am reminded of a problem we had on some older Erlang instances at an older job where I node would suddenly lock up entirely; the pattern we'd see was the same but within the groups and io:put_chars never receiving an IO protocol confirmation that the data had left the system. The process was stuck waiting for a response which the other process either already sent or never sent (it was also in a waiting state stuck in a receive according to its stacktrace).</div><div><br></div><div>At this point I'm not quite sure where to look. It seems that under some very rare scenarios, message passing isn't quite behaving right. A timeout in the code would hide that fact and let the system eventually recover, but the specific pattern at hand in gen_event.erl does not allow for it.</div><div>Nothing in the Erlang logic on either side seems to be able to properly explain the issue.</div><div><br></div><div>I'm also nervous to call out the VM's message enqueuing (or dequeuing) mechanism as possibly responsible for this bug, but maybe someone else has a different reading from mine?</div><div><br></div><div>Regards,</div><div>Fred.<br></div></div></div>