The Mystery of the Vanishing Message's Dead Lock

Fred Hebert mononcqc@REDACTED
Thu Jul 23 18:18:06 CEST 2020


We've noticed RabbitMQ nodes locking up while trying to log data (with
Lager) after massive disconnect storms where thousands of connections
suddenly vanish. The issue, however, is not related to RabbitMQ itself.

The set-up for the logging is essentially two gen_events talking to each
other:

   - There's a process called rabbit_log_connection_lager_event that just
   receives log messages and forwards them to lager_event; this forward can
   be either sync or async, but there is no flow control going on (see the
   sketch after this list;
   https://github.com/rabbitmq/rabbitmq-common/blob/v3.8.5/src/lager_forwarder_backend.erl#L109)

   - The sync or async setting is actually handled by its target process
   (lager_event), which dynamically sets that value based on overload (
   https://github.com/erlang-lager/lager/blob/06f676aad8ed823b7838a70288cd4d13b5a25ede/src/lager_backend_throttle.erl#L68
   )
   - RabbitMQ's gen_event only has the one handler above in it (
   lager_forwarder_backend)
   - The lager gen_event has two handlers in it, the lager_console_backend (
   https://github.com/erlang-lager/lager/blob/3.8.0/src/lager_console_backend.erl)
   and the lager_backend_throttle (
   https://github.com/erlang-lager/lager/blob/3.8.0/src/lager_backend_throttle.erl
   )
   - The sync_notify/async implementation in gen_event is done at a lower
   level that relies on whether all the log handlers' calls have returned or
   not before sending an acknowledgement:
   https://github.com/erlang/otp/blob/OTP-22.0/lib/stdlib/src/gen_event.erl#L560
   called by
   https://github.com/erlang/otp/blob/OTP-22.0/lib/stdlib/src/gen_event.erl#L302-L305.
   Any failure is going to be caught at the gen_event level. Do note that
   the synchronous call in gen_event:sync_notify() calls gen_event:rpc(),
   which in turn calls gen:call() with an infinity timeout. There is no
   way to configure this value.
   - There's no place in there that seems like it could be snooping in
   the mailbox and dropping messages it shouldn't.
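
To make the forwarding step concrete, here's a minimal sketch of what the
forwarder's event callback boils down to, assuming the handler keeps the
target manager and the current async flag in its state (the record and
field names are illustrative, not the actual lager_forwarder_backend ones):

    -record(state, {next = lager_event :: atom(),
                    async = true :: boolean()}).

    %% Sketch only: forward each event to the next gen_event manager,
    %% either asynchronously or synchronously depending on the flag
    %% that the throttle toggles.
    handle_event(Event, State = #state{next = Next, async = Async}) ->
        case Async of
            true  -> gen_event:notify(Next, Event);      %% returns immediately
            false -> gen_event:sync_notify(Next, Event)  %% blocks until every
                                                         %% handler in Next is done
        end,
        {ok, State}.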

So here are the observations made on the system (the shell calls behind
them are sketched after the list):

   - There are roughly 18,000 messages stuck in RabbitMQ's gen_event:
   (rabbit@REDACTED)2> recon:proc_count(message_queue_len, 10).
   [{<0.127.0>,18859,
     [rabbit_log_connection_lager_event,
      {current_function,{gen,do_call,4}},
      {initial_call,{proc_lib,init_p,5}}]},
    ...
   - The gen_event is stuck in gen:do_call/4, which has an infinite timeout:
     {location,[{initial_call,{proc_lib,init_p,5}},
                {current_stacktrace,
                 [{gen,do_call,4,[{file,"gen.erl"},{line,167}]},
                  {gen_event,rpc,2,[{file,"gen_event.erl"},{line,239}]},
                  {lager_forwarder_backend,handle_event,2,
                   [{file,"src/lager_forwarder_backend.erl"},{line,109}]},
                  {gen_event,server_update,4,[{file,"gen_event.erl"},{line,577}]},
                  {gen_event,server_notify,4,[{file,"gen_event.erl"},{line,559}]},
                  {gen_event,handle_msg,6,[{file,"gen_event.erl"},{line,303}]},
                  {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}]},
   - By looking at the processes monitored by the RabbitMQ gen_event, I
   find that it's blocked on lager_event (within lager), which has an empty
   mailbox and is just patiently waiting for a message:
     {location,[{initial_call,{proc_lib,init_p,5}},
                {current_stacktrace,
                 [{gen_event,fetch_msg,6,[{file,"gen_event.erl"},{line,280}]},
                  {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}]},
   ...
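
For reference, the kind of shell calls used to gather the data above looks
roughly like this (recon's proc_count is exactly what the output shows; the
process_info items are standard ones, and the pid is the one recon reported):

    %% Top processes by mailbox size (needs the recon library).
    recon:proc_count(message_queue_len, 10).

    %% Then inspect the blocked gen_event and the process it waits on.
    Pid = pid(0,127,0).
    erlang:process_info(Pid, [registered_name, message_queue_len,
                              current_stacktrace, monitors]).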


So clearly, the sync_notify call was made (hence the gen_event:rpc() call
in the stacktrace), but it is not being serviced on the receiving end
(which is waiting to receive something else). The infinity timeout leaves
us in a locked pattern with an ever-growing queue.

We have managed to reproduce this twice after hours of work, but currently
don't have any live instances suffering from the issue.

The two theories I have for now are that either:

   1. the reply was received but not processed (the bug at
   https://bugs.erlang.org/browse/ERL-1076 could be at play, but the bug
   report's optimization requires a conditional that doesn't match the
   format of gen:call() here; see the call sketch after this list)
   2. the reply was "sent" but never actually enqueued in the destination
   process
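
For context on theory 1: the blocking side of that call is essentially a
monitor plus a selective receive with no after clause. Roughly (this is a
simplified paraphrase of gen:do_call/4 with an illustrative message format,
not the actual OTP source):

    %% Simplified paraphrase of the blocking pattern in gen:do_call/4.
    call(Process, Label, Request) ->
        Mref = erlang:monitor(process, Process),
        Process ! {Label, {self(), Mref}, Request},
        receive
            {Mref, Reply} ->
                erlang:demonitor(Mref, [flush]),
                {ok, Reply};
            {'DOWN', Mref, _, _, Reason} ->
                exit(Reason)
        %% no 'after' clause: the infinity timeout means waiting forever
        end.

Either a reply matching {Mref, Reply} never lands in the mailbox (theory 2),
or it lands but the receive somehow never matches it (theory 1).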

I am reminded of a problem we had on some older Erlang instances at an
older job where a node would suddenly lock up entirely; the pattern we'd
see was the same, but involving the group (IO) processes, with io:put_chars
never receiving an IO protocol confirmation that the data had left the
system. The process was stuck waiting for a response that the other process
had either already sent or never sent (it too was in a waiting state, stuck
in a receive according to its stacktrace).

At this point I'm not quite sure where to look. It seems that under some
very rare scenarios, message passing isn't quite behaving right. A timeout
in the code would hide that fact and let the system eventually recover, but
the specific pattern at hand in gen_event.erl does not allow for it.
Nothing in the Erlang logic on either side seems to be able to properly
explain the issue.
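
Purely as an illustration of what such a timeout could look like (this is a
hypothetical wrapper I'm making up for the example, not an API offered by
lager or gen_event), one could bound the wait by doing the synchronous
notification from a monitored middleman process:

    %% Hypothetical sketch only: give up waiting for the synchronous ack
    %% after Timeout milliseconds.
    sync_notify_with_timeout(Manager, Event, Timeout) ->
        {Pid, Ref} = spawn_monitor(fun() ->
            gen_event:sync_notify(Manager, Event)
        end),
        receive
            {'DOWN', Ref, process, Pid, normal} -> ok;
            {'DOWN', Ref, process, Pid, Reason} -> {error, Reason}
        after Timeout ->
            %% Stop waiting; the event itself was already sent and may
            %% still be handled by the manager eventually.
            erlang:demonitor(Ref, [flush]),
            exit(Pid, kill),
            {error, timeout}
        end.

Of course that only hides the symptom and throws away the backpressure that
sync_notify exists to provide, which is why I'm more interested in why the
reply goes missing in the first place.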

I'm also nervous to call out the VM's message enqueuing (or dequeuing)
mechanism as possibly responsible for this bug, but maybe someone else has
a different reading from mine?

Regards,
Fred.