Logger under load

Vladislav Titov me@REDACTED
Fri Aug 7 22:33:16 CEST 2020


Another approach is to use a configured rate limit:

1. There is a token bucket stored in ETS.
2. The handler process periodically refills it. It does a send_after
to itself to schedule these refills.
3. If the handler's queue does build up, the refill message would get
delayed, which reduces the refill rate. (I *think*!)
4. Logging processes have to obtain a token to actually send the
message (with update_counter).
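A minimal sketch of steps 1-4, assuming a gen_server-style handler process; the module name, table name, and refill constants are my own illustration, not the poster's actual code:

```erlang
-module(log_bucket).
-export([init/1, handle_info/2, try_take/0]).

-define(TAB, log_rate_bucket).
-define(REFILL_MS, 100).
-define(TOKENS_PER_REFILL, 1000).
-define(MAX_TOKENS, 5000).

%% Called once by the handler process, which owns the table.
init(_Args) ->
    ets:new(?TAB, [named_table, public, {write_concurrency, true}]),
    ets:insert(?TAB, {tokens, ?MAX_TOKENS}),
    erlang:send_after(?REFILL_MS, self(), refill),
    {ok, #{}}.

%% The refill runs in the handler process; if its queue backs up,
%% this message is delivered late and the effective refill rate drops.
handle_info(refill, State) ->
    %% {Pos, Incr, Threshold, SetValue}: add tokens, clamping at capacity.
    ets:update_counter(?TAB, tokens,
                       {2, ?TOKENS_PER_REFILL, ?MAX_TOKENS, ?MAX_TOKENS}),
    erlang:send_after(?REFILL_MS, self(), refill),
    {noreply, State}.

%% Called by every logging process before it sends a log event.
try_take() ->
    %% Decrement, but never let the counter fall below -1; a result
    %% >= 0 means we got a token, -1 means the bucket is empty.
    ets:update_counter(?TAB, tokens, {2, -1, -1, -1}) >= 0.
```

The `update_counter` threshold form does the clamping atomically, so neither the refill nor the takers need any locking beyond what ETS provides.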

Our log events go into downstream logging infrastructure via syslog,
so this has the very important advantage of also protecting that
infrastructure from being flooded.
Admittedly this does create lock contention on ETS for getting tokens
out of the bucket, but that seems to be OK for low tens of thousands
of messages per second.

If the rate limit is hit, the log message is inserted (via insert_new)
into an ETS table keyed by {module, severity, rand() rem 10}. The
logging system periodically empties (and logs) the contents of this
table.
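A sketch of that overflow table, under the same caveat that the names are mine; the random component of the key keeps at most ~10 sample messages per {Module, Severity} between flushes:

```erlang
-module(log_overflow).
-export([init/0, store_dropped/3, flush_dropped/1]).

-define(TAB, log_overflow_tab).

init() ->
    ets:new(?TAB, [named_table, public, {write_concurrency, true}]).

%% Called when try_take/0 fails. insert_new/2 keeps the first message
%% seen for each key; later drops with the same key are discarded.
store_dropped(Module, Severity, Msg) ->
    Key = {Module, Severity, rand:uniform(10)},
    ets:insert_new(?TAB, {Key, Msg}).

%% Called periodically (e.g. from the handler's refill tick) to drain
%% the table and hand the surviving samples to the real logger.
flush_dropped(LogFun) ->
    Dropped = ets:tab2list(?TAB),
    ets:delete_all_objects(?TAB),
    lists:foreach(fun({Key, Msg}) -> LogFun(Key, Msg) end, Dropped),
    length(Dropped).
```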

This way even if we're dropping logs (which happens when part of the
system breaks), infrequent messages from other parts of the system
still show up.

Our log macro also takes an optional deduplication term (usually the
module and some sort of rough context). The log producer checks ETS
for the deduplication term and drops the message if the same kind of
message was emitted recently. This prevents a flood of
almost-identical "user X's request has failed because timeout"
messages and the like.
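That check might look like the following; the table name and the 5-second suppression window are assumptions for illustration:

```erlang
-module(log_dedup).
-export([init/0, should_emit/1]).

-define(TAB, log_dedup_tab).
-define(WINDOW_MS, 5000).

init() ->
    ets:new(?TAB, [named_table, public, {write_concurrency, true}]).

%% DedupTerm is whatever the log macro was given,
%% e.g. {my_mod, request_timeout}.
should_emit(DedupTerm) ->
    Now = erlang:monotonic_time(millisecond),
    case ets:lookup(?TAB, DedupTerm) of
        [{_, Last}] when Now - Last < ?WINDOW_MS ->
            false;  %% same kind of message seen recently: drop it
        _ ->
            ets:insert(?TAB, {DedupTerm, Now}),
            true
    end.
```

Note the lookup-then-insert is not atomic, so two producers racing on the same term may both emit once; for deduplication that is harmless.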


On Fri, Aug 7, 2020 at 7:08 PM Michael Truog <mjtruog@REDACTED> wrote:
> On 8/7/20 5:25 AM, Michael Uvarov wrote:
> > Hi,
> >
> > Sync/async logging was a common issue with lager too. I highly suggest
> > not to use sync logging to avoid extra timeout issues which is pretty
> > hard to debug.
> >
> > Dropping the whole sinks (the whole pending logger message queue) was
> > an option for lager.
> >
> > Better option would be to count how many events are logged from each
> > location and start skipping log events from too chatty locations (and
> > log how many events were dropped accordingly). But I haven't seen such
> > "smart" code anywhere yet.
> That is a good situation to use the process dictionary for determining
> which Erlang processes are attempting to overload the logger.  Also, if
> the sync logger message uses an infinity timeout, the {Ref, _} message
> is not a problem.  The CloudI logger has used this approach to handle
> large amounts of load (
> https://github.com/CloudI/CloudI/blob/1128bba30404a3502a465ca636ba2ee02db544ce/src/lib/cloudi_core/src/cloudi_core_i_logger.erl#L1173
> ).
> Best Regards,
> Michael
