[erlang-questions] Diagnosing gen_server call timeouts

Siraaj Khandkar siraaj@REDACTED
Thu Sep 19 20:42:11 CEST 2019


On Thu, Sep 19, 2019 at 1:33 PM Roger Lipscombe <roger@REDACTED> wrote:
>
> I've got a gen_server:call that -- very occasionally -- suffers from a
> timeout. Obviously this means that my gen_server is already busy doing
> something else.
>
> Is there a way that I can instrument gen_server so that it will log
> something if it takes too long to return from a callback? That is: my
> handle_call is timing out because, presumably, there's another
> handle_call (or handle_info, etc.) that's blocking. I'd like *that* to
> be logged if it takes longer than, say, 200 milliseconds to return.
>
> Or, I guess, if the message queue length is excessive at any point.
>
> Caveat: I've got ~20,000 of these gen_server processes, and this only
> happens intermittently in production, so I'm thinking that tracing
> *isn't* the right answer here.

Idea from an ignorant (of built-in features) point of view: the call handler
begins by sending `{call_started, ID, Other_Interesting_Info}` to a
bookkeeping process, then, just before replying, sends `{call_completed, ID}`.
The bookkeeper thus always knows the calls currently in progress and the
stats for completed ones, so it can log whichever ones violated your
ostensible SLA. This sounds awfully similar to just posting timer
start/end messages to StatsD or some such, but is more flexible.
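
A rough sketch of what such a bookkeeper might look like, in case it helps.
Everything here is an assumption made for illustration: the module name
`call_bookkeeper`, its function names, the 200 ms threshold passed to
`start_link/1`, and the use of `logger` (OTP 21+) for output. Timestamps are
taken on the caller's side so that a backlog in the bookkeeper's own mailbox
does not inflate the measured durations.

```erlang
%% Hypothetical bookkeeper process; a sketch, not a tested production module.
-module(call_bookkeeper).
-behaviour(gen_server).

-export([start_link/1, call_started/2, call_completed/1, in_progress/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% ThresholdMs: durations above this many milliseconds get logged.
start_link(ThresholdMs) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, ThresholdMs, []).

%% Both notifications are casts, so the instrumented server never blocks on us.
call_started(Id, Info) ->
    Now = erlang:monotonic_time(millisecond),
    gen_server:cast(?MODULE, {call_started, Id, Info, Now}).

call_completed(Id) ->
    Now = erlang:monotonic_time(millisecond),
    gen_server:cast(?MODULE, {call_completed, Id, Now}).

%% IDs of calls that have started but not yet completed.
in_progress() ->
    gen_server:call(?MODULE, in_progress).

init(ThresholdMs) ->
    {ok, #{threshold => ThresholdMs, calls => #{}}}.

handle_call(in_progress, _From, #{calls := Calls} = State) ->
    {reply, maps:keys(Calls), State}.

handle_cast({call_started, Id, Info, T0}, #{calls := Calls} = State) ->
    {noreply, State#{calls := Calls#{Id => {Info, T0}}}};
handle_cast({call_completed, Id, T1}, #{threshold := Max, calls := Calls} = State) ->
    case maps:take(Id, Calls) of
        {{Info, T0}, Rest} ->
            Elapsed = T1 - T0,
            case Elapsed > Max of
                true ->
                    logger:warning("slow call ~p (~p): ~p ms", [Id, Info, Elapsed]);
                false ->
                    ok
            end,
            {noreply, State#{calls := Rest}};
        error ->
            %% Completion for an ID we never saw start; ignore it.
            {noreply, State}
    end.

handle_info(_Msg, State) ->
    {noreply, State}.
```

In the instrumented server, the callback would then do something like
`Id = {self(), make_ref()}`, call `call_bookkeeper:call_started(Id, Request)`
on entry and `call_bookkeeper:call_completed(Id)` just before returning its
`{reply, ...}` tuple. A call that never completes stays visible through
`in_progress/0`. With ~20,000 servers reporting to one bookkeeper, it may
itself become a bottleneck, so sharding or sampling might be needed.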


