[erlang-questions] Measuring message queue delay

Geoff Cant nem@REDACTED
Fri May 1 00:09:13 CEST 2015


While I don’t have a canned solution for ‘always on’ monitoring of message queue delay, Erlang’s trace functionality would let you accurately determine the ‘receive -> handle’ message queue delay. It’s one of the most accurate ways to measure it, because the Erlang VM itself timestamps the trace messages (you don’t have to worry about jitter/inaccuracy introduced by capturing the timestamp in Erlang code).
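For example, here’s a rough sketch of turning that tracing on (your_module and the function/variable names are just placeholders for whatever your system uses):

    %% Sketch: enable 'receive' and call tracing, with VM-side timestamps,
    %% for a single gen_server. Because the timestamp flag is set, trace
    %% messages arrive at TracerPid as trace_ts tuples.
    start_tracing(ServerPid, TracerPid) ->
        erlang:trace(ServerPid, true,
                     ['receive', call, timestamp, {tracer, TracerPid}]),
        %% Only generate call trace messages for the callback we care about.
        erlang:trace_pattern({your_module, handle_call, 3}, true, [local]).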

You’d need to build a process that handles trace messages like these (with the timestamp flag set they arrive tagged trace_ts, with the timestamp as the last element; a sketch of such a loop follows below):
* {trace_ts, Pid, 'receive', {'$gen_call', {_Sender, Ref}, _Call}, Timestamp} (t0 for a given gen_server:call(Pid, _Call) from Sender, timestamped by the VM as the request enters Pid’s message queue)
* {trace_ts, Pid, call, {your_module, handle_call, [_Call, {_Sender, Ref}, _State]}, Timestamp} (t1, timestamped as gen_server dispatches that same request to your handle_call/3 callback)
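The receive loop of that process could look roughly like this (record_enqueue and record_dequeue are made-up helper names; one possible implementation is sketched after the next paragraph):

    %% Sketch of the tracer's receive loop. The two clauses match the
    %% trace messages described above; anything else is ignored.
    loop(Table) ->
        receive
            {trace_ts, Pid, 'receive', {'$gen_call', {_Sender, Ref}, _Call}, T0} ->
                record_enqueue(Table, Pid, Ref, T0);
            {trace_ts, Pid, call,
             {your_module, handle_call, [_Call, {_Sender, Ref}, _State]}, T1} ->
                record_dequeue(Table, Pid, Ref, T1);
            _Other ->
                ok
        end,
        loop(Table).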

You’d create an ETS table and insert {{Pid, Ref}, Timestamp} when you see the 'receive' trace message, then read it back out when you see the 'call' trace message for the same {Pid, Ref}. The Pid and Ref are identical across the two trace messages, which lets you match up message-queue inserts (the 'receive') with removes (gen_server invoking your callback). You can extend this scheme to casts and plain messages, but matching the receive to the callback is harder there.
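One possible implementation of that match-up, continuing the loop sketch above (the table and the io:format reporting are stand-ins for whatever metrics sink you actually use):

    %% The timestamp flag produces erlang:now()-style {MegaSecs, Secs, MicroSecs}
    %% tuples, so timer:now_diff/2 gives the queue delay in microseconds.
    init_table() ->
        ets:new(queue_delay, [set]).

    record_enqueue(Table, Pid, Ref, T0) ->
        ets:insert(Table, {{Pid, Ref}, T0}).

    record_dequeue(Table, Pid, Ref, T1) ->
        case ets:lookup(Table, {Pid, Ref}) of
            [{_, T0}] ->
                ets:delete(Table, {Pid, Ref}),
                Delay = timer:now_diff(T1, T0),
                io:format("queue delay for ~p: ~p us~n", [Pid, Delay]);
            [] ->
                %% 'receive' happened before tracing started, or was filtered out
                ok
        end.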

If you’re not monitoring too many processes this way, I’d be comfortable running this long-term on a production node. The process receiving trace messages gets one message for every message sent to the traced gen_server, and one for every call into your callback module’s handle_call function. With the right match spec on 'call' you can scope this tracing down to a very reasonable volume of trace messages.
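As an illustration (not something I’m running in production), a match spec that only generates call trace messages when handle_call/3 is invoked with one specific, hypothetical request term might look like:

    %% '_' matches anything; only the request position is constrained.
    MatchSpec = [{[expensive_request, '_', '_'], [], []}],
    erlang:trace_pattern({your_module, handle_call, 3}, MatchSpec, [local]).

Note this only filters the 'call' side; 'receive' tracing still fires for every message the gen_server gets, so you’d want to sweep unmatched ETS entries out periodically.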


I don’t have a neat example of using the trace BIFs to do exactly what you want, but I do have a snippet of Erlang shell code that shows a similar trace processor for function call/return latency: https://github.com/archaelus/eshellcode/blob/master/dbg/ets_time_tracer.erl

Cheers,
-Geoff

> On Apr 29, 2015, at 2:12 AM, Roger Lipscombe <roger@REDACTED> wrote:
> 
> For various reasons, I want a metric that measures how long messages
> spend in a process message queue. The process is a gen_server, if that
> has any bearing on the solution. Also, this is for production, so it
> will be always-on and must be low-impact.
> 
> I could do this by timestamping _every_ message sent to the process
> and then reporting the deltas, but I think that's ugly.
> 
> I thought of posting a message to self(), with a timestamp and then
> measuring the delta. I could then post another one as soon as that
> message is processed. Obviously, this leaves the process continually
> looping, which is not good.
> 
> So, instead, I could use erlang:send_after, but that requires two
> messages: a delayed one triggered by erlang:send_after, and an
> immediate one for measuring the delay.
> 
> That's a bit complicated.
> 
> Would it be sensible to send a message, with erlang:send_after, with
> the _expected_ timestamp, and then compute the delta from that?
> 
> Or, alternatively, what other ways could I measure how long a process
> is taking to handle its message queue?
> 
> Thanks,
> Roger.
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions



