[erlang-questions] Measuring message queue delay

Wed Apr 29 15:19:35 CEST 2015

On 04/29, Roger Lipscombe wrote:
>For various reasons, I want a metric that measures how long messages
>spend in a process message queue. The process is a gen_server, if that
>has any bearing on the solution. Also, this is for production, so it
>will be always-on and must be low-impact.
>

There's a possibility you can get the value in a roundabout way. Say, for 
example, you need an acknowledgement to know when a task is done. You 
could then take a local timestamp when you send the task and keep it in 
memory.

When the server fetches the message, it takes a stamp, and when it's 
done with it, it takes another, tells you how long (in ms or µs) it has 
worked on it, and returns that with the final ack. When the caller 
receives the ack, the delta between the reception and sending times, 
minus the reported work time, is the time spent in transit.

    Sender                      Server
      |                            |
     Ts1 ----------------------->  |
      |                           Tr1
      |                            |
      |  <----------------------- Tr2
     Ts2                           |
      |                            |

    Transit = Ts2 - Ts1 - (Tr2 - Tr1)
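
A minimal sketch of that exchange, using plain processes rather than a 
gen_server to keep it short. The module name, message shapes and the 
5 second timeout are made up for illustration, and it assumes an OTP 
release where erlang:monotonic_time/1 accepts the `microsecond` unit:

```erlang
-module(transit_probe).
-export([start_server/0, timed_request/2]).

%% Hypothetical server: stamps Tr1 when it picks the message up and
%% Tr2 when the work is done, then returns the work time with the ack.
start_server() ->
    spawn(fun server_loop/0).

server_loop() ->
    receive
        {task, From, Ref, Fun} ->
            Tr1 = erlang:monotonic_time(microsecond),
            Result = Fun(),
            Tr2 = erlang:monotonic_time(microsecond),
            From ! {ack, Ref, Result, Tr2 - Tr1},
            server_loop();
        stop ->
            ok
    end.

%% Sender side: Transit = Ts2 - Ts1 - (Tr2 - Tr1), in microseconds.
%% Covers both directions: request queueing plus the reply's trip back.
timed_request(Server, Fun) ->
    Ref = make_ref(),
    Ts1 = erlang:monotonic_time(microsecond),
    Server ! {task, self(), Ref, Fun},
    receive
        {ack, Ref, Result, WorkTime} ->
            Ts2 = erlang:monotonic_time(microsecond),
            {Result, Ts2 - Ts1 - WorkTime}
    after 5000 ->
        {error, timeout}
    end.
```

Something like transit_probe:timed_request(Server, fun() -> do_work() end) 
would then give back {Result, TransitMicros}, where do_work/0 stands in 
for whatever the task really is.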

Now this does mean you have two queues to worry about, but by adding 
this diagnostic message to a bunch of servers (and it can be as simple 
as a ping you assume takes 0ms to work on), you're able to report on the 
time overhead you're spending shuttling data around. If you keep the 
sender's queue empty, you can likely handwave its overhead away, giving 
you a ping message a bit like this:

    Sender                      Server
      |                            |
     Ts1 ----------------------->  |
      |                            |
      |                            |
      |  <----------------------- Pong
     Ts2                           |
      |                            |

    Transit = Ts2 - Ts1 - (0)
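
For a gen_server, the ping variant could look roughly like the sketch 
below. The module name and the `ping`/`pong` terms are assumptions, not 
anything your server already has; in a real server you would only add 
the handle_call clause and the client function:

```erlang
-module(ping_probe).
-behaviour(gen_server).
-export([start_link/0, ping_transit/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Transit = Ts2 - Ts1 - (0), in microseconds. The server is assumed
%% to spend no measurable time answering the ping.
ping_transit(Server) ->
    Ts1 = erlang:monotonic_time(microsecond),
    pong = gen_server:call(Server, ping),
    Ts2 = erlang:monotonic_time(microsecond),
    Ts2 - Ts1.

init([]) ->
    {ok, nostate}.

%% The ping waits in the queue behind real work, so the round trip
%% reflects how long messages currently sit there.
handle_call(ping, _From, State) ->
    {reply, pong, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

Note that gen_server:call/2 has a default timeout of 5 seconds, so a 
badly backed-up server will make the probe exit rather than return a 
number; you may want to catch that and report it as its own signal.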

The caveat is of course that if the Sender is scheduled out because the 
system is overloaded, it will appear that the message was in flight 
longer than it truly was. On the other hand, this reduces the 
load on your server process because it has to do a lot less work to 
handle the metrics calls.

This does mean that if you do want to enable it permanently for 30k 
processes, you have to understand that no matter what you do, you're 
gonna need two calls to a clock per message per process. If you ping 
each process once a minute, you're gonna end up with 60k clock calls 
per minute, or 1,000 per second averaged over it all. In practice it 
might be more intense than that. That might be tolerable, or it might 
not be; it really depends on your needs.
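
The arithmetic behind that figure, as a throwaway helper (module and 
function names are made up) you can use to plug in your own numbers:

```erlang
-module(clock_budget).
-export([calls_per_second/3]).

%% Average clock calls per second for a periodic probe:
%% processes * clock calls per probe / seconds between probes.
calls_per_second(Processes, CallsPerProbe, IntervalSeconds) ->
    Processes * CallsPerProbe / IntervalSeconds.
```

With the numbers above, clock_budget:calls_per_second(30000, 2, 60) 
gives 1000.0.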

You can try it for a while, but what you find out will depend on 
your load; if most of your messages require a similar workload, 
you'll soon find out that the queue length is an adequate proxy for time 
spent waiting.
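
Sampling the queue length needs no cooperation from the server at all. 
A small sketch (the module name is hypothetical; process_info/2 with 
message_queue_len is the standard call, and it only works for local 
pids):

```erlang
-module(queue_len).
-export([sample/1]).

%% Returns the current message queue length of a local process,
%% or 'undefined' if the process is no longer alive.
sample(Pid) ->
    case erlang:process_info(Pid, message_queue_len) of
        {message_queue_len, Len} -> Len;
        undefined -> undefined
    end.
```

Calling this from a periodic poller gives you one cheap number per 
process, with no clock calls on the hot path.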

If a message's size or complexity means the server blocks for a long 
time without servicing the queue, then counting *these* messages 
in the system will likely be a lighter proxy that is almost as accurate.
Eventually you may find bottlenecks on which multiple processes wait, 
and just looking at these bottleneck processes will work instead of 
monitoring everything in your system; better even, you'll be able to 
identify where to shed load or apply backpressure to keep the rest of 
the system healthy.

Regards,
Fred.