[erlang-questions] Measuring message queue delay
Fred Hebert
mononcqc@REDACTED
Wed Apr 29 15:19:35 CEST 2015
On 04/29, Roger Lipscombe wrote:
>For various reasons, I want a metric that measures how long messages
>spend in a process message queue. The process is a gen_server, if that
>has any bearing on the solution. Also, this is for production, so it
>will be always-on and must be low-impact.
>
There's a possibility you can get the value in a roundabout way. Say for
example you need an acknowledgement to know when a task is done. You
could then take a local timestamp when you send the task and keep it in
memory.
When the server fetches the message, it takes a stamp, and when it's
done with it, it takes another, tells you how long (in ms or µs) it has
worked on it, and returns that with the final ack. When the caller
receives it, the delta between the reception and sending times, minus
the time the server reports having worked, is the time spent in transit.
    Sender                      Server
      |                           |
     Ts1 -----------------------> |
      |                          Tr1
      |                           |
      | <----------------------- Tr2
     Ts2                          |
      |                           |

    Transit = Ts2 - Ts1 - (Tr2 - Tr1)
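
A minimal sketch of the exchange above, using a bare process rather than
a gen_server to keep it short. The module name, the message shapes
({task, ...} and {ack, ...}) and the 5 second timeout are all made up
for illustration; it also assumes an OTP release recent enough to
accept the `microsecond` unit in erlang:monotonic_time/1.

```erlang
-module(transit_probe).
-export([start/0, timed_call/2, transit/3]).

%% Hypothetical server: runs the fun it is given and reports how long
%% the work took, in microseconds, along with the ack.
start() ->
    spawn(fun loop/0).

loop() ->
    receive
        {task, From, Ref, Fun} ->
            Tr1 = erlang:monotonic_time(microsecond), % message fetched
            Result = Fun(),
            Tr2 = erlang:monotonic_time(microsecond), % work done
            From ! {ack, Ref, Result, Tr2 - Tr1},
            loop();
        stop ->
            ok
    end.

%% Caller side: stamp before sending (Ts1) and after the ack (Ts2).
timed_call(Server, Fun) ->
    Ref = make_ref(),
    Ts1 = erlang:monotonic_time(microsecond),
    Server ! {task, self(), Ref, Fun},
    receive
        {ack, Ref, Result, Worked} ->
            Ts2 = erlang:monotonic_time(microsecond),
            {Result, transit(Ts1, Ts2, Worked)}
    after 5000 ->
        exit(timeout)
    end.

%% Transit = Ts2 - Ts1 - (Tr2 - Tr1), with Worked = Tr2 - Tr1.
transit(Ts1, Ts2, Worked) ->
    Ts2 - Ts1 - Worked.
```

Calling transit_probe:timed_call(Pid, fun() -> ok end) returns the
result together with the transit time in microseconds. Note that the
figure covers both directions: time in the server's queue and time the
ack spent in the caller's queue.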
Now this does mean you have two queues to worry about, but by adding
this diagnostic message to a bunch of servers (and it can be as simple
as a ping you assume takes 0ms to work on), you're able to report on the
time overhead you're spending shuttling data around. If you keep the
sender's queue empty, you can likely handwave its overhead away, giving
you a ping message a bit like this:
    Sender                      Server
      |                           |
     Ts1 -----------------------> |
      |                           |
      |                           |
      | <----------------------- Pong
     Ts2                          |
      |                           |

    Transit = Ts2 - Ts1 - (0)
The caveat is of course that if the Sender is scheduled out because the
system is overloaded, the message will appear to have been in flight
longer than it really was. On the other hand, this reduces the
load on your server process because it has to do a lot less work to
handle the metrics calls.
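
Since the process in question is a gen_server, the ping variant could
look roughly like the sketch below. The module name and the ping/pong
atoms are invented for the example, and a real server would add the
ping clause to its existing handle_call/3 rather than live in a module
of its own. It relies on the default 5 second gen_server:call/2 timeout
and on an OTP release where handle_info/2 is optional.

```erlang
-module(ping_probe).
-behaviour(gen_server).
-export([start_link/0, ping_transit/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Round trip in microseconds. The server's work is assumed to be 0,
%% so Transit = Ts2 - Ts1. Any time the caller spends scheduled out
%% is counted as transit too.
ping_transit(Server) ->
    Ts1 = erlang:monotonic_time(microsecond),
    pong = gen_server:call(Server, ping),
    Ts2 = erlang:monotonic_time(microsecond),
    Ts2 - Ts1.

init([]) ->
    {ok, #{}}.

%% The ping waits in the queue behind everything already there, which
%% is what makes the round trip reflect queueing delay.
handle_call(ping, _From, State) ->
    {reply, pong, State};
handle_call(_Other, _From, State) ->
    {reply, {error, unknown_call}, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

The caller would typically be a dedicated process with an otherwise
empty mailbox, so that the return leg adds as little as possible.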
This does mean that if you do want to enable it permanently for 30k
processes, you have to understand that no matter what you do, you're
gonna need two calls to a clock per message per process. If you ping
each process once a minute, that's 30k x 2 = 60k clock calls per
minute, or 1,000 per second on average. In practice it might be more
intense than that, since the calls won't be spread out evenly. That
might be tolerable, or it might not be. That really depends on your
needs there.
You can try it for a while, but what you may find out will depend on
your load; if most of your messages are requiring a similar workload,
you'll soon find out that the queue length is an adequate proxy for time
spent waiting.
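
Queue length is cheap to read with erlang:process_info/2. The sketch
below is only an illustration: the module and function names are made
up, and scanning every process with erlang:processes() has a cost of
its own on a node with many processes, so it would be sampled rather
than run in a tight loop.

```erlang
-module(queue_proxy).
-export([queue_len/1, longest_queues/1]).

%% Current mailbox length of a local process, or 'undefined' if the
%% process has already exited.
queue_len(Pid) ->
    case erlang:process_info(Pid, message_queue_len) of
        {message_queue_len, Len} -> Len;
        undefined -> undefined
    end.

%% The N local processes with the longest mailboxes, longest first,
%% as {Len, Pid} pairs.
longest_queues(N) ->
    Lens = [{Len, Pid} || Pid <- erlang:processes(),
                          Len <- [queue_len(Pid)],
                          is_integer(Len)],
    lists:sublist(lists:reverse(lists:sort(Lens)), N).
```

The values are snapshots, so they are best tracked over time; a queue
that keeps growing says more than any single reading.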
If instead it's the size or some complexity of certain messages that
makes the server block for a long time without servicing the queue,
then counting *these* messages in the system will likely be a lighter
proxy that is almost as accurate.
Eventually you may find bottlenecks on which multiple processes wait,
and just looking at these bottleneck processes will work instead of
monitoring everything in your system; better even, you'll be able to
identify where to shed load or apply backpressure to keep the rest of
the system healthy.
Regards,
Fred.