[erlang-questions] Measuring message queue delay

Thu Apr 30 23:27:17 CEST 2015

Although I agree that the "normal message triggers a marker" algorithm
solves the problem perfectly, I would also like to add a method to the
table that could give a reasonably good approximation of the message queue
delay without sending any marker messages. Just for the fun of it.

Once again, our process has two states: doing a measurement and not doing a
measurement. When it receives a message while not doing a measurement it
shall:

   - Save the current length of its message queue (N)
   - Save the current time (T1)
   - Switch to doing a measurement state

While doing a measurement simply count the number of messages received
until we reach N messages. At this point save the current time (T2), and
return to not doing a measurement mode.

T2 - T1 is a *lower limit* on the message queue delay (for this N-th
message), because the message we are processing at T2 was already in the
queue at T1. The actual delay may be somewhat larger, since the message
entered the queue at some T0 < T1 time.

We can also get an *upper limit* on the message queue delay if we'd check
the time (T3) at the processing of the N+1:th message: message N+1 was not
yet in the message queue at T1 and got out of the queue at T3, so spent at
most T3 - T1 time waiting.

One more thing to note: T3 = T2 + the time spent processing the N-th
message + the time spent between handling two messages (which is:
gen_server overhead, waiting for the next message to arrive and waiting for
the CPU to run). Since gen_server overhead is low and we can assume there
was no time spent waiting for the next message (that would mean the queue
is empty, and the delay is 0), we can approximate T3 as T2 + time spent
processing the N-th message (the error is the time spent waiting for the
CPU).

So when we finish processing the N-th message we will have a definite lower
limit and a quite accurate upper limit for the message queue delay. They
are not measured for the same message though, but as long as there are no
sudden jumps or drops in the message queue delay it doesn't really matter.

It's up to you whether you want to report both the upper and lower limits
as metrics or calculate a single value from them (e.g. take either of them,
calculate the average or whatever).

In case of gen_servers one possible caveat is that there may be system
messages in the queue as well which are not delivered to your callback
module. This means your message counting will be off by one. But system
messages are extremely rare and unless the process is otherwise idle they
won't cause a significant measurement error.

Cheers,
Daniel

2015-04-29 17:50 GMT+02:00 Jesper Louis Andersen <
jesper.louis.andersen@REDACTED>:

>
> On Wed, Apr 29, 2015 at 11:12 AM, Roger Lipscombe <roger@REDACTED>
> wrote:
>
>> I thought of posting a message to self(), with a timestamp and then
>> measuring the delta. I could then post another one as soon as that
>> message is processed. Obviously, this leaves the process continually
>> looping, which is not good.
>>
>
> I don't think this should be thrown out. It is salvagable, I think, and
> here is how:
>
> We update our gen_server with a new element in the state record:
>
> -record(state, { ..., marker_active = false }).
>
> Whenever we get a normal message, we check this state:
>
> handle_call(Msg, From, #state { marker_active = false } = State) ->
>     self() ! {marker, erlang:monotonic_time()},
>     handle_call(Msg, From, State#state { marker_active = true });
> ....
>
> Of course handle_cast/2 and handle_info/2 needs the same treatment. This
> means whenever there is flow in the mailbox, we enable a marker with a
> timestamp.
>
> Now, we handle the marker:
>
> handle_info({marker, Old}, State) ->
>     Diff = erlang:monotonic_time() - Old,
>     report_to_hdrhistogram(erlang:convert_time_unit(Diff, native,
> milli_seconds)),
>     {noreply, State#state { marker_active = false }}
> ...
>
> This means once we get a marker, we process its queue sojourn latency, but
> we avoid reinjecting a new marker. The next processed message will do that.
>
> In turn, if you have 1 message going in, there will be exactly one marker.
> But if other processes manages to enter 100 messages between a marker and
> the next one, then we have one marker per 100 messages. The process will
> never spin if there is nothing to process. An alternative is to add some
> sampling interval delay:
>
> handle_info({marker, Old}, State) ->
>     Diff = erlang:monotonic_time() - Old,
>     report_to_hdrhistogram(erlang:convert_time_unit(Diff, native,
> milli_seconds)),
>     timer:send_after(50, self(), rearm_marker),
>     {noreply, State};
> handle_info(rearm_marker, State) ->
>     {noreply, State#state { marker_active = false }};
> ...
>
> which closes off the problem where you have markers all over the place all
> the time.
>
> The essential idea is to use the arrival of a "normal" message as a
> trigger for invoking a time sojourn marker. And never have the marker
> itself reinject a marker so it can't cycle. In turn, there has to be a flux
> through your mailbox queue for the markers to get added, and if there is no
> flux, no marker will ever get added. The really astute reader will prove my
> ramblings :)
>
> --
> J.
>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20150430/0643464e/attachment.htm>