[erlang-questions] Exometer, Recon and prim_inet:send/2?

Mon Oct 12 11:08:03 CEST 2015

All,
This is an operational question, i.e. how to pinpoint the cause of alerts and the like.

I monitor process message queues and get the longest one with Recon, using:
recon:proc_count(message_queue_len, 1).
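For readers without Recon at hand, here is a rough, dependency-free sketch of the same kind of check, using only erlang:processes/0 and erlang:process_info/2. The module name queue_watch, the functions longest_queue/0 and check/1, and the threshold argument are all hypothetical illustrations, not part of Recon or exometer. The sketch scans every process, so like recon:proc_count/2 it is meant for occasional polling, not tight loops.

```erlang
-module(queue_watch).
%% Hypothetical module: a stdlib-only approximation of
%% recon:proc_count(message_queue_len, 1) plus a threshold check.
-export([longest_queue/0, check/1]).

%% Returns {Pid, QueueLen, RegisteredNameOrUndefined} for the process
%% with the longest message queue at the time of the scan.
longest_queue() ->
    %% Processes that die during the scan make process_info/2 return
    %% 'undefined'; the generator pattern below silently skips them.
    Samples = [{P, Len, registered_name(P)}
               || P <- erlang:processes(),
                  {message_queue_len, Len} <-
                      [erlang:process_info(P, message_queue_len)]],
    %% The calling process is alive, so Samples is never empty.
    [First | Rest] = Samples,
    lists:foldl(fun({_, L, _} = S, {_, Best, _}) when L > Best -> S;
                   (_, Acc) -> Acc
                end, First, Rest).

%% Returns {alert, Pid, Len, Name, CurrentFunction} when the longest
%% queue exceeds Threshold, otherwise {ok, Len}. CurrentFunction is
%% what Recon shows as {current_function, MFA}, e.g. a reporter
%% blocked in prim_inet:send/3.
check(Threshold) when is_integer(Threshold), Threshold >= 0 ->
    case longest_queue() of
        {Pid, Len, Name} when Len > Threshold ->
            {alert, Pid, Len, Name,
             erlang:process_info(Pid, current_function)};
        {_Pid, Len, _Name} ->
            {ok, Len}
    end.

%% process_info(P, registered_name) returns [] for unregistered
%% processes and 'undefined' for dead ones; map both to 'undefined'.
registered_name(P) ->
    case erlang:process_info(P, registered_name) of
        {registered_name, Name} -> Name;
        _ -> undefined
    end.
```

Calling queue_watch:check(1000) from a periodic timer would raise an alert, and report the current function, whenever any mailbox grows past 1000 messages.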

I'm using exometer_core to send data to HostedGraphite.

Last night I received errors stating that the queue length exceeded my alert
threshold, and this is the process info that Recon gives me:
[exometer_report_graphite,{current_function,{prim_inet,send,3}},{initial_call,{proc_lib,init_p,3}}]

I got reports every 5 seconds (the reporting interval exometer uses) from
around 00:30 until 08:20, with the message queue growing from a few hundred
to about 6,000. After 08:20, the queue dropped back to 0.

I can see that my server was up during the same period (no other errors,
and all functionality and pings were working). So it looks to me as if
exometer was unable to send data to the HostedGraphite servers for the
whole period 00:30 - 08:20, and that once it could send again it drained
the queue, which then went back to normal.

However, HostedGraphite only shows a gap in the data from around 07:40
until 08:20, not over the whole period during which the queue kept growing.

So my question is: does exometer_core keep some kind of local cache, i.e.
if it is unable to send data, will it retry? And if so, is this cache
bounded? Otherwise I don't see what could be causing the message queue of
exometer_report_graphite to grow. But if it does retry, why am I seeing a
gap in the data?

Thank you,
r.