[erlang-questions] poll: monitoring performance of production erlang nodes

Sun Nov 2 22:18:56 CET 2008

We run a statistics server on every node; it collects statistics, each
associated with a key.  The key space is automatically partitioned among
the servers, so the load is split when we add more nodes.  Each key has an
rrd file, and we use an Erlang port to rrdtool to write the data; the rrds
are updated periodically (e.g. every minute).  Finally, we use yaws CGI to
serve drraw, a web-based graphical interface for rrdtool.  We use a
distributed filesystem so that any node can read any of the rrd files.
The access pattern is many readers, one writer, with the servers
negotiating who owns a particular key and thus its rrd file.  We wrote our
own simple distributed filesystem in Erlang because we're on EC2 and
wanted some operational simplicity in the face of dynamic node membership;
in a normal fixed-set-of-servers setup there are lots of choices for the
DFS bit.
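
To make the key partitioning concrete, here is a minimal sketch of one way
a key could be mapped to an owning node.  It is not the negotiation scheme
described above: the module name stat_owner and the plain hash-modulo rule
are assumptions chosen for brevity.

```erlang
-module(stat_owner).
-export([owner/2]).

%% Hypothetical helper: pick the node that owns Key from a non-empty
%% list of nodes.  Sorting and de-duplicating first makes the answer
%% independent of the order in which a node happens to list its peers.
-spec owner(term(), [node(), ...]) -> node().
owner(Key, Nodes) when Nodes =/= [] ->
    Sorted = lists:usort(Nodes),
    Index = erlang:phash2(Key, length(Sorted)),   % in 0..N-1
    lists:nth(Index + 1, Sorted).                 % lists:nth/2 is 1-based
```

Every node could call stat_owner:owner(Key, [node() | nodes()]) and reach
the same answer as long as they agree on membership.  Note that plain
modulo hashing moves most keys when a node joins or leaves, so with
dynamic membership something like consistent hashing would probably be a
better fit.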

When we want to know about something, we sprinkle calls in the code that
emit a statistic with a key; when new keys are encountered, new rrd files
are created.
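
As an illustration of what such a call might look like, here is a minimal
sketch of a collector.  The module name stat, the emit/2 and flush/0
functions, and the Sink callback are all assumptions; in a setup like the
one described, the sink would be the part that talks to rrdtool, which is
left abstract here.

```erlang
-module(stat).
-behaviour(gen_server).
-export([start_link/1, emit/2, flush/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Sink is a fun of one argument.  It receives {create, Key} the first
%% time a key is seen and {update, Key, Sum} for each key on flush.
start_link(Sink) when is_function(Sink, 1) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, Sink, []).

%% The call that gets sprinkled through application code.
emit(Key, Value) when is_number(Value) ->
    gen_server:cast(?MODULE, {emit, Key, Value}).

%% Push the accumulated sums to the sink and reset them to zero.
flush() ->
    gen_server:call(?MODULE, flush).

init(Sink) ->
    {ok, {Sink, #{}}}.

handle_cast({emit, Key, Value}, {Sink, Sums}) ->
    case maps:is_key(Key, Sums) of
        true  -> ok;
        false -> Sink({create, Key})      % new key: create its store
    end,
    Add = fun(Sum) -> Sum + Value end,
    {noreply, {Sink, maps:update_with(Key, Add, Value, Sums)}}.

handle_call(flush, _From, {Sink, Sums}) ->
    lists:foreach(fun({K, S}) -> Sink({update, K, S}) end,
                  maps:to_list(Sums)),
    %% Keep the keys so that {create, Key} is only sent once per key.
    {reply, ok, {Sink, maps:map(fun(_K, _S) -> 0 end, Sums)}}.

handle_info(_Msg, State) ->
    {noreply, State}.
```

Application code would then call something like
stat:emit({http, latency_ms}, 12).  The periodic update could be driven by
calling stat:flush() from a timer; the once-a-minute interval is taken
from the description above, the rest is illustrative.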

It mostly works great.  The one problem I have with the setup is that we
end up littering the code with statistics collection calls, which can
clutter the exposition.  With language support, one could envision an
aspect-oriented approach to statistics collection that would not clutter
the code, similar to how traces work.  (We could use traces now to collect
timing statistics if we were willing to introduce intermediate functions
around code regions of interest.)  Another possibility would be to define
a parse_transform that inserts the statistics collection calls, keeping
the code clean.  However, that could complicate code maintenance.  In the
end we are lazy, so we live with it.
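
Short of language support, one way to limit the clutter is to confine
timing to a single wrapper, so that a region of interest costs one call
rather than several lines of bookkeeping.  This is only a sketch of that
idea, not a trace-based or parse_transform solution; the module name
stat_timing and the Report callback are assumptions.

```erlang
-module(stat_timing).
-export([timed/3]).

%% Run Fun, pass its wall-clock duration in microseconds to Report
%% under Key, and return whatever Fun returned.
-spec timed(term(), fun(() -> T), fun((term(), integer()) -> any())) -> T.
timed(Key, Fun, Report) when is_function(Fun, 0), is_function(Report, 2) ->
    {Micros, Result} = timer:tc(Fun),
    Report(Key, Micros),
    Result.
```

A call site then reads roughly as
stat_timing:timed(db_lookup, fun() -> lookup(Id) end, Report), where
lookup/1 and Report stand in for application code and whatever
statistics call is in use.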

I would love to see more language support for monitoring.  Erlang is
already very operationally oriented and this seems like an area where
everybody is rolling their own and we could all benefit from a common
approach.

-- p

On Sun, 2 Nov 2008, Joel Reymont wrote:

> Forgive me for polling the community one last time...
>
> How do you folks with production Erlang systems monitor performance?
>
> I want to measure packet roundtrip to clients, memory use throughout
> the day, message queue lengths, etc.
>
> I have timings inserted in key places and a "statistics" server that
> collects the measurements in ETS. I can then ask for a data dump as
> CSV and analyze the data in a spreadsheet app or otherwise.
>
> How do YOU do it?
>
> 	Thanks, Joel
>
> --
> wagerlabs.com

In an artificial world, only extremists live naturally.

        -- Paul Graham