[erlang-questions] System monitoring and logging
Serge Aleynikov
saleyn@REDACTED
Fri Mar 21 16:09:23 CET 2008
I believe that SNMP is not going away anytime soon. However, using it
in a "conventional" way in enterprise networks with thousands of
application instances (including stand-alone C/C++/etc daemons) may lead
to scalability issues.
Still, there are not that many alternatives. JMS is a promising
technology, but you are out of luck if you try to use it for monitoring
anything but Java-based applications. Its bridging capabilities to SNMP
are very restrictive, as you can't extend the agent as flexibly as
Erlang's SNMP agent allows.
The approach that I've used in the past was to use SNMP merely as a
front-end for accessing monitoring stats stored elsewhere. That
elsewhere could be table(s) in a relational database (e.g. mnesia or
MySQL) created automatically from the specification in a MIB file.
The SMIv2 RowStatus field would be mapped to one of the fields in a
table showing the status of the process responsible for this logical row.
A proprietary protocol can be used between the daemon processes and the
agent that maintains access to this database. The agent inserts,
updates, and deletes records when it detects process connections and
disconnections, and updates data in response to process requests of
the form:
    Command   = {snmp, new, OsPid, HostName,
                    [{TableName, RowIndex, ColVals}]} |
                {snmp, set, [{TableName, RowIndex, ColVals}]} |
                {snmp, get, ...} |
                {snmp, notify, ...} |
                ...
    TableName = atom()
    RowIndex  = [{Col, Value}]
    ColVals   = [{Col, Value}]
    Col       = integer()
    Value     = integer() | string()
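To make the command format concrete, here is a minimal sketch of how an
agent might apply the `new` and `set` commands to its row store. The
module name, the in-memory map used as storage, and the error tuples are
all assumptions made for illustration; the `get` and `notify` commands
are elided in the format above, so the sketch simply rejects them.

```erlang
-module(snmp_cmd_sketch).
-export([new_state/0, handle/2]).

%% Hypothetical in-memory store: {TableName, RowIndex} => #{Col => Value}.
%% A real agent would write to mnesia or another database instead.
new_state() -> #{}.

%% {snmp, new, ...}: a connecting process registers the rows it owns.
handle({snmp, new, _OsPid, _HostName, Rows}, State) ->
    {ok, lists:foldl(fun put_row/2, State, Rows)};
%% {snmp, set, ...}: update columns of rows that already exist.
%% Rejecting unknown rows is an assumption of this sketch.
handle({snmp, set, Rows}, State) ->
    Keys = [{T, I} || {T, I, _} <- Rows],
    case [K || K <- Keys, not maps:is_key(K, State)] of
        []      -> {ok, lists:foldl(fun put_row/2, State, Rows)};
        Missing -> {{error, {no_such_row, Missing}}, State}
    end;
%% Anything else (including the elided get/notify forms) is not handled.
handle(_Other, State) ->
    {{error, unsupported}, State}.

%% Merge the supplied column values into the row, overwriting old values.
put_row({Table, RowIndex, ColVals}, State) when is_atom(Table) ->
    Old = maps:get({Table, RowIndex}, State, #{}),
    New = maps:merge(Old, maps:from_list(ColVals)),
    maps:put({Table, RowIndex}, New, State).
```

A process would first send `{snmp, new, OsPid, HostName, Rows}` once on
connect and then `{snmp, set, Rows}` on every stats update; the agent
threads the returned state through successive calls to `handle/2`.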
If mnesia is chosen as the storage medium, multiple agents could hold
disc_copies replicas of these tables to make front-end access scale.
A monitored process could be responsible for updating stats in
multiple tables. It would periodically (every N seconds) dump vital
stats to these tables through a connection (tcp socket, pipe, unix
domain socket, etc) to the local agent, using a light-weight protocol
based on the ei library.
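As an illustration of what such a light-weight protocol might look like
on the Erlang side, here is a sketch that frames each command as a
4-byte big-endian length followed by the term in the Erlang external
term format, which is the format the ei library encodes and decodes on
the C side. The length-prefixed framing and the module name are
assumptions; the same framing is what gen_tcp applies when a socket is
opened with `{packet, 4}`.

```erlang
-module(stats_wire_sketch).
-export([encode/1, decode/1]).

%% Build one frame: 4-byte big-endian length, then the encoded term.
encode(Command) ->
    Payload = term_to_binary(Command),
    <<(byte_size(Payload)):32/big, Payload/binary>>.

%% Try to take one frame off the front of a buffer.
%% Returns {ok, Command, Rest} for a complete frame, or 'more' when the
%% buffer does not yet hold a whole frame.
%% The 'safe' option refuses to create atoms that do not already exist
%% in the agent, since the peer is an external process.
decode(<<Len:32/big, Payload:Len/binary, Rest/binary>>) ->
    {ok, binary_to_term(Payload, [safe]), Rest};
decode(Bin) when is_binary(Bin) ->
    more.
```

A C/C++ daemon would produce the same payload with the ei encoding
functions and write the length header itself before sending.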
In this approach separation of a "master agent" and a "subagent" becomes
meaningless - each agent has access to data in all mnesia tables, so you
don't need to worry about some agents being responsible for parts of MIB
trees.
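For reference, replicating a stats table on every agent node could be
sketched as follows. The record layout, the module name, and the node
names in the usage note are hypothetical; the sketch also assumes a disc
schema has already been created on the agent nodes and that mnesia is
running on them.

```erlang
-module(stats_store_sketch).
-export([table_opts/1, create_table/2]).

%% Hypothetical row layout: key is the RowIndex, cols the column values.
-record(stat_row, {key, cols = []}).

%% Options that give every listed agent node a disc_copies replica.
table_opts(AgentNodes) ->
    [{disc_copies, AgentNodes},
     {record_name, stat_row},
     {attributes, record_info(fields, stat_row)}].

%% Returns {atomic, ok} on success or {aborted, Reason} otherwise.
create_table(TableName, AgentNodes) ->
    mnesia:create_table(TableName, table_opts(AgentNodes)).
```

For example, `stats_store_sketch:create_table(procTable,
['agent1@host1', 'agent2@host2'])` would let either agent answer
requests for any row of that table from its local replica.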
The advantage of this architecture is that a management front-end
does not have to connect to individual processes and poll them, but
can just pull the data from an agent using either SNMP or Web-based
access. So, in this regard SNMP becomes just a front-end protocol for
accessing locally stored data using industry standard tools. You don't
need to focus just on SNMP - build nice web-based GUIs using AJAX to
present data to users. SNMP would be just a freebie allowing other
out-of-the-box tools to pull data from your monitoring system with no
additional development.
This approach is actually quite easy to implement, works across
heterogeneous environments and languages, and worked well for us.
Serge
Hal Snyder wrote:
> About making monitoring built-in instead of an add-on:
>
> Yaws can be a great monitoring tool - which can then be coupled
> upstream with a bit of scripting glue to the monitoring front-end
> preferred at the site in question, SNMP or other.
>
> An approach I've used is to generate a lot of the app, supervisor,
> etc. boilerplate for each new service coming up. Included in the
> boilerplate is a yaws server on each node, with standardized layout
> containing a link for each custom Erlang application on the node. The
> supervisor and main application boilerplate include code to
> initialize an ets table for tracking stats and yaws pages for
> exposing those stats. The generic yaws page includes node start time,
> uptime, nodes list, and a view of the ets stats table. These features
> then become available with no added work when a new application is
> under development.
>
> It's easy to add specific parameters to the stats table that track
> transaction count, error count, etc. These are then automatically
> visible in the yaws page for the application.
>
> A large platform consists of several separate Erlang clusters, each
> of which has at least two replicating mnesia nodes which store
> configuration data (also viewable and settable via yaws boilerplate).
>
> A management station can interrogate the core mnesia nodes to see
> what else is in a cluster, then collect stats from all active nodes
> and their application stats.
>
> By the way, it was surprising how useful the template/boilerplate
> system can be. When a developer comes up with a good idea on how to
> initialize or manage or update a node, you audit the new approach and
> modify for portability, then merge it into the code generation
> system. Your nodes get smarter and smarter over the months as they
> roll out. Kind of like Erlang programming in general, one of the
> winning features of the templating approach is the gradual learning
> curve. Start with something simple that works. It immediately becomes
> useful and justifies the effort. Extend ad lib.
>
> On Mar 19, 2008, at 8:54 AM, Peter Mechlenborg wrote:
>
>> Hi
>>
>> For the last 18 months or so I have been working on an interesting
>> project written in Erlang. Over the last months it has become clear
>> to me that we need a more structured way of monitoring our systems.
>> Right now we basically just have a log file with lots of different
>> information. I'm starting to realize that monitoring and visibility
>> are important properties that should be an integrated part of our
>> architecture; not an add-on. I also think this applies to almost all
>> server systems, especially those with high demands on fault
>> tolerance, so this issue must have been solved many times before in
>> Erlang, or am I wrong here?
>>
>> We have started looking into SNMP, and this seems promising, even
>> though it seems a bit old (I get the impression that SNMP was hot 10
>> years ago, but is kind of phasing out right now. Is this correct?)
>> and rigid. I have not been able to find any alternatives to SNMP, do
>> there exist any? I would really like some feedback on how you guys
>> handle monitoring and logging on the systems you develop and operate,
>> do you use SNMP, some other framework, nothing, or something home
>> grown.