[erlang-questions] System monitoring and logging

Fri Mar 21 16:09:23 CET 2008

I believe that SNMP is not going away anytime soon.  However, using it 
in a "conventional" way in enterprise networks with thousands of 
application instances (including stand-alone C/C++/etc daemons) may lead 
to scalability issues.

There are not that many alternatives, though.  JMX is a promising 
technology, but you are out of luck if you try to use it for monitoring 
anything but Java-based applications.  Its bridging capabilities to SNMP 
are also very restrictive: you can't extend the agent as flexibly as 
Erlang's SNMP agent allows.

The approach that I've used in the past is to use SNMP merely as a 
front-end for accessing monitoring stats stored elsewhere.  That 
elsewhere could be table(s) in a relational database (e.g. mnesia / 
MySQL / etc.) created automatically from the specification in a MIB file.
The SMIv2 RowStatus column would be mapped to a field in the table that 
shows the status of the process responsible for that logical row. 
A proprietary protocol can be used between the daemon processes and the 
agent that manages access to this database.  The agent inserts, updates 
and deletes records when it detects process connections/disconnects, and 
updates data based on process requests coming in the form:

	Command = {snmp, new, OsPid, HostName,
	              [{TableName, RowIndex, ColVals}]} |
	          {snmp, set, [{TableName, RowIndex, ColVals}]} |
	          {snmp, get, ...} |
	          {snmp, notify, ...} |
	          ...
	TableName = atom()
	RowIndex = [{Col, Value}]
	ColVals  = [{Col, Value}]
	Col      = integer()
	Value    = integer() | string()
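
To make the message format above more concrete, here is a minimal 
sketch of how an agent might apply such commands.  It is an 
illustration only, not the code we ran: the module name, the ets-based 
store (standing in for mnesia / MySQL), the {disconnected, ...} message 
and the choice of active / notInService as status values are all 
assumptions made for this example.

```erlang
-module(snmp_cmd_sketch).
-export([new_store/0, handle/2]).

%% Rows are kept as {{TableName, RowIndex}, RowStatus, Owner, ColVals}.
%% RowStatus is 'active' while the owning process is connected and
%% 'notInService' afterwards (an assumed mapping for this sketch).

new_store() ->
    ets:new(snmp_rows, [set, public]).

%% A process connected and registered its rows: create them as active.
handle(Store, {snmp, new, OsPid, HostName, Rows}) ->
    Owner = {HostName, OsPid},
    lists:foreach(
      fun({Tab, Idx, Cols}) ->
              ets:insert(Store, {{Tab, Idx}, active, Owner,
                                 lists:ukeysort(1, Cols)})
      end, Rows),
    ok;
%% A process pushed fresh values: merge them into the existing rows.
%% Returns one result per row.
handle(Store, {snmp, set, Rows}) ->
    [set_row(Store, Row) || Row <- Rows];
%% Hypothetical message generated by the agent when a connection drops.
handle(Store, {disconnected, OsPid, HostName}) ->
    Owner = {HostName, OsPid},
    Keys = [K || {K, _, O, _} <- ets:tab2list(Store), O =:= Owner],
    lists:foreach(
      fun(K) -> ets:update_element(Store, K, {2, notInService}) end,
      Keys),
    ok.

set_row(Store, {Tab, Idx, Cols}) ->
    Key = {Tab, Idx},
    case ets:lookup(Store, Key) of
        [{Key, _Status, _Owner, Old}] ->
            %% ukeymerge keeps the tuple from the first list when the
            %% column numbers are equal, so new values win.
            New = lists:ukeymerge(1, lists:ukeysort(1, Cols), Old),
            ets:update_element(Store, Key, {4, New}),
            ok;
        [] ->
            {error, {no_such_row, Key}}
    end.
```

The {snmp, get, ...} and {snmp, notify, ...} commands are left out 
because their arguments are not spelled out above.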

If mnesia is chosen as the storage medium, multiple agents could hold 
disc_copies replicas of these tables so that front-end access scales.  
A monitored process could be responsible for updating stats in 
multiple tables.  It would periodically (every N seconds) dump vital 
stats to these tables through its connection (TCP socket, pipe, Unix 
domain socket, etc.) to the local agent using a light-weight protocol 
based on the ei library.
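
As a rough illustration of such a light-weight protocol, the sketch 
below frames each command as a 4-byte big-endian length followed by the 
term in Erlang's external term format, which is the format ei encodes 
and decodes on the C side.  The framing, the module name and the 
push_loop/3 helper are assumptions made for this example, not a 
description of the protocol we actually used.

```erlang
-module(stats_wire_sketch).
-export([encode/1, decode/1, push_loop/3]).

%% Frame a command: 32-bit length header, then the external term format.
encode(Command) ->
    Payload = term_to_binary(Command),
    <<(byte_size(Payload)):32, Payload/binary>>.

%% Try to take one complete frame off the front of a receive buffer.
%% The 'safe' option refuses to create new atoms from untrusted input.
decode(<<Len:32, Payload:Len/binary, Rest/binary>>) ->
    {ok, binary_to_term(Payload, [safe]), Rest};
decode(Buffer) when is_binary(Buffer) ->
    {more, Buffer}.

%% Erlang stand-in for a monitored process: every IntervalMs it calls
%% CollectFun() to get [{TableName, RowIndex, ColVals}] and sends it.
%% Socket is assumed to be a gen_tcp socket opened in raw mode
%% ({packet, 0}), since encode/1 adds the length header itself.
push_loop(Socket, CollectFun, IntervalMs) ->
    ok = gen_tcp:send(Socket, encode({snmp, set, CollectFun()})),
    timer:sleep(IntervalMs),
    push_loop(Socket, CollectFun, IntervalMs).
```

A C or C++ daemon would build the same bytes with ei and write them to 
its socket or pipe; the agent only needs decode/1 on its side.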

In this approach the separation of a "master agent" and a "subagent" 
becomes meaningless - each agent has access to the data in all mnesia 
tables, so you don't need to worry about some agents being responsible 
for parts of the MIB tree.

The advantage of this architecture is that the management front-end 
does not have to connect to individual processes and poll them for 
info, but rather just pulls that data from an agent using either SNMP 
or Web-based access.  So, in this regard SNMP becomes just a front-end 
protocol for accessing locally stored data using industry standard 
tools.  You don't need to focus just on SNMP - build nice web-based GUIs 
using AJAX to present data to users.  SNMP would be just a freebie 
allowing other out-of-the-box tools to pull data from your monitoring 
system with no additional development.

This approach is actually quite easy to implement, works in 
heterogeneous environments / languages, and worked well for us.

Serge

Hal Snyder wrote:
> About making monitoring built-in instead of an add-on:
> 
> Yaws can be a great monitoring tool - which can then be coupled  
> upstream with a bit of scripting glue to the monitoring front-end  
> preferred at the site in question, SNMP or other.
> 
> An approach I've used is to generate a lot of the app, supervisor,  
> etc. boilerplate for each new service coming up. Included in the  
> boilerplate is a yaws server on each node, with standardized layout  
> containing a link for each custom Erlang application on the node. The  
> supervisor and main application boilerplate include code to  
> initialize an ets table for tracking stats and yaws pages for  
> exposing those stats. The generic yaws page includes node start time,  
> uptime, nodes list, and a view of the ets stats table. These features  
> then become available with no added work when a new application is  
> under development.
> 
> It's easy to add specific parameters to the stats table that track  
> transaction count, error count, etc. These are then automatically  
> visible in the yaws page for the application.
> 
> A large platform consists of several separate Erlang clusters, each  
> of which has at least two replicating mnesia nodes which store  
> configuration data (also viewable and settable via yaws boilerplate).
> 
> A management station can interrogate the core mnesia nodes to see  
> what else is in a cluster, then collect stats from all active nodes  
> and their application stats.
> 
> By the way, it was surprising how useful the template/boilerplate  
> system can be. When a developer comes up with a good idea on how to  
> initialize or manage or update a node, you audit the new approach and  
> modify for portability, then merge it into the code generation  
> system. Your nodes get smarter and smarter over the months as they  
> roll out. Kind of like Erlang programming in general, one of the  
> winning features of the templating approach is the gradual learning  
> curve. Start with something simple that works. It immediately becomes  
> useful and justifies the effort. Extend ad lib.
> 
> On Mar 19, 2008, at 8:54 AM, Peter Mechlenborg wrote:
> 
>> Hi
>>
>> For the last 18 months or so I have been working on an interesting
>> project written in Erlang.  Over the last months it has become clear
>> to me that we need a more structured way of monitoring our systems.
>> Right now we basically just have a log file with lots of different
>> information.  I'm starting to realize that monitoring and visibility
>> are important properties that should be an integrated part of our
>> architecture; not an add-on.  I also think this applies to almost all
>> server systems, especially those with high demands on fault
>> tolerance, so this issue must have been solved many times before in
>> Erlang, or am I wrong here?
>>
>> We have started looking into SNMP, and this seems promising, even
>> though it seems a bit old (I get the impression that SNMP was hot 10
>> years ago, but is kind of phasing out right now.  Is this correct?)
>> and rigid.  I have not been able to find any alternatives to SNMP; do
>> any exist?  I would really like some feedback on how you guys
>> handle monitoring and logging on the systems you develop and operate:
>> do you use SNMP, some other framework, nothing, or something
>> home-grown?