{"id":1980,"date":"2011-03-25T13:09:02","date_gmt":"2011-03-25T12:09:02","guid":{"rendered":"http:\/\/www.devco.net\/?p=1980"},"modified":"2011-03-25T15:15:46","modified_gmt":"2011-03-25T14:15:46","slug":"monitoring_framework_event_correlation","status":"publish","type":"post","link":"https:\/\/www.devco.net\/archives\/2011\/03\/25\/monitoring_framework_event_correlation.php","title":{"rendered":"Monitoring Framework: Event Correlation"},"content":{"rendered":"
Since my last post I’ve spoken to a lot of people all excited to see something fresh in the monitoring space. I’ve learned a lot – primarily that no one tool will please everyone. This is why monitoring systems are so hated – they try to impose their world view, they’re hard to hack on and hard to get data out of. This only reinforced my belief that rather than build a new monitoring system I should build a framework that can build monitoring systems. <\/p>\n
DevOps shops who can cut code should be able to build the monitoring they want, not the monitoring their vendor thought they wanted.<\/em><\/p>\n Thus my focus has not been on how I can declare relationships between services, or how I can declare an escalation matrix. My focus has been on events and how events relate to each other.<\/p>\n Identifying an Event<\/strong> Events need to be identified so that you can send information related to the same event from many sources. Your trap system might raise a trap about a port on a switch but your stats poller might emit regular packet counts – you need to know these two are for the same port. <\/p>\n You can identify events by subject<\/em> and by name<\/em>; together they make up the event identity. Subject might be the FQDN of a host and name might be load<\/em> or cpu usage<\/em>. <\/p>\n This way, if you have many ways to input information related to some event, you just need to identify them correctly.<\/p>\n Finally, as each event gets stored it is given a unique ID that you can use to pull out information about just a specific instance of an event.<\/p>\n Types Of Event<\/strong> <\/p>\n The event you see on the right is a metric event – it doesn’t represent one specific status and it’s a time series event which in this case got fed into Graphite.<\/p>\n Status events get tracked automatically – a representation is built for each unique event based on its subject and name. This status representation can progress through states like OK, Warning, Critical etc. Events sent from many different sources get condensed and summarized into a single status representing how that status looks based on the most recently received data – regardless of the source of the data. 
<\/p>\n Each state transition and each non-zero severity event will raise an Alert and get routed to a – pluggable – notification framework or frameworks.<\/p>\n Event Associations and Metadata<\/strong><\/p>\n Events can have a lot of additional data beyond what the framework needs; this is one of the advantages of NoSQL based storage. A good example of this would be a GitHub commit hook<\/a>. You might want to store this and retain the rich data present in this event.<\/p>\n My framework lets you store all this additional data in the event archive, and later on you can pick it up based on event ID and get hold of all this rich data to build reactive alerting or correction based on callbacks.<\/p>\n
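To make the status tracking and alerting rules a bit more concrete, here is a minimal Python sketch. It is not the framework's actual code: the `Event` and `StatusTracker` names, the numeric severity scale and the in-memory dictionaries are all assumptions made for illustration. It only shows one way events from several sources could be condensed into one status per (subject, name) identity, archived under a unique ID along with their extra data, and turned into alerts.

```python
from dataclasses import dataclass, field

# Hypothetical severity scale; the post only names OK, Warning and Critical.
SEVERITIES = {0: "OK", 1: "Warning", 2: "Critical"}


@dataclass
class Event:
    subject: str   # e.g. the FQDN of a host
    name: str      # e.g. "load"
    severity: int
    source: str    # which bridge sent it (illustrative only)
    metadata: dict = field(default_factory=dict)  # any extra rich data


class StatusTracker:
    """In-memory stand-in for the event archive and status store."""

    def __init__(self):
        self.statuses = {}  # (subject, name) -> most recent severity
        self.archive = {}   # event id -> full Event, including metadata
        self.alerts = []    # (event id, identity, state name)
        self._next_id = 1

    def record(self, event):
        # Every stored event instance gets its own unique ID.
        event_id = self._next_id
        self._next_id += 1
        self.archive[event_id] = event

        # The most recent data wins, regardless of which source sent it.
        identity = (event.subject, event.name)
        previous = self.statuses.get(identity)
        self.statuses[identity] = event.severity

        # Assumption: the first sighting of an identity is not a transition.
        transitioned = previous is not None and previous != event.severity
        if transitioned or event.severity != 0:
            self.alerts.append((event_id, identity, SEVERITIES[event.severity]))
        return event_id


tracker = StatusTracker()
tracker.record(Event("web1.example.com", "load", 0, "nagios"))
critical_id = tracker.record(
    Event("web1.example.com", "load", 2, "mcollective", {"load1": 12.5})
)
tracker.record(Event("web1.example.com", "load", 0, "nagios"))
```

After these three events there is a single status for the `load` check on `web1.example.com`, and the extra data sent with the critical event can still be fetched from the archive by its ID.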
\nEvents can come from many places. In the recent video demo I did<\/a> you saw events from Nagios and events from MCollective. I also have event bridges for my Apache Blackbox<\/a> and SNMP Traps, and it would be trivial to support events from GitHub commit hooks, Amazon SNS<\/a> and really any conceivable source.<\/p>\n
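Whatever the source, each bridge has to map its input onto the same subject and name for the events to line up. The sketch below illustrates this with the switch port example, using a trap and a stats poller sample. The payload shapes, field names and the `port <interface>` naming scheme are invented for the example and are not taken from the framework.

```python
def identity_from_trap(trap):
    # Hypothetical trap payload: {"agent": <switch FQDN>, "if_name": <port>}
    return (trap["agent"].lower(), f"port {trap['if_name']}")


def identity_from_poller(sample):
    # Hypothetical poller sample: {"host": ..., "interface": ..., "packets": ...}
    return (sample["host"].lower(), f"port {sample['interface']}")


trap_identity = identity_from_trap(
    {"agent": "switch1.example.com", "if_name": "ge-0/0/1"}
)
poller_identity = identity_from_poller(
    {"host": "Switch1.example.com", "interface": "ge-0/0/1", "packets": 10432}
)
```

Both calls yield the same (subject, name) pair, so the trap and the packet counts would be treated as information about the same port.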
\nI have identified a couple of types of event in the first iteration:<\/p>\n\n