{"id":1532,"date":"2010-07-03T16:12:00","date_gmt":"2010-07-03T15:12:00","guid":{"rendered":"http:\/\/www.devco.net\/?p=1532"},"modified":"2010-08-17T12:03:08","modified_gmt":"2010-08-17T11:03:08","slug":"aggregating_nagios_checks_with_mcollective","status":"publish","type":"post","link":"https:\/\/www.devco.net\/archives\/2010\/07\/03\/aggregating_nagios_checks_with_mcollective.php","title":{"rendered":"Aggregating Nagios Checks With MCollective"},"content":{"rendered":"
A very typical scenario I come across on many sites is the requirement to monitor something like Puppet across 100s or 1000s of machines.<\/p>\n
The typical approaches are to add a central check on your puppet master or to check every node using NRPE or NSCA. For this example the option exists to check on the master and get a single check, but that isn’t always easily achievable.<\/p>\n
Think for example about monitoring mail queues on all your machines to make sure things like root mail aren’t getting stuck. In those cases you are forced to do per-node checks, which inevitably result in huge notification storms when your mail server is down and not receiving the mail from the many nodes.<\/p>\n
MCollective<\/a> has had a plugin that can run NRPE commands<\/a> for a long time. I’ve now added a Nagios plugin that uses this agent to combine results from many hosts.<\/p>\n Sticking with the Puppet example, here are my needs:<\/p>\n This is a pretty painful set of requirements for Nagios to achieve on its own. It is easy with the help of MCollective.<\/p>\n Ultimately, I just want this:<\/p>\n <\/code><\/p>\n Meaning 42 machines – only the ones currently enabled – are all running happily.<\/p>\n We put the NRPE logic on every node. A simple check command in \/etc\/nagios\/nrpe.d\/check_puppet_run.cfg<\/em>:<\/p>\n <\/code><\/p>\n In my case I just want to know that successful runs are happening. If I wanted to know that the code is actually compiling correctly I’d monitor the local cache age and size.<\/p>\n Currently this is a bit hacky; I’ve filed tickets with Puppet Labs to improve this. The way to determine if Puppet is disabled is to check whether the lock file exists and is 0 bytes. If it’s not zero bytes, either a puppetd<\/em> is currently doing a run – there will be a pid in it – or the puppetd<\/em> crashed and left a stale pid that prevents other runs.<\/p>\n To automate this and integrate it into MCollective I’ve made a fact, puppet_enabled<\/a>. We’ll use this in MCollective discovery to only monitor machines that are enabled. Get this onto all your nodes, perhaps using Plugins in Modules<\/a>.<\/p>\n\n
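To make the lock file rule concrete, here is a minimal Python sketch of the decision described above. It is only an illustration and not the puppet_enabled fact itself. The function name, the returned labels and the lock file path passed in are all assumptions; use whatever path your Puppet installation actually writes.

```python
import os


def puppet_state(lock_path):
    """Classify Puppet's state from its lock file (illustrative only).

    lock_path is a hypothetical parameter: pass the lock file location
    used by your own Puppet installation.
    """
    try:
        size = os.stat(lock_path).st_size
    except FileNotFoundError:
        # No lock file at all: nothing is blocking runs.
        return "enabled"
    if size == 0:
        # An empty lock file is the "disabled" marker described in the post.
        return "disabled"
    # A non-empty lock file holds a pid: either a run is in progress
    # or a crashed puppetd left a stale pid behind.
    return "running_or_stale"
```

The sketch cannot tell a run in progress from a stale pid, because the file size alone does not distinguish them. Telling them apart would need an extra check on whether the pid is still alive.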
<\/p>\n
\r\nOK: 42 WARNING: 0 CRITICAL: 0 UNKNOWN: 0\r\n<\/pre>\n
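A summary line like this can be thought of as a tally of the per-node NRPE exit codes, using the standard Nagios convention of 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. The Python sketch below shows one way to build it. The function name and the rule for picking the overall exit code are assumptions made for illustration, not necessarily what the actual plugin does.

```python
from collections import Counter

# Standard Nagios plugin exit codes.
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}
ORDER = ("OK", "WARNING", "CRITICAL", "UNKNOWN")


def summarize(exit_codes):
    """Return (summary_line, overall_exit_code) for a list of exit codes."""
    # Any exit code outside 0-3 is counted as UNKNOWN.
    counts = Counter(STATUS_NAMES.get(code, "UNKNOWN") for code in exit_codes)
    line = " ".join(f"{name}: {counts[name]}" for name in ORDER)

    # Assumed policy: the worst state seen decides the overall result,
    # with CRITICAL taking priority over WARNING and then UNKNOWN.
    if counts["CRITICAL"]:
        overall = 2
    elif counts["WARNING"]:
        overall = 1
    elif counts["UNKNOWN"]:
        overall = 3
    else:
        overall = 0
    return line, overall
```

Because Nagios sees a single service with a single state, a mail server outage that breaks hundreds of nodes would produce one notification instead of hundreds.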
The NRPE Check<\/h2>\n
<\/p>\n
\r\ncommand[check_puppet_run]=\/usr\/lib\/nagios\/plugins\/check_file_age -f \/var\/lib\/puppet\/state\/state.yaml -w 5400 -c 7200\r\n<\/pre>\n
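The thresholds in this command are in seconds, so the check warns when state.yaml has not been modified for more than 90 minutes (5400 seconds) and goes critical after 2 hours (7200 seconds). The Python sketch below approximates that logic. It is not the check_file_age plugin's own code, and treating a missing file as CRITICAL is an assumption of this sketch.

```python
import os
import time


def file_age_status(path, warn=5400, crit=7200, now=None):
    """Return a Nagios-style exit code based on a file's age in seconds."""
    now = time.time() if now is None else now
    try:
        age = now - os.stat(path).st_mtime
    except FileNotFoundError:
        # Assumption: a missing state file counts as CRITICAL.
        return 2
    if age > crit:
        return 2  # CRITICAL: no update for more than `crit` seconds
    if age > warn:
        return 1  # WARNING: no update for more than `warn` seconds
    return 0      # OK: the file was updated recently enough
```

The `now` parameter exists only so the function can be exercised with a fixed clock. In normal use it defaults to the current time.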
Determining if Puppet is enabled or not<\/h2>\n
The MCollective Agent<\/h2>\n