{"id":1532,"date":"2010-07-03T16:12:00","date_gmt":"2010-07-03T15:12:00","guid":{"rendered":"http:\/\/www.devco.net\/?p=1532"},"modified":"2010-08-17T12:03:08","modified_gmt":"2010-08-17T11:03:08","slug":"aggregating_nagios_checks_with_mcollective","status":"publish","type":"post","link":"https:\/\/www.devco.net\/archives\/2010\/07\/03\/aggregating_nagios_checks_with_mcollective.php","title":{"rendered":"Aggregating Nagios Checks With MCollective"},"content":{"rendered":"<p>A very typical scenario I come across on many sites is the requirement to monitor something like Puppet across 100s or 1000s of machines.<\/p>\n<p>The typical approaches are to add a central check on your puppet master or to check every node using NRPE or NSCA.  For this example the option exists to check on the master and get a single check, but that isn&#8217;t always easily achievable.  <\/p>\n<p>Think, for example, about monitoring mail queues on all your machines to make sure things like root mail aren&#8217;t getting stuck.   
In those cases you are forced to do per-node checks, which inevitably result in huge notification storms if your mail server is down and not receiving mail from the many nodes.<\/p>\n<p><a href=\"http:\/\/marionette-collective.org\/\">MCollective<\/a> has had <a href=\"http:\/\/code.google.com\/p\/mcollective-plugins\/wiki\/AgentNRPE\">a plugin that can run NRPE commands<\/a> for a long time.  I&#8217;ve now added a Nagios plugin that uses this agent to combine results from many hosts.<\/p>\n<p>Sticking with the Puppet example, here are my needs:<\/p>\n<ul>\n<li>I want to know if any Puppet machine anywhere isn&#8217;t successfully doing runs.<\/li>\n<li>I want to be able to do <em>puppetd --disable<\/em> and not get alerts for those machines.<\/li>\n<li>I do not want to change any configs when I am adding new machines; it should just work.<\/li>\n<li>I want the ability to monitor subsets of machines from different probes.<\/li>\n<\/ul>\n<p>This is a pretty painful set of requirements for Nagios to achieve on its own.  With the help of MCollective it is easy.<\/p>\n<p>Ultimately, I just want this:<\/p>\n<p><code><\/p>\n<pre lang=\"text\">\r\nOK: 42 WARNING: 0 CRITICAL: 0 UNKNOWN: 0\r\n<\/pre>\n<p><\/code><\/p>\n<p>Meaning 42 machines, only the ones currently enabled, are all running happily.<\/p>\n<h2>The NRPE Check<\/h2>\n<p>We put the NRPE logic on every node.  
A simple check command in <em>\/etc\/nagios\/nrpe.d\/check_puppet_run.cfg<\/em>:<\/p>\n<p><code><\/p>\n<pre lang=\"text\">\r\ncommand[check_puppet_run]=\/usr\/lib\/nagios\/plugins\/check_file_age -f \/var\/lib\/puppet\/state\/state.yaml -w 5400 -c 7200\r\n<\/pre>\n<p><\/code><\/p>\n<p>In my case I just want to know that successful runs are happening.  If I wanted to know that the code is actually compiling correctly, I&#8217;d monitor the age and size of the local cache.<\/p>\n<h2>Determining if Puppet is enabled or not<\/h2>\n<p>Currently this is a bit hacky; I&#8217;ve filed tickets with Puppet Labs to improve this.  The way to determine if Puppet is disabled is to check whether the lock file exists and is 0 bytes.  If it is not 0 bytes, either a <em>puppetd<\/em> is currently doing a run, in which case there will be a pid in it, or the <em>puppetd<\/em> crashed and left a stale pid that prevents other runs.  <\/p>\n<p>To automate this and integrate it into MCollective I&#8217;ve made a fact, <a href=\"http:\/\/github.com\/ripienaar\/facter-facts\/tree\/master\/puppet-enabled\/\">puppet_enabled<\/a>.  We&#8217;ll use this in MCollective discovery to only monitor machines that are enabled.  
Get this onto all your nodes, perhaps using <a href=\"http:\/\/docs.reductivelabs.com\/guides\/plugins_in_modules.html\">Plugins in Modules<\/a>.<\/p>\n<h2>The MCollective Agent<\/h2>\n<p>You want to deploy the <a href=\"http:\/\/code.google.com\/p\/mcollective-plugins\/wiki\/AgentNRPE\">MCollective NRPE Agent<\/a> to all your nodes.  Once you&#8217;ve got it right you can test it easily using something like this:<\/p>\n<p><code><\/p>\n<pre lang=\"text\">\r\n% mc-nrpe -W puppet_enabled=1 check_puppet_run\r\n\r\n * [ ============================================================> ] 47 \/ 47\r\n\r\nFinished processing 47 \/ 47 hosts in 395.51 ms\r\n              OK: 47\r\n         WARNING: 0\r\n        CRITICAL: 0\r\n         UNKNOWN: 0\r\n<\/pre>\n<p><\/code><\/p>\n<p>Note we&#8217;re restricting the run to only enabled hosts.<\/p>\n<h2>Integrating into Nagios<\/h2>\n<p>The last step is to add this to Nagios.  I create SSL certs and a specific client configuration for Nagios and put these in its home directory.<\/p>\n<p>The <em>check-mc-nrpe<\/em> plugin works best with Nagios 3, as it returns subsequent lines of output indicating which machines are in which state, so your alerts include the details hidden behind the aggregation.  
It also outputs performance data for the total number of nodes, the number in each status and how long the check took.<\/p>\n<p>The Nagios command would be something like this:<\/p>\n<p><code><\/p>\n<pre lang=\"text\">\r\ndefine command{\r\n        command_name                    check_mc_nrpe\r\n        command_line                    \/usr\/sbin\/check-mc-nrpe  --config \/var\/log\/nagios\/.mcollective\/client.cfg  -W $ARG1$ $ARG2$\r\n}\r\n<\/pre>\n<p><\/code><\/p>\n<p>And finally we need to make a service:<\/p>\n<p><code><\/p>\n<pre lang=\"text\">\r\ndefine service{\r\n        host_name                       monitor1\r\n        service_description             mc_puppet-run\r\n        use                             generic-service\r\n        check_command                   check_mc_nrpe!puppet_enabled=1!check_puppet_run\r\n        notification_period             awakehours\r\n        contact_groups                  sysadmin\r\n}\r\n<\/pre>\n<p><\/code><\/p>\n<p>Here are a few other command examples I use:<\/p>\n<p>All machines with my Puppet class &#8220;pki&#8221;, check the age of certs:<\/p>\n<p><code><\/p>\n<pre lang=\"text\">\r\ncheck_command   check_mc_nrpe!pki!check_pki\r\n<\/pre>\n<p><\/code><\/p>\n<p>All machines with my Puppet class &#8220;bacula::node&#8221;, make sure the FD is running:<\/p>\n<p><code><\/p>\n<pre lang=\"text\">\r\ncheck_command   check_mc_nrpe!bacula::node!check_fd\r\n<\/pre>\n<p><\/code><\/p>\n<p>&#8230;and that they were backed up:<\/p>\n<p><code><\/p>\n<pre lang=\"text\">\r\ncheck_command   check_mc_nrpe!bacula::node!check_bacula_main\r\n<\/pre>\n<p><\/code><\/p>\n<p>Using this I removed 100s of checks from my monitoring platform, saving on resources and making sure I can do my critical monitoring tasks better.<\/p>\n<p>Depending on the quality of your monitoring system you might even get a graph showing the details hidden behind the aggregation:<\/p>\n<p><center><img decoding=\"async\" 
src=\"http:\/\/www.devco.net\/images\/mcbacula.png\"><\/center><\/p>\n<p>The graph above shows a series of servers where the backup ran later than usual.  I had only 2 alerts; before aggregation I would have had more than 30.<\/p>\n<h2>Restrictions for Probes<\/h2>\n<p>The last remaining requirement I had was to be able to do checks on different probes and restrict them.  My collective is one big one spread all over the world, which means discovery is sometimes a bit slow.  <\/p>\n<p>So I have many Nagios servers doing local checks.  Using MCollective discovery I can now easily restrict checks.  For example, if I only wanted to check machines in the USA and I had a fact <em>country<\/em>, I would only have to change the check command in the service declaration:<\/p>\n<p><code><\/p>\n<pre lang=\"text\">\r\ncheck_command   check_mc_nrpe!puppet_enabled=1 country=us!check_puppet_run\r\n<\/pre>\n<p><\/code><\/p>\n<p>This will then, via MCollective discovery, monitor only machines in the US.<\/p>\n<h2>What to monitor this way<\/h2>\n<p>As this style of monitoring relies on discovery, you need to think carefully about what you monitor this way.  It&#8217;s entirely conceivable that a node under high CPU load won&#8217;t respond to discovery requests in time, and so won&#8217;t get monitored!<\/p>\n<p>You would therefore not want to monitor things like load averages or really critical services this way.  But we all have a lot of peripheral things, like zombie process counts, where aggregation makes a lot of sense; in those cases by all means consider this approach.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A very typical scenario I come across on many sites is the requirement to monitor something like Puppet across 100s or 1000s of machines. The typical approaches are to add perhaps a central check on your puppet master or to check using NRPE or NSCA on every node. 
For this example the option exists to [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_et_pb_use_builder":"","_et_pb_old_content":"","footnotes":""},"categories":[7],"tags":[121,85,78,64],"_links":{"self":[{"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/posts\/1532"}],"collection":[{"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/comments?post=1532"}],"version-history":[{"count":28,"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/posts\/1532\/revisions"}],"predecessor-version":[{"id":1694,"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/posts\/1532\/revisions\/1694"}],"wp:attachment":[{"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/media?parent=1532"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/categories?post=1532"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/tags?post=1532"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}