by R.I. Pienaar | Jan 20, 2013 | Uncategorized
I’ve recently been thinking about ways to do graphs on the CLI. We’ve written a new Puppet Agent for MCollective that can gather all sorts of interesting data from your server estate and I’d really like to be able to show this data on the CLI. This post isn’t really about MCollective though; the ideas apply to any data.
I already have sparklines in MCollective, here’s the distribution of ping times:
This shows you that most of the nodes responded quickly with a bit of a tail at the end being my machines in the US.
Sparklines are quite nice for a quick overview so I looked at adding some more of this to the UI and came up with this:
Which is quite nice – these are the nodes in my infrastructure stuck into buckets and the node counts for each bucket are shown. We can immediately tell something is not quite right – the config retrieval time shows a bunch of slow machines and the slowness does not correspond to resource counts etc. On investigation I found these are my dev machines – KVM nodes hosted on HP Micro Servers, so that’s to be expected.
I am not particularly happy with these graphs though, so I am still exploring other options; one of them is GNU Plot.
GNU Plot can target its graphs for different terminals like PNG and also line printers – since the Unix terminal is essentially a line printer we can use this.
Here is a graph of config retrieval time produced by MCollective using the same data source that produced the sparkline above – though obviously from a different time period. Note that the axis titles and graph title are supplied automatically using the MCollective DDL:
$ mco plot resource config_retrieval_time
Information about Puppet managed resources
Nodes
6 ++-*****----+----------+-----------+----------+----------+----------++
+ * + + + + + +
| * |
5 ++ * ++
| * |
| * |
4 ++ * * ++
| * * |
| * * * |
3 ++ * * * ++
| * * * |
| * * * |
2 ++ * * * * ++
| * ** ** |
| * * * * * |
1 ++ * * * * * ** * ++
| * * * * * * ** ** |
+ + * + * + * * +* * + * +
0 ++----------+-------*************--+--****----+*****-----+--***-----++
0 10 20 30 40 50 60
Config Retrieval Time
So this is pretty serviceable for showing this data on the console! It wouldn’t scale to many lines but for just visualizing some arbitrary series of numbers it’s quite nice. Here’s the GNU Plot script that made the text graph:
set title "Information about Puppet managed resources"
set terminal dumb 78 24
set key off
set ylabel "Nodes"
set xlabel "Config Retrieval Time"
plot '-' with lines
3 6
6 6
9 3
11 2
14 4
17 0
20 0
22 0
25 0
28 0
30 1
33 0
36 0
38 2
41 0
44 0
46 2
49 1
52 0
54 0
57 1
The magic here comes from the second line that sets the output terminal to dumb and supplies some dimensions. Very handy, worth exploring some more and adding to your toolset for the CLI. I’ll look at writing a gem or something that supports both these modes.
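A minimal Ruby sketch of what such a helper could look like – it builds a GNU Plot script for the dumb terminal from x/y pairs and, assuming the gnuplot binary is on the PATH, pipes it to gnuplot on stdin. The method names are my own invention, not any existing gem’s:

```ruby
require "open3"

# Build a gnuplot script targeting the text-only "dumb" terminal; the
# inline '-' data source is terminated with "e" as gnuplot expects.
def dumb_plot_script(pairs, title:, xlabel:, ylabel:, width: 78, height: 24)
  header = <<~GNUPLOT
    set title "#{title}"
    set terminal dumb #{width} #{height}
    set key off
    set ylabel "#{ylabel}"
    set xlabel "#{xlabel}"
    plot '-' with lines
  GNUPLOT
  header + pairs.map { |x, y| "#{x} #{y}" }.join("\n") + "\ne\n"
end

# Render the ASCII graph by feeding the script to gnuplot on stdin.
def render_dumb_plot(pairs, **options)
  script = dumb_plot_script(pairs, **options)
  output, _err, status = Open3.capture3("gnuplot", stdin_data: script)
  status.success? ? output : nil
end
```

Swapping the terminal line for something like `set terminal png` would give the same helper a graphical output mode.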
There are a few other players in this space – I definitely recall coming across a Python tool to do graphs but cannot find it now. Shout out in the comments if you know other approaches and I’ll add them to the post!
Updated: some links to related projects: sparkler, Graphite Spark
by R.I. Pienaar | Jan 6, 2013 | Code
Redis is an in-memory key-value data store that provides a small number of primitives suitable to the task of building monitoring systems. As a lot of us are hacking in this space I thought I’d write a blog post summarizing where I’ve been using it in a little Sensu-like monitoring system I have been working on, on and off.
There are some monitoring-related events coming up like MonitoringLove in Antwerp and Monitorama in Boston – I will be attending both and I hope a few members of the community will create similar background posts on various interesting areas before these events.
I’ve only recently started looking at Redis but really like it. It’s a very lightweight daemon written in C with fantastic documentation detailing things like each command’s performance characteristics, and most documentation pages are live in that they have a REPL right on the page – like the SET page – note you can type into the code sample and see your changes in real time. It is sponsored by VMware and released under the 3-clause BSD license.
Redis Data Types
Redis provides a few common data structures:
- Normal key-value storage where every key has just one string value
- Hashes where every key contains a hash of key-value strings
- Lists of strings – basically just plain old Arrays sorted in insertion order that allow duplicate values
- Sets are a bit like Lists but with the addition that a given value can only appear once
- Sorted Sets are Sets where each value also has a weight associated with it; the set is indexed by weight
All the keys support things like expiry based on time and TTL calculation. Additionally it also supports PubSub.
At first it can be hard to imagine how you’d use a data store with only these few data types – capable of only storing strings – for monitoring, but with a bit of creativity it can be really very useful.
The full reference about all the types can be found in the Redis Docs: Data Types
Monitoring Needs
Monitoring systems generally need a number of different types of storage: configuration, event archiving, and status and alert tracking. There are more but these are the big ticket items; of the 3 I am only going to focus on the last one – Status and Alert Tracking – here.
Status tracking is essentially transient data. If you lose your status view it’s not really a big deal – it will be recreated quite quickly as new check results come in. Worst case you’ll get some alerts again that you recently got. This fits well with Redis, which doesn’t always commit data as soon as it receives it – it flushes from memory to disk roughly every second.
Redis does not provide much by way of SSL or strong authentication so I tend to consider it a single-node IPC system rather than, say, a generic PubSub system. I feed data into a node using a system like ActiveMQ, and then for comms and state tracking on a single node I’ll use Redis.
I’ll show how it can be used to solve the following monitoring related storage/messaging problems:
- Check Status – a check like load on every node
- Staleness Tracking – you need to know when a node is not receiving check results so you can do alive checks
- Event Notification – your core monitoring system will likely feed into alerters like Opsgenie and metric storage like Graphite
- Alert Tracking – you need to know when you last sent an alert and when you can alert again based on an interval like every 2 hours
Check Status
The check is generally the main item in a monitoring system. Something configures a check like load and then every node gets check results for this item; the monitoring system has to track the status of the checks on a per-node basis.
In my example a check result looks more or less like this:
{"lastcheck" => "1357490521",
"count" => "1143",
"exitcode" => "0",
"output" => "OK - load average: 0.23, 0.10, 0.02",
"last_state_change"=> "1357412507",
"perfdata" => '{"load15":0.02,"load5":0.1,"load1":0.23}',
"check" => "load",
"host" => "dev2.devco.net"}
This is standard stuff and the most boring part – you might guess this goes into a Hash and you’ll be right. Note the count item there – Redis has special handling for counters and I’ll show that in a minute.
By convention Redis keys are namespaced by a : so I’d store the check status for a specific node + check combination in a key like status:example.net:load
Updating or creating a new hash is real easy – just write to it:
def save_check(check)
key = "status:%s:%s" % [check.host, check.check]
check.last_state_change = @redis.hget(key, "last_state_change")
check.previous_exitcode = @redis.hget(key, "exitcode")
@redis.multi do
@redis.hset(key, "host", check.host)
@redis.hset(key, "check", check.check)
@redis.hset(key, "exitcode", check.exitcode)
@redis.hset(key, "lastcheck", check.last_check)
@redis.hset(key, "last_state_change", check.last_state_change)
@redis.hset(key, "output", check.output)
@redis.hset(key, "perfdata", check.perfdata)
unless check.changed_state?
@redis.hincrby(key, "count", 1)
else
@redis.hset(key, "count", 1)
end
end
check.count = @redis.hget(key, "count")
end
Here I assume we have an object that represents a check result called check and we’re more or less just fetching/updating data in it. I first retrieve the previously saved exitcode and last state change time and save those into the object. The object will do some internal state management to determine if the current check result represents a changed state – OK to WARNING etc – based on this information.
The @redis.multi starts a transaction; everything inside the block will be written in an atomic way by the Redis server, ensuring we do not have any half-baked state while other parts of the system might be reading the status of this check.
As I said, the check determines if the current result is a state change when I set the previous exitcode, so the unless check.changed_state? block will either just increment the count if nothing changed or reset it to 1 on a state change. We use the internal Redis counter handling – HINCRBY – to avoid having to first fetch the count, update it and save it again; this saves a round trip to the database.
You can now just retrieve the whole hash with the HGETALL command, even on the command line:
% redis-cli hgetall status:dev2.devco.net:load
1) "check"
2) "load"
3) "host"
4) "dev2.devco.net"
5) "output"
6) "OK - load average: 0.00, 0.00, 0.00"
7) "lastcheck"
8) "1357494721"
9) "exitcode"
10) "0"
11) "perfdata"
12) "{\"load15\":0.0,\"load5\":0.0,\"load1\":0.0}"
13) "last_state_change"
14) "1357412507"
15) "count"
16) "1178"
References: Redis Hashes, MULTI, HSET, HINCRBY, HGET, HGETALL
Staleness Tracking
Staleness Tracking here means we want to know when last we saw any data about a node; if the node is not providing information we need to go and see what happened to it. Maybe it’s up but the data sender died, or maybe it’s crashed.
This is where we really start using some of the Redis features to save us time. We need to track when last we saw a specific node and then be able to quickly find all nodes not seen within a certain amount of time, like 120 seconds.
We could retrieve all the check results and check their last updated time to figure it out, but that’s not optimal.
This is what Sorted Sets are for. Remember Sorted Sets have a weight and order the set by that weight; if we use the timestamp at which we last received data for a host as the weight, we can very quickly fetch a list of stale hosts.
def update_host_last_seen(host, time)
@redis.zadd("host:last_seen", time, host)
end
When we call this code like update_host_last_seen("dev2.devco.net", Time.now.utc.to_i) the host will either be added to or updated in the Sorted Set based on the current UTC time. We do this every time we save a new result set with the code in the previous section.
To get a list of hosts that we have not seen in the last 120 seconds is really easy now:
def get_stale_hosts(age)
@redis.zrangebyscore("host:last_seen", 0, (Time.now.utc.to_i - age))
end
If we call this with an age like 120 we’ll get an array of nodes that have not had any data within the last 120 seconds.
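To see why the weight-as-timestamp trick works, here is a minimal in-memory sketch of the sorted set semantics – SortedSetStub is my own stand-in, so the example runs without a live Redis server:

```ruby
# Stand-in mimicking just ZADD and ZRANGEBYSCORE: members are unique and
# indexed by a numeric score, exactly how we use last-seen timestamps.
class SortedSetStub
  def initialize
    @scores = {}
  end

  def zadd(_key, score, member)
    @scores[member] = score # re-adding a member just updates its score
  end

  def zrangebyscore(_key, min, max)
    @scores.select { |_m, s| s.between?(min, max) }
           .sort_by { |_m, s| s }
           .map { |m, _s| m }
  end
end

redis = SortedSetStub.new
now = Time.now.utc.to_i
redis.zadd("host:last_seen", now - 300, "dev1.devco.net") # not seen for 5 minutes
redis.zadd("host:last_seen", now - 10,  "dev2.devco.net") # seen 10 seconds ago
stale = redis.zrangebyscore("host:last_seen", 0, now - 120)
# stale contains only the host older than 120 seconds
```

The real Redis command does the same range query over its index, which is why the stale-host lookup stays fast no matter how many hosts are tracked.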
You can do the same check on the CLI, this shows all the machines not seen in the last 60 seconds:
% redis-cli zrangebyscore host:last_seen 0 $(expr $(date +%s) - 60)
1) "dev1.devco.net"
Reference: Sorted Sets, ZADD, ZRANGEBYSCORE
Event Notification
When a check result enters the system that’s either a state change, a problem, or has metrics associated with it, we’d want to send it on to other pieces of code.
We don’t know or care who those interested parties are, we only care that there might be some – it might be something writing to Graphite or OpenTSDB (or both at the same time) or something alerting to Opsgenie or PagerDuty. This is a classic use case for PubSub and Redis has a good PubSub subsystem that we’ll use for this.
I am only going to show the metrics publishing – problem and state changes are very similar:
def publish_metrics(check)
if check.has_perfdata?
msg = {"metrics" => check.perfdata, "type" => "metrics", "time" => check.last_check, "host" => check.host, "check" => check.check}.to_json
publish(["metrics", check.host, check.check], msg)
end
end
def publish(type, message)
target = ["overwatch", Array(type).join(":")].join(":")
@redis.publish(target, message)
end
This is pretty simple stuff, we’re just publishing some JSON to a named destination like overwatch:metrics:dev1.devco.net:load. We can now write small standalone single function tools that consume this stream of metrics and send it wherever we like – like Graphite or OpenTSDB.
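As an illustration, a standalone Graphite feeder only has to turn each event into Graphite’s plaintext protocol. graphite_lines below is a hypothetical helper of my own, but the event shape is the one published above – note that perfdata arrives as a JSON string inside the JSON message, so it needs a second parse:

```ruby
require "json"

# Convert one published metrics event into Graphite plaintext lines of
# the form "path value timestamp". Dots in hostnames become underscores
# so they do not create extra levels in the Graphite tree.
def graphite_lines(json_event, prefix: "overwatch")
  event = JSON.parse(json_event)
  metrics = JSON.parse(event["metrics"]) # perfdata is JSON-within-JSON
  metrics.map do |metric, value|
    path = [prefix, event["host"].tr(".", "_"), event["check"], metric].join(".")
    "%s %s %s" % [path, value, event["time"]]
  end
end
```

A consumer would subscribe to the metrics channels and write these lines to a TCP socket on the Graphite host.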
We publish similar events for any incoming check result that is not OK and also for any state transition like CRITICAL to OK, these would be consumed by alerter handlers that might feed pagers or SMS.
We’re publishing these alerts to destinations that include the host and specific check – this way we can very easily create individual host views of activity by doing pattern-based subscribes.
Reference: PubSub, PUBLISH
Alert Tracking
Alert Tracking means keeping track of which alerts we’ve already sent and when we’ll need to send them again – like only after 2 hours of the same problem, not on every check result which might come in every minute.
Leading on from the previous section we’d just consume the problem and state change PubSub channels and react on messages from those:
A possible consumer of this might look like this:
@redis.psubscribe("overwatch:state_change:*", "overwatch:issues:*") do |on|
on.pmessage do |channel, message|
event = JSON.parse(message)
case event["type"]
when "issue"
sender.notify_issue(event["issue"]["exitcode"], event["host"], event["check"], event["issue"]["output"])
when "state_change"
if event["state_change"]["exitcode"] == 0
sender.notify_recovery(event["host"], event["check"], event["state_change"]["output"])
end
end
end
end
This subscribes to the 2 channels and passes the incoming events to a notifier. Note we’re using patterns here to catch all alerts and changes for all hosts.
The problem is that without any special handling this is going to fire off alerts every minute, assuming we check the load every minute. This is where Redis expiry of keys comes in.
We’ll need to track which messages we have sent when and on any state change clear the tracking thus restarting the counters.
So we’ll just add keys called “alert:dev2.devco.net:load:3” to indicate an UNKNOWN state alert for load on dev2.devco.net:
def record_alert(host, check, status, expire=7200)
key = "alert:%s:%s:%d" % [host, check, status]
@redis.set(key, 1)
@redis.expire(key, expire)
end
This takes an expire time, which defaults to 2 hours, and tells Redis to just remove the key when its time is up.
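A small aside: SET followed by EXPIRE leaves a brief window where the key exists without a TTL. Redis’s SETEX does both in one atomic command, so the same helper could be sketched like this (the connection is passed as an argument purely so the example stands alone):

```ruby
# Same key scheme as above, but the value and its expiry are set in a
# single atomic SETEX command instead of SET followed by EXPIRE.
def record_alert(redis, host, check, status, expire = 7200)
  key = "alert:%s:%s:%d" % [host, check, status]
  redis.setex(key, expire, 1)
end
```

Either form works here since the worst case is one extra alert, but SETEX removes the need to reason about the gap.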
With this we need a way to figure out if we can send again:
def alert_ttl(host, check, status)
key = "alert:%s:%s:%d" % [host, check, status]
@redis.ttl(key)
end
This will return the number of seconds until the next alert, or -1 if we are ready to send again.
And finally, on every state change we need to purge all the tracking for a given node + check combo. The reason is that if we notified on CRITICAL a minute ago, and the service recovers to OK but soon goes to CRITICAL again, that most recent CRITICAL alert would be suppressed as part of the previous cycle of alerts.
def clear_alert_ttls(host, check)
@redis.del(@redis.keys.grep(/^alert:#{host}:#{check}:\d/))
end
So now I can show the two methods that will actually publish the alerts:
The first notifies of issues but only every @interval seconds and it uses the alert_ttl helper above to determine if it should or shouldn’t send:
def notify_issue(exitcode, host, check, output)
if (ttl = @storage.alert_ttl(host, check, exitcode)) == -1
subject = "%s %s#%s" % [status_for_code(exitcode), host, check]
message = "%s: %s" % [subject, output]
send(message, subject, @recipients)
@storage.record_alert(host, check, exitcode, @alert_interval)
else
Log.info("Not alerting %s#%s due to interval restrictions, next alert in %d seconds" % [host, check, ttl])
end
end
The second will publish recovery notices – we’d always want those and they will not repeat. Here we clear all the previous alert tracking to avoid incorrect alert suppressions:
def notify_recovery(host, check, output)
subject = "RECOVERY %s#%s" % [host, check]
message = "%s: %s" % [subject, output]
send_alert(message, subject, @recipients)
@storage.clear_alert_ttls(host, check)
end
References: SET, EXPIRE, SUBSCRIBE, TTL, DEL
Conclusion
This covered a few Redis basics but it’s a very rich system that can be used in many areas so if you are interested spend some quality time with its docs.
Using its facilities saved me a ton of effort while working on a small monitoring system. It is fast and lightweight and enables cross-language collaboration that I’d have found hard to replicate in a performant manner without it.
by R.I. Pienaar | Jan 1, 2013 | Code
Most Nagios systems do a lot of forking, especially those built around something like NRPE where each check is a connection to be made to a remote system. On one hand I like NRPE in that it puts the check logic on the nodes using a standard plugin format and provides a fairly re-usable configuration file, but on the other hand the fact that the Nagios machine has to do all this forking has never been good for me.
In the past I’ve shown one way to scale checks by aggregating all results for a specific check into one result, but this is not always a good fit as pointed out in the post. I’ve now built a system that uses the same underlying MCollective infrastructure as in the previous post but without the aggregation.
I have a pair of Nagios nodes – one in the UK and one in France – and they are on quite low spec VMs doing around 400 checks each. The problems I have are:
- The machines are constantly loaded under all the forking, one would sit on 1.5 Load Average almost all the time
- They use a lot of RAM and it’s quite spiky; especially when something is wrong I’d have a lot of checks running concurrently, so the machines have to be bigger than I want them
- The check frequency is quite low in the usual Nagios manner, sometimes 10 minutes can go by without a check
- The check results do not represent a point in time, I have no idea how the check results of node1 relate to those on node2 as they can be taken anywhere in the last 10 minutes
These are standard Nagios complaints though, and there are many more, but these specifically are what I wanted to address right now with the system I am showing here.
Probably not a surprise, but the solution is built on MCollective. It uses the existing MCollective NRPE agent and the existing queueing infrastructure to push the forking out to each individual node – they would do this anyway for every NRPE check – and reads the results off a queue, spooling them into the Nagios command file as passive results. Internally it splits the traditional MCollective request-response system into an async processing system using the technique I blogged about before.
The system is made up of a few components:
- The Scheduler takes care of publishing requests for checks
- MCollective and the middleware provides AAA and transport
- The nodes all run the MCollective NRPE agent, which puts their replies on the Queue
- The Receiver reads the results from the Queue and writes them to the Nagios command file
The Scheduler
The scheduler daemon is written using the excellent Rufus Scheduler gem – if you do not know it you totally should check it out, it solves many many problems. Rufus allows me to create simple checks on intervals like 60s and I combine these checks with MCollective filters to create a simple check configuration as below:
nrpe 'check_bacula_main', '6h', 'bacula::node monitored_by=monitor1'
nrpe 'check_disks', '60s', 'monitored_by=monitor1'
nrpe 'check_greylistd', '60s', 'greylistd monitored_by=monitor1'
nrpe 'check_load', '60s', 'monitored_by=monitor1'
nrpe 'check_mailq', '60s', 'monitored_by=monitor1'
nrpe 'check_mongodb', '60s', 'mongodb monitored_by=monitor1'
nrpe 'check_mysql', '60s', 'mysql::server monitored_by=monitor1'
nrpe 'check_pki', '60m', 'monitored_by=monitor1'
nrpe 'check_swap', '60s', 'monitored_by=monitor1'
nrpe 'check_totalprocs', '60s', 'monitored_by=monitor1'
nrpe 'check_zombieprocs', '60s', 'monitored_by=monitor1'
Taking the first line it says: Run the check_bacula_main NRPE check every 6 hours on machines with the bacula::node Puppet Class and with the fact monitored_by=monitor1. I had the monitored_by fact already to assist in building my Nagios configs using a simple search based approach in Puppet.
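For illustration, a DSL like that needs very little machinery. The sketch below evaluates such a config file and records each check; the class and struct names are my own invention, not the actual scheduler source, and a real version would register a Rufus interval job per record:

```ruby
# Each nrpe line in the config becomes one Check record.
Check = Struct.new(:name, :interval, :filter)

class CheckConfig
  attr_reader :checks

  def initialize
    @checks = []
  end

  # The method the config file calls:
  #   nrpe 'check_load', '60s', 'monitored_by=monitor1'
  def nrpe(name, interval, filter)
    @checks << Check.new(name, interval, filter)
  end

  # Evaluate the config text in the context of a fresh instance so the
  # bare nrpe calls resolve to the method above.
  def self.load(text)
    config = new
    config.instance_eval(text)
    config
  end
end

config = CheckConfig.load(<<~CONFIG)
  nrpe 'check_load',  '60s', 'monitored_by=monitor1'
  nrpe 'check_disks', '60s', 'monitored_by=monitor1'
CONFIG
```

The instance_eval trick is what lets the config file stay plain Ruby while reading like a declarative list of checks.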
When the scheduler starts it will log:
W, [2012-12-31T22:10:12.186789 #32043] WARN -- : activemq.rb:96:in `on_connecting' TCP Connection attempt 0 to stomp://nagios@stomp.example.net:6163
W, [2012-12-31T22:10:12.193405 #32043] WARN -- : activemq.rb:101:in `on_connected' Conncted to stomp://nagios@stomp.example.net:6163
I, [2012-12-31T22:10:12.196387 #32043] INFO -- : scheduler.rb:23:in `nrpe' Adding a job for check_bacula_main every 6h matching 'bacula::node monitored_by=monitor1', first in 19709s
I, [2012-12-31T22:10:12.196632 #32043] INFO -- : scheduler.rb:23:in `nrpe' Adding a job for check_disks every 60s matching 'monitored_by=monitor1', first in 57s
I, [2012-12-31T22:10:12.197173 #32043] INFO -- : scheduler.rb:23:in `nrpe' Adding a job for check_load every 60s matching 'monitored_by=monitor1', first in 23s
I, [2012-12-31T22:10:35.326301 #32043] INFO -- : scheduler.rb:26:in `nrpe' Publishing request for check_load with filter 'monitored_by=monitor1'
You can see it reads the file and schedules the first check at a random point between now and the interval window – this spreads out the checks.
The Receiver
The receiver has almost no config; it just needs to know what queue to read and where your Nagios command file lives. It logs:
I, [2013-01-01T11:49:38.295661 #23628] INFO -- : mnrpes.rb:35:in `daemonize' Starting in the background
W, [2013-01-01T11:49:38.302045 #23631] WARN -- : activemq.rb:96:in `on_connecting' TCP Connection attempt 0 to stomp://nagios@stomp.example.net:6163
W, [2013-01-01T11:49:38.310853 #23631] WARN -- : activemq.rb:101:in `on_connected' Conncted to stomp://nagios@stomp.example.net:6163
I, [2013-01-01T11:49:38.310980 #23631] INFO -- : receiver.rb:16:in `subscribe' Subscribing to /queue/mcollective.nagios_passive_results_monitor1
I, [2013-01-01T11:49:41.572362 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357040981] PROCESS_SERVICE_CHECK_RESULT;node1.example.net;mongodb;0;OK: connected, databases admin local my_db puppet mcollective
I, [2013-01-01T11:49:42.509061 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357040982] PROCESS_SERVICE_CHECK_RESULT;node2.example.net;zombieprocs;0;PROCS OK: 0 processes with STATE = Z
I, [2013-01-01T11:49:42.510574 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357040982] PROCESS_SERVICE_CHECK_RESULT;node3.example.net;zombieprocs;0;PROCS OK: 1 process with STATE = Z
As the results get pushed to Nagios I see the following in its logs:
[1357042122] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;node1.example.net;zombieprocs;0;PROCS OK: 0 processes with STATE = Z
[1357042124] PASSIVE SERVICE CHECK: node1.example.net;zombieprocs;0;PROCS OK: 0 processes with STATE = Z
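The command file format itself is simple; a hypothetical formatter matching the lines in the logs above would be:

```ruby
# Format one passive service check result the way Nagios expects it in
# the external command file: [timestamp] COMMAND;host;service;code;output
def passive_result(host, check, exitcode, output, time = Time.now.to_i)
  "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s" % [time, host, check, exitcode, output]
end
```

The receiver just appends lines like this to the Nagios command file (a named pipe), which Nagios then records as PASSIVE SERVICE CHECK entries.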
Did it solve my problems?
I listed the set of problems I wanted to solve so it’s worth evaluating if I did solve them properly.
Less load and RAM use on the Nagios nodes
My Nagios nodes have gone from load averages of 1.5 to 0.1 or 0.0 – they are doing nothing. They use a lot less RAM, and I have removed some of the RAM from one of them and given it to my Jenkins VM instead; it was a huge win. The scheduler and receiver are quite light on resources as you can see below:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
nagios 9757 0.4 1.8 130132 36060 ? S 2012 3:41 ruby /usr/bin/mnrpes-receiver --pid=/var/run/mnrpes/mnrpes-receiver.pid --config=/etc/mnrpes/mnrpes-receiver.cfg
nagios 9902 0.3 1.4 120056 27612 ? Sl 2012 2:22 ruby /usr/bin/mnrpes-scheduler --pid=/var/run/mnrpes/mnrpes-scheduler.pid --config=/etc/mnrpes/mnrpes-scheduler.cfg
On the RAM side I now never get a pile-up of many checks. I do have stale detection enabled in my Nagios template, so if something breaks in the scheduler/receiver/broker triplet Nagios will still fall back to a traditional active check to see what's going on, but that's bearable.
Check frequency too low
With this system I could do my checks every 10 seconds without any problems; I settled on 60 seconds as that's perfect for me. The Rufus scheduler does a great job of managing that, and the requests from the scheduler are effectively fire and forget as long as the broker is up.
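The interval scheduling itself is conceptually simple. Here is a toy sketch in plain Ruby of the idea – the real mnrpes-scheduler uses the rufus-scheduler gem and MCollective's RPC client, so this TinyScheduler class is purely illustrative and not the actual API:

```ruby
# A toy interval scheduler: register jobs with an interval, then ask which
# jobs are due at a given time. The real daemon runs this in a loop and
# publishes an MCollective request for each due check.
class TinyScheduler
  def initialize
    @jobs = []
  end

  # register a job to fire every +interval+ seconds
  def every(interval, name)
    @jobs << {name: name, interval: interval, last: nil}
  end

  # return the names of jobs due at time +now+ and mark them as run
  def due(now)
    @jobs.select { |j| j[:last].nil? || now - j[:last] >= j[:interval] }
         .each   { |j| j[:last] = now }
         .map    { |j| j[:name] }
  end
end

s = TinyScheduler.new
s.every(60, "check_load")
s.every(600, "check_disk")
p s.due(0)   # first tick: everything fires
p s.due(60)  # a minute later only check_load is due again
```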
Results are spread over 10 minutes
The problem of the load results for node1 and node2 having no temporal correlation is gone too – thanks to MCollective's parallel nature, all the load checks happen at the same time:
Here is the publisher:
I, [2013-01-01T12:00:14.296455 #20661] INFO -- : scheduler.rb:26:in `nrpe' Publishing request for check_load with filter 'monitored_by=monitor1'
…and the receiver:
I, [2013-01-01T12:00:14.380981 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node1.example.net;load;0;OK - load average: 0.92, 0.54, 0.42|load1=0.920;9.000;10.000;0; load5=0.540;8.000;9.000;0; load15=0.420;7.000;8.000;0;
I, [2013-01-01T12:00:14.383875 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node2.example.net;load;0;OK - load average: 0.00, 0.00, 0.00|load1=0.000;1.500;2.000;0; load5=0.000;1.500;2.000;0; load15=0.000;1.500;2.000;0;
I, [2013-01-01T12:00:14.387427 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node3.example.net;load;0;OK - load average: 0.02, 0.07, 0.07|load1=0.020;1.500;2.000;0; load5=0.070;1.500;2.000;0; load15=0.070;1.500;2.000;0;
I, [2013-01-01T12:00:14.388754 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node4.example.net;load;0;OK - load average: 0.07, 0.02, 0.00|load1=0.070;1.500;2.000;0; load5=0.020;1.500;2.000;0; load15=0.000;1.500;2.000;0;
I, [2013-01-01T12:00:14.404650 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node5.example.net;load;0;OK - load average: 0.03, 0.09, 0.04|load1=0.030;1.500;2.000;0; load5=0.090;1.500;2.000;0; load15=0.040;1.500;2.000;0;
I, [2013-01-01T12:00:14.405689 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node6.example.net;load;0;OK - load average: 0.06, 0.06, 0.07|load1=0.060;3.000;4.000;0; load5=0.060;3.000;4.000;0; load15=0.070;3.000;4.000;0;
I, [2013-01-01T12:00:14.489590 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node7.example.net;load;0;OK - load average: 0.06, 0.14, 0.14|load1=0.060;1.500;2.000;0; load5=0.140;1.500;2.000;0; load15=0.140;1.500;2.000;0;
All the results are from the same second, win.
Conclusion
So my scaling issues on my small site are solved, and I think the way this is built will work for many people. The code is on GitHub and requires MCollective 2.2.0 or newer.
Having reused the MCollective and Rufus libraries for all the legwork – logging, daemonizing, broker connectivity, addressing and security – I was able to build this in a very short time; the total code base is only 237 lines excluding packaging etc., which is a really low number for what it does.
by R.I. Pienaar | Dec 13, 2012 | Code
Back in September 2009 I wrote a blog post titled “Simple Puppet Module Structure” which introduced a simple approach to writing Puppet Modules. This post has been hugely popular in the community – but much has changed in Puppet since then so it is time for an updated version of that post.
As before I will show a simple module for a common scenario. Rather than considering this module a blueprint for every module out there you should instead study its design and use it as a starting point when writing your own modules. You can build on it and adapt it but the basic approach should translate well to more complex modules.
I should note that while I work for Puppet Labs, I do not know if this reflects any kind of standard approach suggested by Puppet Labs – this is what I do when managing my own machines and no more.
The most important deliverables
When writing a module I keep a few things in mind – these are all centered around downstream users of my module and future me trying to figure out what is going on:
- A module should have a single entry point where someone reviewing it can get an overview of its behaviour
- Modules that have configuration should be configurable in a single way and in a single place
- Modules should be made up of several single-responsibility classes. As far as possible these classes should be private details hidden from the user
- For the common use cases, users should not need to know individual resource names
- For the most common use case, users should not need to provide any parameters, defaults should be used
- Modules I write should have a consistent design and behaviour
The module layout I present below is designed so that someone who is curious about the behaviour of the module only has to look at init.pp to see:
- All the parameters and their defaults used to configure the behaviour of the module
- Overview of the internal structure of the module by way of descriptive class names
- Relationships and notifications that exist inside the module and what classes they can notify
This design will never remove the need for documenting your modules but a clear design will guide your users in discovering the internals of your module and how they interact with it.
More important than what a module does is how accessible it is to you and others – how easy it is to understand, debug and extend.
Thinking about your module
For this post I will write a very simple module to manage NTP – it really is very simple, so you should check the Forge for more complete ones.
To go from nowhere to having NTP on your machine you would have to:
- Install the packages and any dependencies
- Write out appropriate configuration files with some environment specific values
- Start the service or services you need once the configuration files are written. Restart them if the config files change later.
There is a clear implied dependency chain here and this basic pattern applies to most pieces of software.
These 3 points basically translate to distinct groups of actions, and sticking with the above principle of single-function classes I will create a class for each group.
To keep things clear and obvious I will call these classes install, config and service. The names don't matter as long as they are descriptive – but you really should pick something and stick with it in all your modules.
Writing the module
I’ll show the 3 classes that do the heavy lifting here and discuss parts of them afterwards:
class ntp::install {
package{'ntpd':
ensure => $ntp::version
}
}
class ntp::config {
$ntpservers = $ntp::ntpservers
File{
owner => root,
group => root,
mode => 644,
}
file{'/etc/ntp.conf':
content => template('ntp/ntp.conf.erb');
'/etc/ntp/step-tickers':
content => template('ntp/step-tickers.erb');
}
}
class ntp::service {
$ensure = $ntp::start ? {true => running, default => stopped}
service{"ntp":
ensure => $ensure,
enable => $ntp::enable,
}
}
Here I have 3 classes that serve a single purpose each and do not have any details like relationships, ordering or notifications in them. They roughly just do the one thing they are supposed to do.
Take a look at each class and you will see they use variables like $ntp::version, $ntp::ntpservers etc. These come from the main ntp class; let’s take a quick look at that class:
# == Class: ntp
#
# A basic module to manage NTP
#
# === Parameters
# [*version*]
# The package version to install
#
# [*ntpservers*]
# An array of NTP servers to use on this node
#
# [*enable*]
# Should the service be enabled during boot time?
#
# [*start*]
# Should the service be started by Puppet
class ntp(
$version = "present",
$ntpservers = ["1.pool.ntp.org", "2.pool.ntp.org"],
$enable = true,
$start = true
) {
class{'ntp::install': } ->
class{'ntp::config': } ~>
class{'ntp::service': } ->
Class["ntp"]
}
This is the main entry point into the module that was mentioned earlier. All the variables the module uses are documented in a single place, the basic design and parts of the module are clear, and you can see that the service class can be notified, as well as the relationships between the parts.
I use the new chaining features to inject the dependencies and relationships here, which surfaces these important interactions between the various classes back up to the main entry class where users can see them easily.
All this information is immediately available in the obvious place without looking at any additional files or by being bogged down with implementation details.
The final Class["ntp"] in the chain requires some extra explanation – it ensures that all the NTP member classes are applied before the main ntp class, so that in cases where someone says require => Class["ntp"] elsewhere they can be sure the associated tasks are completed. This is a lightweight version of the Anchor Pattern.
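For comparison, here is a sketch of what the full Anchor Pattern would look like – it needs the anchor type from puppetlabs-stdlib, and the class parameters are omitted for brevity:

```puppet
# Anchor Pattern equivalent: the anchors contain the member classes so that
# require => Class['ntp'] elsewhere waits for all of them.
class ntp {
  anchor{'ntp::begin': }  ->
  class{'ntp::install': } ->
  class{'ntp::config': }  ~>
  class{'ntp::service': } ->
  anchor{'ntp::end': }
}
```

Chaining back to Class["ntp"] gets the same containment effect without needing the two extra anchor resources.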
Using the module
Let’s look at how you might use this module from knowing nothing.
Ideally simply including the main entry point on a node should be enough:
include ntp
This does what you’d generally expect – it installs, configures and starts the NTP service.
After looking at the init.pp you can now supply some new values for some of the parameters to tune it for your needs:
class{"ntp": ntpservers => ["ntp1.example.com", "ntp2.example.com"]}
Or you can use the new data bindings in Puppet 3 and supply new data in Hiera to override these variables by supplying data for the keys like ntp::ntpservers.
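With data bindings that override is just Hiera data – a sketch, assuming a common.yaml somewhere in your hierarchy:

```yaml
# Puppet 3 automatically binds these keys to the ntp class parameters
ntp::ntpservers:
  - ntp1.example.com
  - ntp2.example.com
ntp::start: true
```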
Finally, if for some reason you need to restart the service, you know from looking at the ntp class that you can notify the ntp::service class to achieve that.
Using classes for relationships
There’s a huge thing to note here in the main ntp class: I specify all relationships and notifies on the classes and not on the resources themselves.
As a matter of personal style I only mention resources by name inside the class that contains them – if I ever have to access a resource outside of the class it is contained in, I access the class instead.
I would not write:
class ntp::service {
service{"ntp": require => File["/etc/ntp.conf"]}
}
There are many issues with this approach that mostly come down to maintenance headaches. Here I require the ntp config file, but what if a service has more than one file? Do you then list all the files? Do you later edit every class that references them when another file gets managed?
These issues quickly multiply in a large code base. By always acting on class names, and by creating many small single-purpose classes as here, I effectively contain these details behind group names rather than individual resource names. This way any future refactoring of individual classes will not have an impact on other classes.
So the above snippet would rather be something like this:
class ntp::service {
service{"ntp": require => Class["ntp::config"]}
}
Here I require the containing class and not the resource. This has the effect of requiring all resources inside that class, which isolates changes to that class and avoids a situation where users have to worry about the internal implementation details of the other class. Along the same lines you can also notify a class – and all resources inside that class get notified.
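As a hypothetical example of notifying a class from elsewhere in your manifests – the exec and its command here are made up for illustration, not part of the module:

```puppet
# When this exec runs it refreshes every resource contained in
# ntp::service; no knowledge of the service resource's name is needed.
exec{'regenerate-ntp-keys':
  command     => '/usr/sbin/ntp-keygen',
  refreshonly => true,
  notify      => Class['ntp::service'],
}
```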
I only include other classes at the top ntp level and never have include statements in classes like ntp::config and so forth – this means when I require the class ntp::config or notify ntp::service I get just what I want and no more.
If you create big complex classes you run the risk of having refreshonly execs related to configuration or installation in the same class as services, which could have disastrous consequences if you notify the wrong thing or if a user does not study your code before using it.
A consistent style of small, single-purpose, descriptively named classes avoids these and other problems.
What we learned and further links
There is a lot to learn here and much of it is about soft issues like the value of consistency and clarity of design and thinking about your users – and your future self.
On the technical side you should learn about the effects of relationships and notifications that act on containing classes rather than on individual resources by name.
And we came across a number of recently added Puppet features:
Parameterized Classes are used to provide multiple convenient methods for supplying data to your module – defaults in the module, explicitly in code, via Hiera, and (not shown here) via an ENC.
Chaining Arrows are used in the main class to inject the dependencies and notifications in a way that is visible without having to study each individual class.
These are important new additions to Puppet. Some, like parameterized classes, are not quite ready for prime time imho, but in Puppet 3, combined with the data bindings, a lot of the pain points have been removed.
Finally there are a number of useful things I did not mention here. Specifically you should study the Puppet Style Guide and use the Puppet Lint tool to validate your modules comply. You should consider writing tests for your modules using rspec-puppet and finally share it on the Puppet Forge.
And perhaps most importantly – do not reinvent the wheel, check the Forge first.