by R.I. Pienaar | Dec 26, 2014 | Code
Since 1999 I've kept a record of most places I've traveled to. In the old days I used a map viewer from Xerox PARC to view these travels, then XPlanet which made a static image. Back in 2005, as Google Maps became usable from JavaScript, I made something to show my travels on an interactive map. It was using GMapsEZ and PHP to draw points from an XML file.
Since then Google made their v2 API defunct, and something went bad with the old PHP code, so the time came to revisit all of this for the 4th iteration of a map tracking my travels.
Google Earth came out in 2005 as well – so just a bit late for me to use its data formats – but today it seems obvious that the data belongs in a KML file. Hand building KML files though is not on, so I needed something to build the KML file in Ruby.
My new app maintains points in YAML files which have more or less the same format as the old PHP system.
To let people come up with their own categories of points, you first define a bunch of point types:
:types:
  :visit:
    :icon: http://your.site/markers/mini-ORANGE-BLANK.png
  :transit:
    :icon: http://your.site/markers/mini-BLUE-BLANK.png
  :lived:
    :icon: http://your.site/markers/mini-GREEN-BLANK.png
And then we have a series of points each referencing a type:
:points:
  - :type: :visit
    :lon: -73.961334
    :title: New York
    :lat: 40.784506
    :country: United States
    :comment: Sample Data
    :href: http://en.wikipedia.org/wiki/New_York
    :linktext: Wikipedia
  - :type: :transit
    :lon: -71.046524
    :title: Boston
    :lat: 42.363871
    :country: United States
    :comment: Sample Data
    :href: http://en.wikipedia.org/wiki/Boston
    :linkimg: https://pbs.twimg.com/profile_images/430836891198320640/_-25bnPr.jpeg
Here we have 2 points; both link to Wikipedia, one using text and one using an image, and one is a visit while the other is a transit.
I use the ruby_kml Gem to convert this into KML:
First we set up the basic document and define the types using KML styles:
kml = KMLFile.new
document = KML::Document.new(:name => "Travlrmap Data")

@config[:types].each do |k, t|
  document.styles << KML::Style.new(
    :id => "travlrmap-#{k}-style",
    :icon_style => KML::IconStyle.new(:icon => KML::Icon.new(:href => t[:icon]))
  )
end
This sets up the types and gives them names like travlrmap-visit-style.
We’ll now reference these in the KML file for each point:
folder = KML::Folder.new(:name => "Countries")
folders = {}

@points.sort_by{|p| p[:country]}.each do |point|
  unless folders[point[:country]]
    folder.features << folders[point[:country]] = KML::Folder.new(:name => point[:country])
  end

  folders[point[:country]].features << KML::Placemark.new(
    :name => point[:title],
    :description => point_comment(point),
    :geometry => KML::Point.new(:coordinates => {:lat => point[:lat], :lng => point[:lon]}),
    :style_url => "#travlrmap-#{point[:type]}-style"
  )
end

document.features << folder
kml.objects << document

kml.render
The points are put in folders by individual country, so in Google Earth I get a nice list of countries to enable and disable as I please.
I am not showing how I create the comment HTML here – it's the point_comment method – it's just boring code with a bunch of ifs around linkimg, linktext and href. KML descriptions do not support all of HTML but the basics are there, so this is pretty easy.
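For illustration, a rough sketch of what such a helper could look like – the exact markup the real method produces differs, this just shows the shape of it:

# A hypothetical sketch only - the real point_comment method has a few more
# ifs and produces slightly different HTML.
def point_comment(point)
  html = "<b>%s</b>" % point[:title]
  html << "<p>%s</p>" % point[:comment] if point[:comment]

  if point[:href]
    if point[:linkimg]
      html << '<p><a href="%s"><img src="%s"/></a></p>' % [point[:href], point[:linkimg]]
    elsif point[:linktext]
      html << '<p><a href="%s">%s</a></p>' % [point[:href], point[:linktext]]
    end
  end

  html
end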
So these are the basics of making a KML file from your own data. It's fairly easy, though the docs for ruby_kml aren't that hot and specifically don't tell you that you have to wrap all the points, styles and so forth in a document as I have done here – that seems to be a recent requirement of the KML spec though.
Next up we have to get this stuff onto a Google map in a browser. As KML is the format Google Earth uses, it's safe to assume the Google Maps API supports it directly. Still, a bit of sugar around the Google APIs is nice because they can be a bit verbose. Previously I used GMapsEZ – which I ended up really hating as the author did all kinds of things like refuse to make it available for download, instead hosting it on an unstable host. Now I'd say you must use gmaps.js to make it really easy.
For viewing a KML file, you basically just need this – more or less directly from their docs – there’s some ERB template stuff in here to set up the default view port etc:
<script type="text/javascript">
  var map;

  $(document).ready(function(){
    infoWindow = new google.maps.InfoWindow({});

    map = new GMaps({
      div: '#main_map',
      zoom: <%= @map_view[:zoom] %>,
      lat: <%= @map_view[:lat] %>,
      lng: <%= @map_view[:lon] %>,
    });

    map.loadFromKML({
      url: 'http://your.site/kml',
      suppressInfoWindows: true,
      preserveViewport: true,
      events: {
        click: function(point){
          infoWindow.setContent(point.featureData.infoWindowHtml);
          infoWindow.setPosition(point.latLng);
          infoWindow.open(map.map);
        }
      }
    });
  });
</script>
Make sure there's a main_map div set up with your desired size and the map will show up there. Really easy.
You can see this working on my new travel site at travels.devco.net. The code is on Github as usual but it’s a bit early days for general use or release. The generated KML file can be fetched here.
Right now it supports a subset of the older PHP code's features – mainly drawing lines is missing. I hope to add a way to provide some kind of index of GPX files to show tracks as I have a few of those. Turning a GPX file into a KML file is pretty easy and the above JS code should show it without modification.
I'll post a follow up here once the code is sharable; if you're brave though and know Ruby you can grab the travlrmap gem to install your own.
by R.I. Pienaar | Dec 9, 2013 | Code, Uncategorized
My recent post about using Hiera data in modules has had a great level of discussion already, several thousand blog views, comments, tweets and private messages on IRC. Thanks for the support and encouragement – it’s clear this is a very important topic.
I want to expand on yesterday's post by giving some background information on the underlying motivations that caused me to write this feature and why having it as a forge module is highly undesirable but the only current option.
At the heart of this discussion is the params.pp pattern and general problems with it. To recap, the basic idea is to embed all your default data into a file params.pp, typically in huge case statements, and then reference this data as defaults. Some examples of this are the puppetlabs-ntp module, the Beginners Guide to Modules and the example I had in the previous post that I'll reproduce below:
# ntp/manifests/init.pp
class ntp (
  # allow for overrides using resource syntax or data bindings
  $config = $ntp::params::config,
  $keys_file = $ntp::params::keys_file
) inherits ntp::params {
  # validate values supplied
  validate_absolute_path($config)
  validate_absolute_path($keys_file)

  # optionally derive new data from supplied data

  # use data
  file{$config:
    ....
  }
}
# ntp/manifests/params.pp
class ntp::params {
  # set OS specific values
  case $::osfamily {
    'AIX': {
      $config = "/etc/ntp.conf"
      $keys_file = '/etc/ntp.keys'
    }
    'Debian': {
      $config = "/etc/ntp.conf"
      $keys_file = '/etc/ntp/keys'
    }
    'RedHat': {
      $config = "/etc/ntp.conf"
      $keys_file = '/etc/ntp/keys'
    }
    default: {
      fail("The ${module_name} module is not supported on an ${::osfamily} based system.")
    }
  }
}
Now today as Puppet stands this is pretty much the best we can hope for. This achieves a lot of useful things:
- The data that provides OS support is contained and separate
- You can override it using resource style syntax or Puppet 3 data bindings
- The data provided using any means is validated
- New data can be derived by combining supplied or default data
You can now stick this module on the forge and users can use it; it supports many Operating Systems and works on pretty much any Puppet going back quite a way. These are all good things.
The list above also demonstrates the main purpose for having data in a module – different OS/environment support, allowing users to supply their own data, validation and to transmogrify the data. The params.pp pattern achieves all of this.
So what’s the problem then?
The problem is: the data is in the code. In the pre-extlookup and Hiera days we put our site data in case statements or inheritance trees or node data or any number of different solutions. These all solved the basic problem – our site got configured and our boxes got built – just like the params.pp pattern solves the basic problem. But we wanted more, we wanted our data separate from our code. Not only did it seem natural because almost every other known programming language supports and embraces this, but as Puppet users we wanted a number of things:
- Less logic, syntax, punctuation and “programming” and more just files that look a whole lot like configuration
- Better layering than inheritance and other tools at our disposal allowed. We want to structure our configuration like we do our DCs and environments and other components – these form a natural series of layered hierarchies.
- We do not want to change code when we want to use it, we want to configure that code to behave according to our site needs. In a CM world data is configuration.
- If we're in an environment that does not let us open source our work or contribute to open source repositories we do not want to be forced to fork and modify open source code just to use it in our environments. We want to configure the code. Compliance needs should not force us to solve every problem in house.
- We want to plug into existing data sources like LDAP or be able to create self service portals for our users to supply this configuration data. But we do not want to change our manifests to achieve this.
- We do not want to be experts at using source control systems. We use them, we love them and agree they are needed. But like everything less is more. Simple is better. A small simple workflow we can manage at 2am is better than a complex one.
- We want systems we can reason about. A system that takes configuration in the form of data trumps one that needs programming to change its behaviour.
- Above all we want a system that's designed with our use cases in mind. Our User Experience needs are different from programmers'. Our data needs are different and hugely complex. Our CM system must both guide us in its design and be compatible with our existing approaches. We do not want to have to write our own external node data sources simply because our language does not provide solid solutions to this common problem.
I created Hiera with these items in mind after years of talking to probably 1000+ users and iterating on extlookup in order to keep pace with the Puppet language gaining support for modern constructs like Hashes. True, it's not a perfect solution to all these points – transparency of data origin to name but one – but there are approaches to make small improvements to achieve these and it does solve a high percentage of the above problems.
Over time Hiera has gained a tremendous following – it’s now the de facto standard to solving the problem of site configuration data largely because it’s pragmatic, simple and designed to suit the task at hand. In recognition of this I donated the code to Puppet Labs and to their credit they integrated it as a default prerequisite and created the data binding systems. The elephant in the room is our modules though.
We want to share our modules with other users. To do this we need to support many operating systems. To do this we need to create a lot of data in the modules. We can’t use Hiera to do this in a portable fashion because the module system needs improvement. So we’re stuck in the proverbial dark ages by embedding our data in code and gaining none of the advantages Hiera brings to site data.
Now we have a few options open to us. We can just suck it up and keep writing params.pp files gaining none of the above advantages that Hiera brings. This is not great and the puppetlabs-ntp module example I cited shows why. We can come up with ever more elaborate ways to wrap and extend and override the data provided in a params.pp or even far out ideas like having the data binding system query the params.pp data directly. In other words we can pander to the status quo, we can assume we cannot improve the system instead we have to iterate on an inherently bad idea. The alternative is to improve Puppet.
Every time the question of params.pp comes up the answer seems to be how to improve how we embed data in the code. This is absolutely the wrong answer. The answer should be how do we improve Puppet so that we do not have to embed data in code. We know people want this, the popularity and wide adoption of Hiera has shown that they do. The core advantages of Hiera might not be well understood by all, but the userbase does understand and treasure the gains they get from using it.
Our task is to support the community in the investment they made in Hiera. We should not be rewriting it in a non backwards compatible way throwing away past learnings simply because we do not want to understand how we got here. We should be iterating with small additions and rounding out this feature as one solid ever present data system that every user of Puppet can rely on being present on every Puppet system.
Hiera adoption has reached critical mass, it’s now the solution to the problem. This is a great and historical moment for the Puppet Community, to rewrite it or throw it away or propose orthogonal solutions to this problem space is to do a great disservice to the community and the Puppet product as a whole.
Towards this I created a Hiera backend that goes some way to resolving this in a way that's a natural progression of the design of Hiera. It improves the core features provided by Puppet in a way that will allow better patterns than the current params.pp one to be created, which will in the long run greatly improve the module writing and sharing experience. This is what my previous blog post introduced: a way forward from the current params.pp situation.
Now by rights a solution to this problem belongs in Puppet core. A Puppet Forge dependent module just to get this ability, especially one not maintained by Puppet Labs, especially one that monkey patches its way into the system, is not desirable at all. This is why the code was a PR first. The only alternatives are to wait in the dark – numerous queries by many members of the community to the Puppet product owner have yielded only vague statements of intent or outcome – or to take it upon ourselves to improve the system.
So I hope the community will support me in using this module and work with me to come up with better patterns to replace the params.pp ones – iterating on and improving the system as a whole rather than just sucking up the status quo and not moving forward.
by R.I. Pienaar | Dec 8, 2013 | Code
When writing Puppet Modules there tends to be a ton of configuration data – generally things like different paths for different operating systems. Today the general pattern to manage this data is a class module::params with a bunch of logic in it.
Here’s a simplistic example below – for an example of the full horror of this pattern see the puppetlabs-ntp module.
# ntp/manifests/init.pp
class ntp (
  $config = $ntp::params::config,
  $keys_file = $ntp::params::keys_file
) inherits ntp::params {
  file{$config:
    ....
  }
}
# ntp/manifests/params.pp
class ntp::params {
  case $::osfamily {
    'AIX': {
      $config = "/etc/ntp.conf"
      $keys_file = '/etc/ntp.keys'
    }
    'Debian': {
      $config = "/etc/ntp.conf"
      $keys_file = '/etc/ntp/keys'
    }
    'RedHat': {
      $config = "/etc/ntp.conf"
      $keys_file = '/etc/ntp/keys'
    }
    default: {
      fail("The ${module_name} module is not supported on an ${::osfamily} based system.")
    }
  }
}
This is the exact reason Hiera exists – to remove this kind of spaghetti code and move it into data. Instinctively now, whenever anyone sees code like this they think they should refactor it and move the data into Hiera.
But there's a problem. This works for your own modules in your own repos – you'd just use the Puppet 3 automatic parameter bindings and override the values in the ntp class. Not ideal, but many people do it. If however you want to write a module for the Forge there's a hitch, because the module author has no idea what kind of hierarchy exists where the module is used, or whether the site even uses Hiera – and today the module author can't ship data with his module. So the only sensible thing to do is to embed a bunch of data in your code – the exact thing Hiera is supposed to avoid.
I proposed a solution to this problem that would allow module authors to embed data in their modules as well as control the Hierarchy that would be used when accessing this data. Unfortunately a year on we’re still nowhere and the community – and the forge – is suffering as a result.
The proposed solution would be an always-on Hiera backend that as a last resort would look for data inside the module. Critically, the module author controls the hierarchy when it gets to the point of accessing data in the module. Consider the ntp::params class above: it is a code version of a Hiera hierarchy keyed on the $::osfamily fact. But if we just allowed the module to supply data inside the module then the module author has to just hope that everyone has this tier in their hierarchy – not realistic. My proposal therefore adds a module specific hierarchy and data that gets consulted after the site hierarchy.
So let's look at how to rework this module around this proposed solution:
# ntp/manifests/init.pp
class ntp ($config, $keysfile) {
  validate_absolute_path($config)
  validate_absolute_path($keysfile)

  file{$config:
    ....
  }
}
Next you configure Hiera to consult a hierarchy on the $::osfamily fact, note the new data directory that goes inside the module:
# ntp/data/hiera.yaml
---
:hierarchy:
  - "%{::osfamily}"
And finally we create some data files, here’s just the one for RedHat:
# ntp/data/RedHat.yaml
---
ntp::config: /etc/ntp.conf
ntp::keys_file: /etc/ntp/keys
Users of the module could add a new OS without contributing back to the module or forking the module by simply providing similar data to the site specific hierarchy leaving the downloaded module 100% untouched!
This is a very simple view of what this pattern allows; time will tell what the community makes of it. There are many advantages to this over the ntp::params pattern:
This helps the contributor to a public module:
- Adding a new OS is easy, just drop in a new YAML file. This can be done with confidence as it will not break existing code since it will only be read on machines of the new OS. No complex case statements or hundreds of braces to get right
- On a busy module, when adding a new OS they do not have to worry about complex merge problems, working hard at rebasing or any git esoterica – they're just adding a file
- Syntactically it’s very easy, it’s just a YAML file. No complex case statements etc.
- The contributor does not have to worry about breaking other Operating Systems he could not test on, like AIX here. The change is contained to machines for the new OS
- In large environments this helps with change control as it's just data – no logic changes
This helps the maintainer of a module:
- Module maintenance is easier when it comes to adding new Operating Systems as it's just simple single files
- Easier contribution reviews
- Fewer merge commits, less git magic needed, cleaner commit history
- The code is a lot easier to read and maintain. Fewer tests and validations are needed.
This helps the user of a module:
- Well written modules now properly support supplying all data from Hiera
- He has a single place to look for the overridable data
- When using a module that does not support his OS he can deploy it into his site and just provide data instead of forking it
Today I am releasing my proposed code as a standalone module. It provides all the advantages above including the fact that it’s always on without any additional configuration needed.
It works exactly as above by adding a data directory with a hiera.yaml inside it. The only configuration being considered in this hiera.yaml is the hierarchy.
This module is new and does some horrible things to get itself activated automatically without any configuration. I've only tested it on Puppet 3.2.x but I think it will work on other 3.x releases as is. I'd love to get feedback on this from users.
If you want to write a forge module that uses this feature, simply add a dependency on the ripienaar/module_data module; as soon as someone installs this dependency along with your module the backend gets activated. Similarly if you just want to use this feature in your own modules, just puppet module install ripienaar/module_data.
Note though that if you do your module will only work on Puppet 3 or newer.
It’s unfortunate that my Pull Request is now over a year old and did not get merged and no real progress is being made. I hope if enough users adopt this solution we can force progress rather than sit by and watch nothing happen. Please send me your feedback and use this widely.
by R.I. Pienaar | Oct 10, 2013 | Code
When using Puppet you often run it in a single run mode on the CLI and then go afk. When you return you might notice it was slow for some or other reason, but you did not run it with --evaltrace and in debug mode so the information to help you answer this simply isn't present – or it scrolled off or got rotated away from your logs.
Typically you’d deploy something like foreman or report handlers on your masters which would receive and display reports. But while you’re on the shell it’s a big context switch to go and find the report there.
Puppet now saves reports in its state dir – including with apply if you ran it with --write-catalog-summary – and in recent versions these reports include the performance data that you'd otherwise only find with --evaltrace.
So to solve this problem I wrote a little tool to show reports on the CLI. It's designed to run on the shell of the node in question and as root. If you do this it will automatically pick up the latest report and print it, and it will also go through and check the sizes of files and show you stats. You can run it against saved reports on some other node but you'll lose some utility. The main focus of the information presented is to let you see logs from the past run, but also information that helps you answer why it was slow to run.
It's designed to work well with very recent versions of Puppet, maybe even only 3.3.0 and newer; I've not tested it on older versions but will gladly accept patches.
Here are some snippets of a report of one of my nodes and some comments about the sections. A full sample report can be found here.
First it’s going to show you some metadata about the report, what node, when for etc:
sudo report_print.rb
Report for puppetmaster.example.com in environment production at Thu Oct 10 13:37:04 +0000 2013
Report File: /var/lib/puppet/state/last_run_report.yaml
Report Kind: apply
Puppet Version: 3.3.1
Report Format: 4
Configuration Version: 1381412220
UUID: 99503fe8-38f2-4441-a530-d555ede9067b
Log Lines: 350 (show with --log)
Some important information here: you can see it figured out where to find the report by parsing the Puppet config – agent section – what version of Puppet was used and what report format. You can also see the report has 350 lines of logs in it, but it isn't showing them by default.
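If you want to poke at the same data by hand, the report is just a YAML serialised Ruby object; a minimal sketch of reading it, assuming a default Puppet 3 statedir and going from memory of the Puppet 3 report object's attributes, could look roughly like this:

require 'puppet'   # needed so YAML can rebuild the serialised report object
require 'yaml'

# Assumption: default Puppet 3 statedir - adjust the path for your setup.
report = YAML.load_file("/var/lib/puppet/state/last_run_report.yaml")

puts "%s (%s) - Puppet %s, report format %s" % [report.host, report.kind, report.puppet_version, report.report_format]

# each metric category holds [name, label, value] triples
report.metrics["time"].values.sort_by { |_, _, v| -v }.each do |name, label, value|
  puts "  %-25s %.2f" % [label, value]
end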
Next up it shows you a bunch of metrics from the report:
Report Metrics:
   Changes:
      Total: 320
   Events:
      Total: 320
      Success: 320
      Failure: 0
   Resources:
      Total: 436
      Out of sync: 317
      Changed: 317
      Restarted: 7
      Failed to restart: 0
      Skipped: 0
      Failed: 0
      Scheduled: 0
   Time:
      Total: 573.671295
      Package: 509.544123
      Exec: 33.242635
      Puppetdb conn validator: 22.767754
      Config retrieval: 4.096973
      File: 1.343388
      User: 1.337979
      Service: 1.180588
      Ini setting: 0.127856
      Anchor: 0.013984
      Datacat collector: 0.008954
      Host: 0.003265
      Datacat fragment: 0.00277
      Schedule: 0.000504
      Group: 0.00039
      Filebucket: 0.000132
These are numerically sorted and the useful stuff is in the last section – what types were to blame for the biggest slowness in your run. Here we can see we spent 509 seconds just doing packages.
Having seen how long each type of resource took, it then shows you a little report of how many resources of each type were found:
Resources by resource type:
288 File
30 Datacat_fragment
25 Anchor
24 Ini_setting
22 User
18 Package
9 Exec
7 Service
6 Schedule
3 Datacat_collector
1 Group
1 Host
1 Puppetdb_conn_validator
1 Filebucket
From here you'll see detail about resources and files, times, sizes etc. By default it's going to show you 20 of each but you can increase that using the --count argument.
First we see the evaluation time by resource – this is how long the agent spent completing a specific resource:
Slowest 20 resources by evaluation time:
356.94 Package[activemq]
41.71 Package[puppetdb]
33.31 Package[apache2-prefork-dev]
33.05 Exec[compile-passenger]
23.41 Package[passenger]
22.77 Puppetdb_conn_validator[puppetdb_conn]
22.12 Package[libcurl4-openssl-dev]
10.94 Package[httpd]
4.78 Package[libapr1-dev]
3.95 Package[puppetmaster]
3.32 Package[ntp]
2.75 Package[puppetdb-terminus]
2.71 Package[mcollective-client]
1.86 Package[ruby-stomp]
1.72 Package[mcollective]
0.58 Service[puppet]
0.30 Service[puppetdb]
0.18 User[jack]
0.16 User[jill]
0.16 User[ant]
You can see by far the longest here was the activemq package that took 356 seconds and contributed most to the 509 seconds that Package types took in total. A clear indication that maybe this machine is picking the wrong mirrors or that I should create my own nearby mirror.
File serving in Puppet is notoriously slow, so when run as root on the node in question it will look for all File resources and print their sizes. Unfortunately it can't know if a file's contents came from source or content as that information isn't in the report. Still, this might give you some information on where to target optimization. In this case nothing really stands out:
20 largest managed files (only those with full path as resource name that are readable)
6.50 KB /usr/local/share/mcollective/mcollective/util/actionpolicy.rb
3.90 KB /etc/mcollective/facts.yaml
3.83 KB /var/lib/puppet/concat/bin/concatfragments.sh
2.78 KB /etc/sudoers
1.69 KB /etc/apache2/conf.d/puppetmaster.conf
1.49 KB /etc/puppet/fileserver.conf
1.20 KB /etc/puppet/rack/config.ru
944.00 B /etc/apache2/apache2.conf
573.00 B /etc/ntp.conf
412.00 B /usr/local/share/mcollective/mcollective/util/actionpolicy.ddl
330.00 B /etc/apache2/mods-enabled/passenger.conf
330.00 B /etc/apache2/mods-available/passenger.conf
262.00 B /etc/default/puppet
215.00 B /etc/apache2/mods-enabled/worker.conf
215.00 B /etc/apache2/mods-available/worker.conf
195.00 B /etc/apache2/ports.conf
195.00 B /var/lib/puppet/concat/_etc_apache2_ports.conf/fragments.concat
195.00 B /var/lib/puppet/concat/_etc_apache2_ports.conf/fragments.concat.out
164.00 B /var/lib/puppet/concat/_etc_apache2_ports.conf/fragments/10_Apache ports header
158.00 B /etc/puppet/hiera.yaml
And finally if I ran it with --log I'd get the individual log lines:
350 Log lines:
Thu Oct 10 13:37:06 +0000 2013 /Stage[main]/Concat::Setup/File[/var/lib/puppet/concat]/ensure (notice): created
Thu Oct 10 13:37:06 +0000 2013 /Stage[main]/Concat::Setup/File[/var/lib/puppet/concat/bin]/ensure (notice): created
Thu Oct 10 13:37:06 +0000 2013 /Stage[main]/Concat::Setup/File[/var/lib/puppet/concat/bin/concatfragments.sh]/ensure (notice): defined content as '{md5}2fbba597a1513eb61229551d35d42b9f'
.
.
.
The code is on GitHub. I'd like to make it available as a Puppet Forge module but there really is no usable option to achieve this – the Puppet Face framework is the best available option, but the UX is so poor that I would not like to expose anyone to it just to use my code.
by R.I. Pienaar | Jan 6, 2013 | Code
Redis is an in-memory key-value data store that provides a small number of primitives suitable to the task of building monitoring systems. As a lot of us are hacking in this space I thought I'd write a blog post summarizing where I've been using it in a little Sensu-like monitoring system I have been working on, on and off.
There are some monitoring related events coming up, like MonitoringLove in Antwerp and Monitorama in Boston – I will be attending both and I hope a few members of the community will create similar background posts on various interesting areas before these events.
I've only recently started looking at Redis but really like it. It's a very lightweight daemon written in C with fantastic documentation detailing things like each command's performance characteristics, and most documentation pages are live in that they have a REPL right on the page – like the SET page – note you can type into the code sample and see your changes in real time. It is sponsored by VMware and released under the 3 clause BSD license.
Redis Data Types
Redis provides a few common data structures:
- Normal key-value storage where every key has just one string value
- Hashes where every key contains a hash of key-values strings
- Lists of strings – basically just plain old Arrays sorted in insertion order that allow duplicate values
- Sets are a bit like Lists but with the addition that a given value can only appear in the set once
- Sorted Sets are sets where, in addition to the value, each entry also has a weight associated with it; the set is indexed by weight
All the keys support things like expiry based on time and TTL calculation. Additionally it also supports PubSub.
At first it can be hard to imagine how you’d use a data store with only these few data types and capable of only storing strings for monitoring but with a bit of creativity it can be really very useful.
The full reference about all the types can be found in the Redis Docs: Data Types
Monitoring Needs
Monitoring systems generally need a number of different types of storage. These are configuration, event archiving, and status and alert tracking. There are more but these are the big ticket items; of the 3 I am only going to focus on the last one here – Status and Alert Tracking.
Status tracking is essentially transient data. If you lose your status view it's not really a big deal, it will be recreated quite quickly as new check results come in. Worst case you'll get some alerts again that you recently got. This fits well with Redis, which doesn't always commit data as soon as it receives it – it flushes from memory to disk roughly every second.
Redis does not provide much by way of SSL or strong authentication so I tend to consider it a single node IPC system rather than, say, a generic PubSub system. I feed data into a node using a system like ActiveMQ and then for comms and state tracking on a single node I'll use Redis.
I’ll show how it can be used to solve the following monitoring related storage/messaging problems:
- Check Status – a check like load on every node
- Staleness Tracking – you need to know when a node is not receiving check results so you can do alive checks
- Event Notification – your core monitoring system will likely feed into alerters like Opsgenie and metric storage like Graphite
- Alert Tracking – you need to know when you last sent an alert and when you can alert again based on an interval like every 2 hours
Check Status
The check is generally the main item of monitoring systems. Something configures a check like load and then every node gets check results for this item; the monitoring system has to track the status of the checks on a per node basis.
In my example a check result looks more or less like this:
{"lastcheck" => "1357490521",
"count" => "1143",
"exitcode" => "0",
"output" => "OK - load average: 0.23, 0.10, 0.02",
"last_state_change"=> "1357412507",
"perfdata" => '{"load15":0.02,"load5":0.1,"load1":0.23}',
"check" => "load",
"host" => "dev2.devco.net"} |
{"lastcheck" => "1357490521",
"count" => "1143",
"exitcode" => "0",
"output" => "OK - load average: 0.23, 0.10, 0.02",
"last_state_change"=> "1357412507",
"perfdata" => '{"load15":0.02,"load5":0.1,"load1":0.23}',
"check" => "load",
"host" => "dev2.devco.net"}
This is standard stuff and the most boring part – you might guess this goes into a Hash and you'd be right. Note the count item there: Redis has special handling for counters and I'll show that in a minute.
By convention Redis keys are namespaced by a : so I'd store the check status for a specific node + check combination in a key like status:example.net:load
Updating or creating a new hash is really easy – just write to it:
 1  def save_check(check)
 2    key = "status:%s:%s" % [check.host, check.check]
 3
 4    check.last_state_change = @redis.hget(key, "last_state_change")
 5    check.previous_exitcode = @redis.hget(key, "exitcode")
 6
 7    @redis.multi do
 8      @redis.hset(key, "host", check.host)
 9      @redis.hset(key, "check", check.check)
10      @redis.hset(key, "exitcode", check.exitcode)
11      @redis.hset(key, "lastcheck", check.last_check)
12      @redis.hset(key, "last_state_change", check.last_state_change)
13      @redis.hset(key, "output", check.output)
14      @redis.hset(key, "perfdata", check.perfdata)
15
16      unless check.changed_state?
17        @redis.hincrby(key, "count", 1)
18      else
19        @redis.hset(key, "count", 1)
20      end
21    end
22
23    check.count = @redis.hget(key, "count")
24  end
Here I assume we have an object that represents a check result called check and we're more or less just fetching/updating data in it. I first retrieve the previously saved state of exitcode and last state change time and save those into the object. The object will do some internal state management to determine if the current check result represents a changed state – OK to WARNING etc. – based on this information.
The @redis.multi starts a transaction; everything inside the block will be written in an atomic way by the Redis server, thus ensuring we do not have any half-baked state while other parts of the system might be reading the status of this check.
As I said, the check determines if the current result is a state change when I set the previous exitcode on line 5; this means lines 16-20 will either set the count to 1 if it's a change or just increment the count if not. We use the internal Redis counter handling on line 17 to avoid having to first fetch the count, update it and save it again – this saves a round trip to the database.
You can now just retrieve the whole hash with the HGETALL command, even on the command line:
% redis-cli hgetall status:dev2.devco.net:load
1) "check"
2) "load"
3) "host"
4) "dev2.devco.net"
5) "output"
6) "OK - load average: 0.00, 0.00, 0.00"
7) "lastcheck"
8) "1357494721"
9) "exitcode"
10) "0"
11) "perfdata"
12) "{\"load15\":0.0,\"load5\":0.0,\"load1\":0.0}"
13) "last_state_change"
14) "1357412507"
15) "count"
16) "1178"
References: Redis Hashes, MULTI, HSET, HINCRBY, HGET, HGETALL
Staleness Tracking
Staleness Tracking here means we want to know when we last saw any data about a node; if the node is not providing information we need to go and see what happened to it. Maybe it's up but the data sender died, or maybe it's crashed.
This is where we really start using some of the Redis features to save us time. We need to track when we last saw a specific node and then we have to be able to quickly find all nodes not seen within a certain amount of time, like 120 seconds.
We could retrieve all the check results and check their last updated time and so figure it out but that’s not optimal.
This is what Sorted Sets are for. Remember Sorted Sets have a weight and order the set by that weight; if we use the timestamp at which we last received data for a host as the weight, it means we can very quickly fetch a list of stale hosts.
def update_host_last_seen(host, time)
  @redis.zadd("host:last_seen", time, host)
end
When we call this code like update_host_last_seen("dev2.devco.net", Time.now.utc.to_i) the host will either be added to or updated in the Sorted Set based on the current UTC time. We do this every time we save a new result set with the code in the previous section.
To get a list of hosts that we have not seen in the last 120 seconds is really easy now:
def get_stale_hosts(age)
  @redis.zrangebyscore("host:last_seen", 0, (Time.now.utc.to_i - age))
end
If we call this with an age like 120 we’ll get an array of nodes that have not had any data within the last 120 seconds.
You can do the same check on the CLI, this shows all the machines not seen in the last 60 seconds:
% redis-cli zrangebyscore host:last_seen 0 $(expr $(date +%s) - 60)
1) "dev1.devco.net"
Reference: Sorted Sets, ZADD, ZRANGEBYSCORE
Event Notification
When a check result enters the system that's either a state change, a problem or has metrics associated with it, we'd want to send it on to other pieces of code.
We don't know or care who those interested parties are, we only care that there might be some interested parties – it might be something writing to Graphite or OpenTSDB or both at the same time, or something alerting to Opsgenie or PagerDuty. This is a classic use case for PubSub and Redis has a good PubSub subsystem that we'll use for this.
I am only going to show the metrics publishing – problem and state changes are very similar:
def publish_metrics(check)
  if check.has_perfdata?
    msg = {"metrics" => check.perfdata, "type" => "metrics", "time" => check.last_check, "host" => check.host, "check" => check.check}.to_json

    publish(["metrics", check.host, check.check], msg)
  end
end

def publish(type, message)
  target = ["overwatch", Array(type).join(":")].join(":")

  @redis.publish(target, message)
end
This is pretty simple stuff; we're just publishing some JSON to a named destination like overwatch:metrics:dev1.devco.net:load. We can now write small standalone single function tools that consume this stream of metrics and send it wherever we like – like Graphite or OpenTSDB.
We publish similar events for any incoming check result that is not OK and also for any state transition like CRITICAL to OK; these would be consumed by alerter handlers that might feed pagers or SMS.
We're publishing these alerts to destinations that include the host and specific check – this way we can very easily create individual host views of activity by doing pattern based subscribes.
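Coming back to the metrics stream, a hypothetical standalone Graphite feeder could look roughly like this – the Graphite host, port and metric naming scheme are assumptions, not part of the system described above:

require 'redis'
require 'json'
require 'socket'

# A hypothetical consumer that forwards the metrics stream to Graphite's
# plaintext protocol - host, port and metric paths are assumptions.
redis = Redis.new
graphite = TCPSocket.new("graphite.example.net", 2003)

redis.psubscribe("overwatch:metrics:*") do |on|
  on.pmessage do |_pattern, _channel, message|
    event = JSON.parse(message)

    # perfdata was stored as a JSON string, so it may need a second parse
    metrics = event["metrics"].is_a?(String) ? JSON.parse(event["metrics"]) : event["metrics"]

    metrics.each do |name, value|
      path = "overwatch.%s.%s.%s" % [event["host"].tr(".", "_"), event["check"], name]
      graphite.puts("%s %s %s" % [path, value, event["time"]])
    end
  end
end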
Reference: PubSub, PUBLISH
Alert Tracking
Alert Tracking means keeping track of which alerts we’ve already sent and when we’ll need to send them again like only after 2 hours of the same problem and not on every check result which might come in every minute.
Leading on from the previous section we’d just consume the problem and state change PubSub channels and react on messages from those:
A possible consumer of this might look like this:
@redis.psubscribe("overwatch:state_change:*", "overwatch:issues:*") do |on|
  on.pmessage do |channel, message|
    event = JSON.parse(message)

    case event["type"]
      when "issue"
        sender.notify_issue(event["issue"]["exitcode"], event["host"], event["check"], event["issue"]["output"])
      when "state_change"
        if event["state_change"]["exitcode"] == 0
          sender.notify_recovery(event["host"], event["check"], event["state_change"]["output"])
        end
    end
  end
end
This subscribes to the 2 channels and passes the incoming events to a notifier. Note we're using the patterns here to catch all alerts and changes for all hosts.
The problem here is that without any special handling this is going to fire off alerts every minute, assuming we check the load every minute. This is where Redis expiry of keys comes in.
We'll need to track which messages we have sent and when, and on any state change clear the tracking, thus restarting the counters.
So we’ll just add keys called “alert:dev2.devco.net:load:3” to indicate an UNKNOWN state alert for load on dev2.devco.net:
def record_alert(host, check, status, expire=7200)
  key = "alert:%s:%s:%d" % [host, check, status]
  @redis.set(key, 1)
  @redis.expire(key, expire)
end
This takes an expire time, which defaults to 2 hours, and tells Redis to just remove the key when its time is up.
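As an aside, the SET followed by EXPIRE pair could also be collapsed into a single SETEX call; a sketch of the equivalent:

# Equivalent to the above, setting the value and its expiry in one call
def record_alert(host, check, status, expire=7200)
  @redis.setex("alert:%s:%s:%d" % [host, check, status], expire, 1)
end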
With this we need a way to figure out if we can send again:
def alert_ttl(host, check, status)
  key = "alert:%s:%s:%d" % [host, check, status]
  @redis.ttl(key)
end
This will return the number of seconds till the next alert, and -1 if we are ready to send again.
And finally, on every state change we need to just purge all the tracking for a given node + check combo. The reason for this is that if we notified on CRITICAL a minute ago, then the service recovers to OK but soon goes to CRITICAL again, that most recent CRITICAL alert would be suppressed as part of the previous cycle of alerts.
def clear_alert_ttls(host, check)
  @redis.del(@redis.keys.grep(/^alert:#{host}:#{check}:\d/))
end
So now I can show the two methods that will actually publish the alerts:
The first notifies of issues but only every @interval seconds and it uses the alert_ttl helper above to determine if it should or shouldn’t send:
def notify_issue(exitcode, host, check, output)
  if (ttl = @storage.alert_ttl(host, check, exitcode)) == -1
    subject = "%s %s#%s" % [status_for_code(exitcode), host, check]
    message = "%s: %s" % [subject, output]

    send(message, subject, @recipients)

    @redis.record_alert(host, check, exitcode, @alert_interval)
  else
    Log.info("Not alerting %s#%s due to interval restrictions, next alert in %d seconds" % [host, check, ttl])
  end
end
The second will publish recovery notices – we'd always want those and they will not repeat. Here we clear all the previous alert tracking to avoid incorrect alert suppressions:
def notify_recovery(host, check, output)
  subject = "RECOVERY %s#%s" % [host, check]
  message = "%s: %s" % [subject, output]

  send_alert(message, subject, @recipients)

  @redis.clear_alert_ttls(host, check)
end
References: SET, EXPIRE, SUBSCRIBE, TTL, DEL
Conclusion
This covered a few Redis basics but it’s a very rich system that can be used in many areas so if you are interested spend some quality time with its docs.
Using its facilities saved me a ton of effort while working on a small monitoring system. It is fast and lightweight and enables cross language collaboration that I'd have found hard to replicate in a performant manner without it.
by R.I. Pienaar | Jan 1, 2013 | Code
Most Nagios systems do a lot of forking, especially those built around something like NRPE where each check is a connection to be made to a remote system. On one hand I like NRPE in that it puts the check logic on the nodes using a standard plugin format and provides a fairly re-usable configuration file, but on the other hand the fact that the Nagios machine has to do all this forking has never been good for me.
In the past I've shown one way to scale checks by aggregating all results for a specific check into one result, but this is not always a good fit as pointed out in the post. I've now built a system that uses the same underlying MCollective infrastructure as in the previous post but without the aggregation.
I have a pair of Nagios nodes – one in the UK and one in France – and they are on quite low spec VMs doing around 400 checks each. The problems I have are:
- The machines are constantly loaded under all the forking; one would sit at a 1.5 Load Average almost all the time
- They use a lot of RAM and it's quite spiky; if something is wrong I'd have a lot of checks running concurrently, so the machines have to be bigger than I want them to be
- The check frequency is quite low in the usual Nagios manner; sometimes 10 minutes can go by without a check
- The check results do not represent a point in time; I have no idea how the check results of node1 relate to those on node2 as they can be taken anywhere in the last 10 minutes
These are standard Nagios complaints though, and there are many more, but these ones specifically are what I wanted to address right now with the system I am showing here.
Probably not a surprise, but the solution is built on MCollective. It uses the existing MCollective NRPE agent and the existing queueing infrastructure to push the forking to each individual node – they would do this anyway for every NRPE check – and reads the results off a queue and spools them into the Nagios command file as passive results. Internally it splits the traditional MCollective request-response system into an async processing system using the technique I blogged about before.
As you can see the system is made up of a few components:
- The Scheduler takes care of publishing requests for checks
- MCollective and the middleware provides AAA and transport
- The nodes all run the MCollective NRPE agent which put their replies on the Queue
- The Receiver reads the results from the Queue and writes them to the Nagios command file – see the sketch right after this list for the command format involved
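The Nagios side of that last step is just the standard external command interface; a minimal sketch of submitting one passive result could look like this – the command file path is an assumption, use whatever your nagios.cfg defines as command_file:

# A minimal sketch of handing one passive result to Nagios via its external
# command file - the default path here is an assumption for your setup.
def submit_passive_result(host, check, exitcode, output, command_file="/var/lib/nagios3/rw/nagios.cmd")
  line = "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s" % [Time.now.to_i, host, check, exitcode, output]

  File.open(command_file, "w") { |f| f.puts(line) }
end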
The Scheduler
The scheduler daemon is written using the excellent Rufus Scheduler gem – if you do not know it you totally should check it out, it solves many many problems. Rufus allows me to create simple checks on intervals like 60s and I combine these checks with MCollective filters to create a simple check configuration as below:
nrpe 'check_bacula_main', '6h', 'bacula::node monitored_by=monitor1'
nrpe 'check_disks', '60s', 'monitored_by=monitor1'
nrpe 'check_greylistd', '60s', 'greylistd monitored_by=monitor1'
nrpe 'check_load', '60s', 'monitored_by=monitor1'
nrpe 'check_mailq', '60s', 'monitored_by=monitor1'
nrpe 'check_mongodb', '60s', 'mongodb monitored_by=monitor1'
nrpe 'check_mysql', '60s', 'mysql::server monitored_by=monitor1'
nrpe 'check_pki', '60m', 'monitored_by=monitor1'
nrpe 'check_swap', '60s', 'monitored_by=monitor1'
nrpe 'check_totalprocs', '60s', 'monitored_by=monitor1'
nrpe 'check_zombieprocs', '60s', 'monitored_by=monitor1'
Taking the first line it says: Run the check_bacula_main NRPE check every 6 hours on machines with the bacula::node Puppet Class and with the fact monitored_by=monitor1. I had the monitored_by fact already to assist in building my Nagios configs using a simple search based approach in Puppet.
When the scheduler starts it will log:
W, [2012-12-31T22:10:12.186789 #32043] WARN -- : activemq.rb:96:in `on_connecting' TCP Connection attempt 0 to stomp://nagios@stomp.example.net:6163
W, [2012-12-31T22:10:12.193405 #32043] WARN -- : activemq.rb:101:in `on_connected' Conncted to stomp://nagios@stomp.example.net:6163
I, [2012-12-31T22:10:12.196387 #32043] INFO -- : scheduler.rb:23:in `nrpe' Adding a job for check_bacula_main every 6h matching 'bacula::node monitored_by=monitor1', first in 19709s
I, [2012-12-31T22:10:12.196632 #32043] INFO -- : scheduler.rb:23:in `nrpe' Adding a job for check_disks every 60s matching 'monitored_by=monitor1', first in 57s
I, [2012-12-31T22:10:12.197173 #32043] INFO -- : scheduler.rb:23:in `nrpe' Adding a job for check_load every 60s matching 'monitored_by=monitor1', first in 23s
I, [2012-12-31T22:10:35.326301 #32043] INFO -- : scheduler.rb:26:in `nrpe' Publishing request for check_load with filter 'monitored_by=monitor1'
You can see it reads the file and schedules the first run of each check at a random point between now and the end of its interval; this spreads out the checks.
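The nrpe lines in the config are just a thin DSL on top of rufus-scheduler. The following is only a sketch, assuming the rufus-scheduler 3.x API, of how such a helper could register a repeating job with a randomised first run; publish_nrpe_check is a hypothetical placeholder standing in for the MCollective publish sketched earlier.
require 'rufus-scheduler'

SCHEDULER = Rufus::Scheduler.new

# Tiny interval parser so '60s', '60m' and '6h' work like in the config above.
def seconds(interval)
  interval.to_i * { 's' => 1, 'm' => 60, 'h' => 3600 }.fetch(interval[-1])
end

# Sketch of the nrpe helper: register a repeating job, with the first run at a
# random offset into the interval so the checks get spread out instead of all
# firing together.
def nrpe(check, interval, filter)
  secs = seconds(interval)
  SCHEDULER.every("#{secs}s", :first_in => rand(secs) + 1) do
    publish_nrpe_check(check, filter)  # hypothetical: the MCollective publish shown earlier
  end
end

nrpe 'check_load', '60s', 'monitored_by=monitor1'
nrpe 'check_bacula_main', '6h', 'bacula::node monitored_by=monitor1'

SCHEDULER.join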
The Receiver
The receiver has almost no configuration; it just needs to know which queue to read and where your Nagios command file lives – a small sketch of that write follows the log output below. On startup it logs:
I, [2013-01-01T11:49:38.295661 #23628] INFO -- : mnrpes.rb:35:in `daemonize' Starting in the background
W, [2013-01-01T11:49:38.302045 #23631] WARN -- : activemq.rb:96:in `on_connecting' TCP Connection attempt 0 to stomp://nagios@stomp.example.net:6163
W, [2013-01-01T11:49:38.310853 #23631] WARN -- : activemq.rb:101:in `on_connected' Conncted to stomp://nagios@stomp.example.net:6163
I, [2013-01-01T11:49:38.310980 #23631] INFO -- : receiver.rb:16:in `subscribe' Subscribing to /queue/mcollective.nagios_passive_results_monitor1
I, [2013-01-01T11:49:41.572362 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357040981] PROCESS_SERVICE_CHECK_RESULT;node1.example.net;mongodb;0;OK: connected, databases admin local my_db puppet mcollective
I, [2013-01-01T11:49:42.509061 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357040982] PROCESS_SERVICE_CHECK_RESULT;node2.example.net;zombieprocs;0;PROCS OK: 0 processes with STATE = Z
I, [2013-01-01T11:49:42.510574 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357040982] PROCESS_SERVICE_CHECK_RESULT;node3.example.net;zombieprocs;0;PROCS OK: 1 process with STATE = Z
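Writing those passive results is simple: each reply becomes one line in the external command format Nagios expects, appended to the command file. A minimal sketch of just that step, using sample data and an example command file path (yours depends on your distribution and nagios.cfg):
# Append a single passive service check result to the Nagios command file,
# the named pipe Nagios reads external commands from.
def submit_passive_result(command_file, host, service, exitcode, output)
  line = format("[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s",
                Time.now.to_i, host, service, exitcode, output)
  File.open(command_file, "a") { |f| f.puts(line) }
end

submit_passive_result("/var/lib/nagios3/rw/nagios.cmd",
                      "node1.example.net", "zombieprocs", 0,
                      "PROCS OK: 0 processes with STATE = Z")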
As the results get pushed to Nagios I see the following in its logs:
[1357042122] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;node1.example.net;zombieprocs;0;PROCS OK: 0 processes with STATE = Z
[1357042124] PASSIVE SERVICE CHECK: node1.example.net;zombieprocs;0;PROCS OK: 0 processes with STATE = Z
Did it solve my problems?
I listed the set of problems I wanted to solve, so it's worth evaluating whether I actually solved them.
Less load and RAM use on the Nagios nodes
My Nagios nodes have gone from load averages of 1.5 to 0.1 or 0.0; they are doing nothing and use a lot less RAM. I removed some of the RAM from one of them and gave it to my Jenkins VM instead, which was a huge win. The scheduler and receiver are quite light on resources, as you can see below:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
nagios 9757 0.4 1.8 130132 36060 ? S 2012 3:41 ruby /usr/bin/mnrpes-receiver --pid=/var/run/mnrpes/mnrpes-receiver.pid --config=/etc/mnrpes/mnrpes-receiver.cfg
nagios 9902 0.3 1.4 120056 27612 ? Sl 2012 2:22 ruby /usr/bin/mnrpes-scheduler --pid=/var/run/mnrpes/mnrpes-scheduler.pid --config=/etc/mnrpes/mnrpes-scheduler.cfg
On the RAM side I now never get a pile-up of many checks. I do have stale detection enabled on my Nagios template, so if something breaks in the scheduler/receiver/broker triplet Nagios will still try a traditional active check to see what's going on, but that's bearable.
Check frequency too low
With this system I could do my checks every 10 seconds without any problems, but I settled on 60 seconds as that's perfect for me. Rufus scheduler does a great job of managing that, and the requests from the scheduler are effectively fire-and-forget as long as the broker is up.
Results are spread over 10 minutes
The problem of the load results for node1 and node2 having no temporal correlation is gone too: because I use MCollective's parallel nature, all the load checks happen at the same time:
Here is the publisher:
I, [2013-01-01T12:00:14.296455 #20661] INFO -- : scheduler.rb:26:in `nrpe' Publishing request for check_load with filter 'monitored_by=monitor1'
…and the receiver:
I, [2013-01-01T12:00:14.380981 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node1.example.net;load;0;OK - load average: 0.92, 0.54, 0.42|load1=0.920;9.000;10.000;0; load5=0.540;8.000;9.000;0; load15=0.420;7.000;8.000;0;
I, [2013-01-01T12:00:14.383875 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node2.example.net;load;0;OK - load average: 0.00, 0.00, 0.00|load1=0.000;1.500;2.000;0; load5=0.000;1.500;2.000;0; load15=0.000;1.500;2.000;0;
I, [2013-01-01T12:00:14.387427 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node3.example.net;load;0;OK - load average: 0.02, 0.07, 0.07|load1=0.020;1.500;2.000;0; load5=0.070;1.500;2.000;0; load15=0.070;1.500;2.000;0;
I, [2013-01-01T12:00:14.388754 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node4.example.net;load;0;OK - load average: 0.07, 0.02, 0.00|load1=0.070;1.500;2.000;0; load5=0.020;1.500;2.000;0; load15=0.000;1.500;2.000;0;
I, [2013-01-01T12:00:14.404650 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node5.example.net;load;0;OK - load average: 0.03, 0.09, 0.04|load1=0.030;1.500;2.000;0; load5=0.090;1.500;2.000;0; load15=0.040;1.500;2.000;0;
I, [2013-01-01T12:00:14.405689 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node6.example.net;load;0;OK - load average: 0.06, 0.06, 0.07|load1=0.060;3.000;4.000;0; load5=0.060;3.000;4.000;0; load15=0.070;3.000;4.000;0;
I, [2013-01-01T12:00:14.489590 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node7.example.net;load;0;OK - load average: 0.06, 0.14, 0.14|load1=0.060;1.500;2.000;0; load5=0.140;1.500;2.000;0; load15=0.140;1.500;2.000;0;
All the results are from the same second, win.
Conclusion
So the scaling issues on my small site are solved, and I think the way this is built will work for many people. The code is on GitHub and requires MCollective 2.2.0 or newer.
Having reused the MCollective and Rufus libraries for all the legwork (logging, daemonizing, broker connectivity, addressing and security), I was able to build this in a very short time; the total code base is only 237 lines excluding packaging etc., which is remarkably little for what it does.