by R.I. Pienaar | Oct 2, 2011 | Code
I’ve blogged a lot about a new kind of monitoring but what I didn’t point out is that I do actually like the existing toolset.
I quite like Nagios. Its configuration is horrible, yes, the web UI is near useless and it throws away useful information like perfdata. It is, though, a good poller: it's solid, never crashes, doesn't use too many resources and has created a fairly decent plugin protocol (except for its perfdata representation).
I am of two minds about Munin. I like munin-node and the plugin model, I love that there are hundreds of plugins available already, and I love the introspection that lets machines discover their own capabilities. But I hate everything about the central Munin poller that's supposed to scale to querying all your servers and pre-creating graphs. It simply doesn't work; even on a few hundred machines it's a completely broken model.
So I am trying to find ways to keep these older tools – and their collective thousands of plugins – around but improve things to bring them into the fold of my ideas about monitoring.
For Munin I want to get rid of the central poller; I'd rather have each node produce its data and push it somewhere. In my case I want to put the data into a middleware queue and process it later into an archive, Graphite or some other system like OpenTSDB. I had a look around for Ruby / Munin integrations and came across a few, though I only investigated two.
Adam Jacob has a nice little munin 2 graphite script that simply talks straight to Graphite; this might be enough for some of you, so check it out. I also found munin-ruby from Dan Sosedoff, which is what I ended up using.
Using the munin-ruby code is really simple:
#!/usr/bin/ruby
require 'rubygems'
require 'munin-ruby'

# connect to munin on localhost
munin = Munin::Node.new("localhost", :port => 4949)

# get each service and print its metrics
munin.services.each do |service|
  puts "Metrics for service: #{service}"

  munin.service(service).params.each_pair do |k, v|
    puts " #{k} => #{v}"
  end

  puts
end
This creates output like this:
Metrics for service: entropy
entropy => 174
Metrics for service: forks
forks => 7114853
So from here it's not far to go to get these events onto my middleware. I turn them into JSON blobs like these; the last one is a stat about the collector itself:
{"name":"munin","text":"entropy","subject":"devco.net","tags":{},"metrics":{"entropy.entropy":"162"},"origin":"munin","type":"metric","event_time":1317548538,"severity":0}
{"name":"munin","text":"forks","subject":"devco.net","tags":{},"metrics":{"forks.forks":"7115300"},"origin":"munin","type":"metric","event_time":1317548538,"severity":0}
{"name":"munin","text":"","subject":"devco.net","tags":{},"metrics":{"um_munin.time":3.722587,"um_munin.services":27,"um_munin.metrics":109,"um_munin.sleep":4},"origin":"munin","type":"metric","event_time":1317548538,"severity":0} |
{"name":"munin","text":"entropy","subject":"devco.net","tags":{},"metrics":{"entropy.entropy":"162"},"origin":"munin","type":"metric","event_time":1317548538,"severity":0}
{"name":"munin","text":"forks","subject":"devco.net","tags":{},"metrics":{"forks.forks":"7115300"},"origin":"munin","type":"metric","event_time":1317548538,"severity":0}
{"name":"munin","text":"","subject":"devco.net","tags":{},"metrics":{"um_munin.time":3.722587,"um_munin.services":27,"um_munin.metrics":109,"um_munin.sleep":4},"origin":"munin","type":"metric","event_time":1317548538,"severity":0}
The code that creates and sends this JSON can be seen here; it's probably most useful just to learn from and create your own, as it's a bit specific to my setup.
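Something along these lines, not my actual collector but a minimal sketch, would be a reasonable starting point; the broker details, queue name and exact event layout are assumptions:
#!/usr/bin/ruby
# Minimal sketch: poll the local munin node and publish one JSON event per
# service to a STOMP queue. Broker, queue name and event layout are assumptions.
require 'rubygems'
require 'munin-ruby'
require 'stomp'
require 'json'
require 'socket'

stomp = Stomp::Connection.new("user", "secret", "stomp.example.com", 61613, true)
munin = Munin::Node.new("localhost", :port => 4949)

munin.services.each do |service|
  metrics = {}

  munin.service(service).params.each_pair do |k, v|
    metrics["#{service}.#{k}"] = v
  end

  event = {"name"       => "munin",
           "text"       => service,
           "subject"    => Socket.gethostname,
           "tags"       => {},
           "metrics"    => metrics,
           "origin"     => "munin",
           "type"       => "metric",
           "event_time" => Time.now.to_i,
           "severity"   => 0}

  stomp.publish("/queue/events", event.to_json)
end

stomp.disconnect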
Of course my event system already has the infrastructure to turn these JSON events into Graphite data, which you can see in the image attached to this post, so this was a really quick win.
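If you don't have an event system with that infrastructure, feeding one of these events into Graphite directly is only a few lines; here is a rough sketch using the Carbon plaintext protocol, where the Graphite host and the metric naming scheme are assumptions:
#!/usr/bin/ruby
# Rough sketch: read one JSON event from STDIN and write its metrics to Carbon
# using the plaintext protocol on port 2003. Host and naming are assumptions.
require 'rubygems'
require 'json'
require 'socket'

event = JSON.parse(STDIN.read)

graphite = TCPSocket.new("graphite.example.com", 2003)

event["metrics"].each_pair do |metric, value|
  # produces lines like "devco_net.munin.entropy.entropy 162 1317548538"
  path = "#{event['subject'].gsub('.', '_')}.#{event['origin']}.#{metric}"

  graphite.puts("#{path} #{value} #{event['event_time']}")
end

graphite.close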
The remaining question is about presentation. I want to create some kind of quick node view system like Munin has. I loved the introspection you can do against a munin node to discover graph properties, so there might be something there I can use; otherwise I'll end up making a simple viewer for this.
I imagine that for each branch of the munin data, like cpu, I can either just show all the data by default or take hints from a small DSL on how to present it. You'd need to know that some data needs to be derived or treated as gauges etc. More on that when I've had some time to play.
by R.I. Pienaar | Aug 14, 2011 | Uncategorized
A while ago TechCrunch profiled a company called Nodeable who closed $2 million in funding. They bill themselves as a social network for servers and have a cartoon and a beta invite box on their site but no actual usable information. I signed up but never heard from them, so I've not seen what they're doing at all.
Either way I thought the idea sucked.
Since then I kept coming back to it, thinking maybe it's not a bad idea at all. I've seen many companies try to include the rest of the business in the status of their networks with big graph boards and complex alerting that is perhaps not suited to the audience.
These experiments often fail and cause more confusion than clarity, as the underlying systems are not designed to be friendly to business people. I had a quick Twitter conversation with @patrickdebois too, and a few people on ##infra-talk were keen on the idea. It's not really a surprise that a lot of us want to make the event streams of our systems more accessible to the business and other interested parties.
So I set up a copy of status.net; actually I used the excellent appliance from Turnkey Linux and it took 10 minutes. I gave each of my machines an account with the username being its MAC address and hooked it all into my existing event stream. It was less than 150 lines of code and the result is quite pleasing.
What makes this compelling is specifically that it is void of technical details: no mention of /dev/sda1 and byte counts and percentages that make text hard to scan or understand for non-technical people. Just simple things like "Experiencing high load #warning". This is something normal people can easily digest. It's small enough to scan really quickly, and for many users this is all they need to know.
At the moment I have Puppet changes, IDS events and Nagios events showing up on a Twitter-like timeline for all my machines. I hashtag the tweets using things like #security, #puppet and #fail for failing Puppet resources, and #critical, #warning and #ok for Nagios, etc. I plan on also adding hashtags matching machine roles as captured in my CM. Click on the image to the right for a bigger example.
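My glue code isn't reproduced here, but a minimal sketch of the posting side, assuming StatusNet's Twitter-compatible API, HTTP basic auth and a hypothetical status.example.com instance, might look something like this:
#!/usr/bin/ruby
# Minimal sketch: post an event as a status update to a StatusNet instance.
# The instance URL, credentials and message below are assumptions for illustration.
require 'rubygems'
require 'net/http'
require 'uri'

def tweet_event(text, tag)
  uri = URI.parse("http://status.example.com/api/statuses/update.json")

  Net::HTTP.start(uri.host, uri.port) do |http|
    request = Net::HTTP::Post.new(uri.path)
    request.basic_auth("00:16:3e:aa:bb:cc", "secret") # the username is the machine's MAC address
    request.set_form_data("status" => "#{text} ##{tag}")

    http.request(request)
  end
end

tweet_event("Experiencing high load", "warning")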
Status.net is unfortunately not the tool to build this on; it's simply too buggy and too limited. You can make groups and add machines to groups, but this isn't something like Twitter's lists that users can manage themselves. I can see a case where a webmaster would just add the machines he knows his apps run on to a list and follow that; you can't easily do this with status.net. My machines have their FQDNs as real names, and why on earth status.net doesn't show real names in the timeline I don't get; I hope it's a setting I missed. I might look towards something like Yammer for this, or Nodeable might do if they eventually ship something.
I think the idea has a lot of merit. If I think about the 500 people I follow on Twitter, it's hard work but not at all unmanageable, and you would hope those 500 people are more chatty than a well managed set of servers. The tools we already use, like lists, selective following, hashtags and clients for mobile, desktop, email notifications and RSS, all apply to this use case.
Imagine your server's profile information contained a short description of its function, the contact email address was the team responsible for it, and the geo information was datacenter coordinates. You could identify hot spots in your infrastructure just by looking at tweets on a map, just like we do with tweets from people.
I think the idea has legs; status.net is a disappointment. I am quite keen to see what Nodeable comes out with, and I will keep playing with this idea.
by R.I. Pienaar | Jul 29, 2011 | Code
I’ve often wondered how things will change in a world where everything is a REST API and how relevant our Unix CLI tool chain will be in the long run. I’ve known we needed CLI ways to interact with data – like JSON data – and have given this a lot of thought.
MS PowerShell does some pretty impressive object parsing on its CLI, but I was never really sure how close we could get to that in Unix. I wanted to start my journey with the grep utility, as that seemed a natural starting point and it is my most used CLI tool.
I have no idea how to write parsers and matchers, but luckily I have a very talented programmer working for me who was able to take my ideas and realize them awesomely. Pieter wrote a JSON grep and I want to show off a few bits of what it can do.
I’ll work with the document below:
[
{"name":"R.I.Pienaar",
"contacts": [
{"protocol":"twitter", "address":"ripienaar"},
{"protocol":"email", "address":"rip@devco.net"},
{"protocol":"msisdn", "address":"1234567890"}
]
},
{"name":"Pieter Loubser",
"contacts": [
{"protocol":"twitter", "address":"pieterloubser"},
{"protocol":"email", "address":"foo@example.com"},
{"protocol":"msisdn", "address":"1234567890"}
]
}
]
There are a few interesting things to note about this data:
- The document is an array of hashes; this maps well to the stream-of-data paradigm we know from lines of text in a file, and it is the basic structure jgrep works on.
- Each document has another nested set of documents in an array – the contacts array.
Examples
The examples below show a few possible grep use cases:
A simple grep for a single key in the document:
$ cat example.json | jgrep "name='R.I.Pienaar'"
[
{"name":"R.I.Pienaar",
"contacts": [
{"protocol":"twitter", "address":"ripienaar"},
{"protocol":"email", "address":"rip@devco.net"},
{"protocol":"msisdn", "address":"1234567890"}
]
}
]
We can extract a single key from the result:
$ cat example.json | jgrep "name='R.I.Pienaar'" -s name
R.I.Pienaar
A simple grep for two keys in the document:
% cat example.json |
jgrep "name='R.I.Pienaar' and contacts.protocol=twitter" -s name
R.I.Pienaar
The nested documents pose a problem though: if we were to search for contacts.protocol=twitter and contacts.address=1234567890 we would get both documents rather than none. That's because, in order to search the sub documents effectively, we need to ensure that these two values exist in the same sub document.
$ cat example.json |
jgrep "[contacts.protocol=twitter and contacts.address=1234567890]"
Placing [] around the two terms works like () but restricts the search to a specific sub document. In this case there is no sub document in the contacts array that has both twitter and 1234567890, so nothing matches.
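If the two terms do occur in the same sub document the parent document matches, so based on the same semantics a query like this should return only the first record:
$ cat example.json |
jgrep "[contacts.protocol=twitter and contacts.address=ripienaar]" -s name
R.I.Pienaar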
Of course you can have many search terms:
% cat example.json |
jgrep "[contacts.protocol=twitter and contacts.address=1234567890] or name='R.I.Pienaar'" -s name
R.I.Pienaar
We can also construct entirely new documents:
% cat example.json | jgrep "name='R.I.Pienaar'" -s "name contacts.address"
[
{
"name": "R.I.Pienaar",
"contacts.address": [
"ripienaar",
"rip@devco.net",
"1234567890"
]
}
]
Real World
So I am adding JSON output support to MCollective. Today I was rolling out a new Nagios check script to my nodes and wanted to be sure they all had it, so I used the File Manager agent to fetch the stats for my file from all the machines and then printed the ones that didn't match my expected MD5.
$ mco rpc filemgr status file=/.../check_puppet.rb -j |
jgrep 'data.md5!=a4fdf7a8cc756d0455357b37501c24b5' -s sender
box1.example.com
Eventually you will be able to pipe this output to mco again and call another agent. Here I take all the machines that didn't yet have the right file and cause a Puppet run to happen on them; this is very PowerShell-like and the eventual use case I am building this for:
$ mco rpc filemgr status file=/.../check_puppet.rb -j |
jgrep 'data.md5!=a4fdf7a8cc756d0455357b37501c24b5' |
mco rpc puppetd runonce
I also wanted to know the total size of a logfile across my web servers to be sure I would have enough space to copy them all:
$ mco rpc filemgr status file=/var/log/httpd/access_log -W /apache/ -j |
jgrep -s "data.size"|
awk '{ SUM += $1} END { print SUM/1024/1024 " MB"}'
2757.9093 MB
Now how about interacting with a webservice like the GitHub API:
$ curl -s http://github.com/api/v2/json/commits/list/puppetlabs/marionette-collective/master|
jgrep --start commits "author.name='Pieter Loubser'" -s id
52470fee0b9fe14fb63aeb344099d0c74eaf7513
Here I fetched the most recent commits in the marionette-collective GitHub repository, searched for ones by Pieter and returned the IDs of those commits. The --start argument is needed because the top of the returned JSON is not the array we care about; --start tells jgrep to take the commits key and grep that.
Or since it’s Sysadmin Appreciation Day how about tweets about it:
% curl -s "http://search.twitter.com/search.json?q=sysadminday"|
jgrep --start results -s "text"
RT @RedHat_Training: Did you know that today is Systems Admin Day? A big THANK YOU to all our system admins! Here's to you! http://t.co/ZQk8ifl
RT @SinnerBOFH: #BOFHers RT @linuxfoundation: Happy #SysAdmin Day! You know who you are, rock stars. http://t.co/kR0dhhc #linux
RT @google: Hey, sysadmins - thanks for all you do. May your pagers be silent and your users be clueful today! http://t.co/N2XzFgw
RT @google: Hey, sysadmins - thanks for all you do. May your pagers be silent and your users be clueful today! http://t.co/y9TbCqb #sysadminday
RT @mfujiwara: http://www.sysadminday.com/
RT @mitchjoel: It's SysAdmin Day! Have you hugged your SysAdmin today? Make sure all employees follow the rules: http://bit.ly/17m98z #humor
? @mfujiwara: http://www.sysadminday.com/
Here, as before, we have to grep the results array that is contained inside the returned document.
I can also find all the restaurants near my village via SimpleGEO:
curl -x localhost:8001 -s "http://api.simplegeo.com/1.0/places/51.476959,0.006759.json?category=Restaurant"|
jgrep --start features "properties.distance<2.0" -s "properties.address \
properties.name \
properties.postcode \
properties.phone \
properties.distance"
[
{
"properties.address": "15 Stratheden Road",
"properties.distance": 0.773576114771768,
"properties.phone": "+44 20 8858 8008",
"properties.name": "The Lamplight",
"properties.postcode": "SE3 7TH"
},
{
"properties.address": "9 Stratheden Parade",
"properties.distance": 0.870622234751732,
"properties.phone": "+44 20 8858 0728",
"properties.name": "Sun Ya",
"properties.postcode": "SE3 7SX"
}
]
There's a lot more I didn't show; it supports all the usual operators like <= etc. and a fair few other bits.
You can get this utility by installing the jgrep Ruby Gem or grab the code from GitHub. The Gem is a library, so you can use these abilities in your Ruby programs, but it also includes the CLI tool shown here.
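For use from Ruby something like this should work, assuming the gem exposes the JGrep::jgrep(json, expression) helper; check its README for the exact current API:
#!/usr/bin/ruby
# Rough sketch of library usage; the JGrep::jgrep helper is assumed from the
# gem's documentation, verify against the README of the version you install.
require 'rubygems'
require 'jgrep'

json = File.read("example.json")

matches = JGrep::jgrep(json, "[contacts.protocol=twitter and contacts.address=ripienaar]")

matches.each do |doc|
  puts doc["name"]
end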
It’s pretty new code and we’d totally love feedback, bugs and ideas! Follow the author on Twitter at @pieterloubser and send him some appreciation too.
by R.I. Pienaar | Jul 3, 2011 | Uncategorized
For ages I've wanted to be notified the moment Puppet changes a resource, and I've often been told this cannot be done without monkey patching nasty Puppet internals.
Those following me on Twitter have no doubt noticed my tweets complaining about the complexities of getting this done. I've now managed it, so I am happy to share the result here.
The end result is that my previously mentioned event system is now getting messages sent the moment Puppet does anything. And I have callbacks at the event system layer so I can now instantly react to change anywhere on my network.
If a new webserver in a certain cluster comes up I can instantly notify the load balancer to add it – perhaps via a Puppet run of its own. This lets me build reactive cross node orchestration. Apart from the callbacks I also get run metrics in Graphite, as seen in the image.
The best way I found to make Puppet do this is by creating your own transaction report processor that is the client side of a normal Puppet report; here's a shortened version of what I did. It's overly complex and a pain to get working, since Puppet pluginsync still cannot sync out things like Applications.
The code overrides the finalize_report method to send the final run metrics and the add_resource_status method to publish events for changing resources. Puppet will use the add_resource_status method to add the status of a resource right after evaluating that resource. By tapping into this method we can send events to the network the moment the resource has changed.
require 'rubygems'
require 'stomp'
require 'json'
require 'yaml'
require 'timeout'
require 'puppet/transaction/report'

class Puppet::Transaction::UnimatrixReport < Puppet::Transaction::Report
  def initialize(kind, configuration_version=nil)
    super(kind, configuration_version)

    @um_config = YAML.load_file("/etc/puppet/unimatrix.yaml")
  end

  def um_connection
    @um_connection ||= Stomp::Connection.new(@um_config[:user], @um_config[:password], @um_config[:server], @um_config[:port], true)
  end

  def um_newevent
    event = { ... } # skeleton event, removed for brevity
  end

  def finalize_report
    super

    metrics = um_newevent
    sum = raw_summary

    # add run metrics from raw_summary to metrics, removed for brevity

    Timeout::timeout(2) { um_connection.publish(@um_config[:portal], metrics.to_json) }
  end

  def add_resource_status(status)
    super(status)

    if status.changed?
      event = um_newevent

      # add event status to event hash, removed for brevity

      Timeout::timeout(2) { um_connection.publish(@um_config[:portal], event.to_json) }
    end
  end
end
Finally, as I really have no need to send reports to my Puppet masters, I created a small Application that replaces the standard agent. This application has access to the report even when reporting is disabled, so it will never get saved to disk or copied to the masters.
This application sets up the report using the class above and creates a log destination that feeds the logs into it. It is more or less exactly the Puppet::Application::Agent, so I retain all my normal CLI usage like --tags and --test etc.
require 'puppet/application'
require 'puppet/application/agent'
require 'rubygems'
require 'stomp'
require 'json'

class Puppet::Application::Unimatrix < Puppet::Application::Agent
  run_mode :unimatrix

  def onetime
    unless options[:client]
      $stderr.puts "onetime is specified but there is no client"
      exit(43)
      return
    end

    @daemon.set_signal_traps

    begin
      require 'puppet/transaction/unimatrixreport'

      report = Puppet::Transaction::UnimatrixReport.new("unimatrix")

      # Reports get logs, so we need to make a logging destination that understands our report.
      Puppet::Util::Log.newdesttype :unimatrixreport do
        attr_reader :report

        match "Puppet::Transaction::UnimatrixReport"

        def initialize(report)
          @report = report
        end

        def handle(msg)
          @report << msg
        end
      end

      @agent.run(:report => report)
    rescue => detail
      puts detail.backtrace if Puppet[:trace]
      Puppet.err detail.to_s
    end

    if not report
      exit(1)
    elsif options[:detailed_exitcodes] then
      exit(report.exit_status)
    else
      exit(0)
    end
  end
end
And finally I can now create a callback in my event system. This example is over simplified, but the gist of it is that I am triggering a Puppet run on my machines with class roles::frontend_lb as a result of a web server starting on any machine with the class roles::frontend_web, effectively causing newly created machines to get into the pool immediately. The Puppet run is triggered via MCollective, so I am using its discovery capabilities to run all the instances of load balancers.
add_callback(:name => "puppet_change", :type => ["archive"]) do |event|
  data = event["extended_data"]

  if data["resource_title"] && data["resource_type"]
    if event["name"] == "Service[httpd]"
      # only when it's a new start; shortened for brevity, you'd want to do some more checks here
      if data["restarted"] == false && data["tags"].include?("roles::frontend_web")
        # does a puppet run via mcollective on the frontend load balancer
        UM::Util.run_puppet(:class_filter => "roles::frontend_lb")
      end
    end
  end
end
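For reference, an equivalent ad hoc run from the command line, assuming the puppetd agent is deployed on the nodes and using MCollective's class filter, would be something along these lines:
% mco rpc puppetd runonce -C roles::frontend_lb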
Doing this via a Puppet run demonstrates to me where the balance lies between orchestration and CM. You still need to be able to build a new machine, and that new machine needs to be in the same state as those managed using the orchestration tool. So by using MCollective to trigger Puppet runs I know I am not doing anything out of bounds of my CM system; I am simply causing it to work when I want it to work.
by R.I. Pienaar | Jun 29, 2011 | Uncategorized
There have been many discussions about Facter 2, and one of the things I looked forward to was the ability to read arbitrary files or run arbitrary scripts in a directory to create facts.
This is pretty important, as a lot of the typical users just aren't Ruby coders and really all they want is to trivially add some facts. All too often the answer to common questions in the Puppet user groups ends up being "add a fact", but when they look at adding facts it's just way too daunting.
Sadly, as of today Facter 2 is mostly vaporware, so I created a quick – really quick – fact that reads the directory /etc/facts.d and parses text, JSON or YAML files, but can also run any executable in there.
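For the file based formats the idea is simply one fact per key. As an illustration, and assuming the .txt parser expects plain key=value pairs (the .yaml and .json variants would carry a hash of fact names to values), a static facts file could look like this:
# /etc/facts.d/location.txt
datacenter=dc1
rack=b02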
To write a fact in a shell script that reads /etc/sysconfig/kernel on a RedHat machine and reports the default kernel package name, simply do this:
#!/bin/sh
source /etc/sysconfig/kernel
echo "default_kernel=${DEFAULTKERNEL}"
Add it to /etc/facts.d and make it executable; now you can simply use your new fact:
% facter default_kernel
kernel-xen
Simple stuff. You can get the fact on my GitHub and deploy it using the standard method.