by R.I. Pienaar | Nov 14, 2010 | Uncategorized
Puppet compiles its manifests into a catalog; the catalog is derived from your code and is an artifact that can be executed on your node.
This model is very different from other configuration management systems, which tend to execute top down and just run through the instructions in a very traditional manner.
Having a compiled artifact has many advantages, most of which aren’t really exposed to users today. I have a lot of ideas on how I would like to use the catalog – and the graph it contains. The first idea is to be able to compare catalogs and identify changes between versions of your code.
For this discussion I’ll start with the code below:
class one {
  file{"/tmp/test": content => "foo" }
}

class two {
  include one

  file{"/tmp/test1": content => "foo";
       "/tmp/test2": content => "foo";
  }
}

include two
When I run it I get 3 files:
-rw-r--r-- 1 root root 3 Nov 14 11:32 /tmp/test
-rw-r--r-- 1 root root 3 Nov 14 11:32 /tmp/test1
-rw-r--r-- 1 root root 3 Nov 14 11:31 /tmp/test2
Being able to diff the catalog has a lot of potential. Often when you look at a diff of code it’s hard to know what the end result will be, especially if you use inheritance heavily or if your code relies on external data, like from extlookup. Since the puppet master now supports compiling catalogs and spitting them out to STDOUT, you also have the possibility of compiling node catalogs on a staging master and comparing them against the production catalogs without any risk.
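As a quick, hedged example of that workflow, something like the short Ruby snippet below could save a node’s staging catalog for later comparison. The --compile behaviour is how I recall 2.6-era Puppet printing a compiled catalog to STDOUT, and the node and file names are made up – check your own version before relying on it.
node = "web1.example.com"                      # hypothetical certname
catalog = %x{puppet master --compile #{node}}  # compiled catalog on STDOUT
File.open("#{node}.staging.catalog", "w") { |f| f.write(catalog) }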
The other use case could be during major version upgrades, where you wish to validate that the next release of Puppet will behave the same way as the old one. We’ve had problems in the past where 0.24.x would evaluate templates differently from later versions and you’d get unexpected changes rolled out to your machines.
Let’s make a change to the code above; here’s the diff of the change:
--- test.pp 2010-11-14 11:35:57.000000000 +0000
+++ test2.pp 2010-11-14 11:36:06.000000000 +0000
@@ -5,6 +5,8 @@
class two {
include one
+ File{ mode => 400 }
+
file{"/tmp/test1": content => "foo";
"/tmp/test2": content => "foo";
This is the kind of thing you’ll see in mail if you have your SCM set up to mail diffs, or while sitting in a change control meeting. The change looks simple enough: you want to change the mode of /tmp/test1 and /tmp/test2 to 400 rather than the default.
When you run this code, though, you’ll see that /tmp/test also changes! This is because resource defaults apply to included classes too, and this is exactly the kind of situation that is very hard to pick up from a diff – it’s hard to guess the full impact of the change.
My diff tool would have shown you this (format slightly edited):
Resource counts:
Old: 516
New: 516
Catalogs contain the same resources by resource title
Individual Resource differences:
Old Resource:
file{"/tmp/test": content => acbd18db4cc2f85cedef654fccc4a4d8 }
New Resource:
file{"/tmp/test": mode => 400, content => acbd18db4cc2f85cedef654fccc4a4d8 }
Old Resource:
file{"/tmp/test1": content => acbd18db4cc2f85cedef654fccc4a4d8 }
New Resource:
file{"/tmp/test1": mode => 400, content => acbd18db4cc2f85cedef654fccc4a4d8 }
Old Resource:
file{"/tmp/test2": content => acbd18db4cc2f85cedef654fccc4a4d8 }
New Resource:
file{"/tmp/test2": mode => 400, content => acbd18db4cc2f85cedef654fccc4a4d8 }
Here you can clearly see that all 3 files will be changed and not just two. With this information you’d be much better off in your change control meeting than before.
The diff tool works in a bit of a roundabout manner and I hope to improve the usage a bit in the near future. First you dump each catalog into a format unique to this tool set, then you diff that intermediate format. The reason for the extra step is that you may be comparing catalogs produced by different versions of Puppet, so you need to go via a common intermediate format.
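To give a feel for what the comparison step boils down to, here is a rough Ruby sketch. It assumes the intermediate dump is simply a YAML hash of resources keyed by type and title – a stand-in for the tool’s real format; the actual property handling and output differ.
require 'yaml'

# Compare two catalog dumps, assumed here to be YAML hashes of
# "Type[title]" => {parameters} - a stand-in for the tool's real format.
old = YAML.load_file(ARGV[0])
new = YAML.load_file(ARGV[1])

puts "Resource counts:"
puts "   Old: #{old.size}"
puts "   New: #{new.size}"

missing = old.keys - new.keys
added   = new.keys - old.keys

puts "Resources missing from the new catalog: #{missing.inspect}" unless missing.empty?
puts "Resources only in the new catalog: #{added.inspect}" unless added.empty?

(old.keys & new.keys).each do |title|
  next if old[title] == new[title]

  puts "Old Resource:"
  puts "   #{title} #{old[title].inspect}"
  puts "New Resource:"
  puts "   #{title} #{new[title].inspect}"
end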
There’s one thing worth noting. I initially wrote it to help with a migration from 0.24.8 to 0.25.x or even 2.6.x, and in my initial tests this seemed fine, but on more extensive testing with bigger catalogs I noticed a number of strange things in the 0.24.x catalog format. First, it doesn’t contain all the properties for defined types, and second, it sets a whole lot of extra properties on resources, filling in blanks left by the user.
What this means is that if you diff a 0.24.x catalog against the same code on newer versions, you’ll likely see it complain that all your defined type resources are missing from the 0.24 catalog, and you might also get some false positives on resource diffs. I can’t do much about the missing resources, but I can clear up the false positives. I already handle the ones in my manifests, but there are no doubt more; if you let me know about them I’ll see about working around them too.
The code for this can be found in my GitHub account. It’s still a bit of a work in progress as I haven’t actually done my migration yet, so subscribe to the repo; there are likely to be frequent changes still.
by R.I. Pienaar | Oct 7, 2010 | Uncategorized
Today at Puppet Camp, Luke Kanies and I announced the sale of The Marionette Collective to Puppet Labs – the company behind the well-known Puppet configuration management system. There’s an official press release and FAQ up.
This is a pretty major milestone for me and MCollective as a project. MCollective started as an exploration of new ideas and eventually evolved into a viable product that large customers might want to use.
Personally I am not that interested in becoming a software vendor with sales, support and all that kind of thing. I’d much rather devote my time to the more technical aspects of such a venture. When I was approached by Puppet Labs about this I was pretty excited – they already know how to run a commercial open source project and have the team and structure in place to make it work.
In time we will be publishing a detailed road map for the product and how it will integrate with the plans at Puppet Labs. One of the plans I’m already very excited about is the integration into the Puppet Dashboard. A graphical front-end is something I’ve often wanted but don’t have the skill set to build; Puppet Labs has a great team in place, and together we’ll lower the barrier to entry of data center automation significantly.
As a heavy Puppet user I’ve designed MCollective to work well with Configuration Management tools in general – by aligning commercially with what can rightfully be described as the best of breed CM tool and working on creating a tighter integration between the two projects I hope we’ll see some great innovation and improvement.
While I am not joining Puppet Labs on a permanent basis, I will be working with the team on improving MCollective as well as Puppet. I will still be developing, still be around the community, and still be advising on the direction that MCollective will take.
The work till now was done on my own time and funded by me personally. This sale will give me the opportunity to be paid for my ongoing work on the project and to devote serious time to it, which will translate into a faster pace of innovation and into achieving the goals of the current road map much sooner than I first hoped.
Finally, I’d like to thank the devoted early adopter community; there’s been some excellent work done using MCollective as a framework, and I look forward to growing this community. This move should validate your choice and the time you’ve spent supporting my project, and I hope I can provide you all with an even better tool in the near future.
by R.I. Pienaar | Sep 30, 2010 | Uncategorized
I recently posted about my experiences with GlusterFS. Shortly after writing it I was contacted by the guys from GlusterFS, who were keen to talk about my experiences on the phone.
I’ve just had this call and thought I’d summarize it here (with their permission):
- They were very open and admitted freely that the points in my post came as no surprise to them, given the current version of the code and the position they are in as a company.
- They are now well funded and fully committed to establishing closer ties with the community; they are going to employ a community manager and revamp their community offerings around boards and so forth.
- They are aware of the lack of deep technical documentation and will address this soon. They are essentially struggling to strike a good balance between User and Advanced Administrator documentation and were erring on the side of User documentation.
- Instrumentation will be greatly improved in the upcoming version of GlusterFS. Since it is a nicely modular system, they will essentially have a debug module that can be plugged into a client, which will then log a ton of events and metrics – without rebuilding the code.
- They are adding various command line and management utilities. One such utility is a tool to interrogate the bricks and give reports on the consistency of data, what exact files are out of sync etc.
- There will be a new active replication option where the bricks will essentially communicate directly with each other removing the need for all the ls -lR stuff. The brick to brick comms should also be much quicker to do a full validation than the current method.
- The fuse layer will receive a bit less prominence with NFS being the recommended way of connecting to the Gluster bricks.
There were some more points, but you absolutely get the idea that the next version – already in beta – will at least start the process of properly addressing many of the points I raised, confirming both that my read of the current status was spot on and that they were aware of the issues and are working in the background to resolve them.
I’d encourage people to keep an eye on GlusterFS: keep it in your bookmark list and keep following the news stream. I do still think the simplicity of their offering has serious merit; if they can address my points in the next few releases, GlusterFS would become a system worth considering.
I’ll hopefully have time to evaluate the next version thoroughly and will see about a follow up post at the time. The list of fixes/improvements is quite big and I have no doubt it will take at least a few releases to tick them all off but it all sounds like it’s heading in the right direction at least.
Update: It’s now been several months since this phone call. New code is out there, etc. I’ve never heard from the people at Gluster again, in contrast with the promises they made during this call. This is proof to me that they still do not have proper community handling and that this affects all levels of the company. For a commercial open source company this is a very big red light. I would not recommend GlusterFS as a solution to any problem ever as a result.
by R.I. Pienaar | Sep 22, 2010 | Uncategorized
I have a need for shared storage of around 300GB worth of 200×200 image files. These files are written once, then resized and stored. Once stored they never change again – they might get deleted.
They get served up to 10 Squid machines and the cache times are huge – like years. This is a very low IO setup, in other words: very few writes, reasonably few reads, and the data isn’t that big, just a lot of files – around 2 million.
In the past I used a DRBD + Linux-HA + NFS setup to host this, but I felt there’s a bit too much magic involved with that and I also felt it would be nice to be able to use 2 nodes at a time rather than active-passive.
I considered many alternatives; in the end I settled on GlusterFS based on the following:
- It stores just files: each storage brick simply has a lot of files on ext3 or whatever, and you can still safely perform reads on those files directly on the bricks. In the event of a filesystem failure or other event, your existing tool set for dealing with filesystems still applies.
- It seems very simple – use a FUSE driver, store some xattr data with each file and let the client sort out replication.
- I had concerns about FUSE, but I felt my low IO load would not be a problem, as the Gluster authors are very insistent – almost insultingly so when asked about this on IRC – that FUSE issues are just FUD.
- It has a lot of flexibility in how you can lay out data; you can build all of the basic RAID-style setups just using reasonably priced machines as storage bricks.
- There is no metadata server. Most cluster filesystems need a metadata server on dedicated hardware kept resilient using DRBD and Linux-HA – exactly the setup I wish to avoid, and overkill if all I need is a 2 node cluster.
Going in I had a few concerns:
- There is no way to know the state of your storage in a replicated setup. The clients take care of data syncing, not the servers, so there’s no health indicator anywhere.
- To re-sync your data after a maintenance event you need to run ls -lR to read each file; this checks the validity of each file, syncing out any stale ones. This seemed very weird to me, and in the end my fears about it were well founded.
- The documentation is poor – extremely poor and lacking. What there is applies to older versions, and the code has had a massive refactor in version 3.
I built a few test setups, first on EC2 then on some of my own VMs, tried to break it in various ways, tried to corrupt data and to come up with scenarios where the wrong file would be synced, and found it overall to be sound. I went through the docs, identified any documented shortfalls, verified whether these still existed in 3.0, and mostly found they didn’t apply anymore.
We eventually ordered kit; I built the replicas using their suggested tool, set it up and copied all my data onto the system. Immediately I saw that small files were totally going to kill this setup: doing an rsync of 150GB took many days over a Gigabit network. IRC suggested that if I was worried about the initial build being slow I could rsync the data onto each machine directly, then start the FS layer and sync it with ls -lR.
I tested this theory out and it worked: files copied onto my machines quickly, and judging by the write traffic to the disks and network the ls -lR at the end found little to change – both bricks were in sync.
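For what it’s worth, the ls -lR trick is nothing more than forcing a stat of every file from a client so the replication layer checks – and if needed heals – each one. A minimal Ruby sketch of the same walk, assuming the volume is mounted at /mnt/glusterfs:
require 'find'

# Stat every file under the (assumed) client mount point; the stat alone
# is what triggers the per-file consistency check.
Find.find("/mnt/glusterfs") do |path|
  begin
    File.lstat(path)
  rescue Errno::ENOENT, Errno::EACCES
    next # files can vanish or be unreadable mid-walk
  end
end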
We cut over 12 client nodes to the storage and at first it was great. Load averages were higher, which I expected since IO would be a bit slower to respond, but nothing to worry about. A few hours into running it, all client IO just stopped: an ls, or a stat on a specific file, would take 2 or 3 minutes to respond. Predictably, for a web app this is completely unbearable.
A quick bit of investigation suggested that the client machines were all doing lots of data syncing – very odd since all the data was in sync to start with, so what gives? It seemed that with 12 machines all resyncing data the storage bricks just couldn’t cope; they were showing very high CPU. We shut down the 2nd brick in the replica, IO performance recovered and we were able to run – but now without a 2nd host active.
I asked on the IRC channel for ways of debugging this and roughly got the following options:
- Recompile the code with debugging enabled, shut everything down and deploy the new code, which would perform worse, but at least you can find out what’s happening.
- Make various changes to the cluster setup files – tweaking caches etc.; these at least didn’t require recompiles or total downtime, so I was able to test a few of these options.
- Get the storage back in sync by firewalling the bulk of my clients off the 2nd brick, leaving just one – say a dev machine – then start the 2nd brick, fix the replica with ls -lR and finally enable all the nodes. I was able to test this, but even with one node doing file syncs all the IO on all the connected clients failed – even though my bricks weren’t overloaded IO- or CPU-wise.
I posted to the mailing list, hoping to hear from the authors who don’t seem to hang out on IRC much, and this was met with zero responses.
At this point I decided to ditch GlusterFS. I don’t have a lot of data about what actually happened or what caused it; I can’t say with certainty what was killing all the IO – and that really is part of the problem: it is too hard to debug issues in a GlusterFS cluster when you need to recompile and take it all down.
Debugging complex systems is all about data: it’s about being able to get debug information when needed, being able to graph metrics, being able to instrument the problem software. With GlusterFS this is either not possible or too disruptive. Even if the issues can be overcome, getting to that point is simply too disruptive to operations because the software is not easily managed.
Had the problem been something else – not replication related – I might have been better off, as I could have enabled debug on one of the bricks. But since at that point I had just one brick with valid data, and any attempt to sync the second node resulted in IO dying, running debug code meant unmounting all connected clients and rebuilding/restarting my only viable storage server.
The bottom line is that while GlusterFS seems simple and elegant, it is too hard – or impossible – to debug should you run into problems. An HA file system should not require a complete shutdown to try out suggested tweaks, recompiles and so on. Going down that route might mean days or even weeks of regular service interruption, and that is not acceptable in the modern web world. Technically it might be sound and elegant; from an operations point of view it is not suitable.
One small side note: as GlusterFS stores a lot of its magic data in extended attributes on the files, I found that my GlusterFS based storage was about 15 to 20% bigger than my non-GlusterFS copies, which seems a huge amount of waste. Not a problem these days with cheap disks, but worth noting.
by R.I. Pienaar | Sep 21, 2010 | Uncategorized
I’ve released version 0.4.9 of MCollective. It’s a bugfix/small feature release; the only big thing is a new agent called rpcutil that lets you get all sorts of information about the running collective.
You can, for example, get a list of all the agents and all their metadata including versions etc.; this will be great for documenting/auditing an infrastructure based on MCollective.
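As a rough illustration, querying the new agent from a SimpleRPC client could look something like this. It’s a sketch: the agent_inventory action name is my recollection of the rpcutil DDL, so check the DDL that ships with your release.
#!/usr/bin/env ruby

require 'mcollective'

include MCollective::RPC

# Build a SimpleRPC client for the (new) rpcutil agent and print what
# every node reports about its loaded agents.
util = rpcclient("rpcutil")
util.progress = false

printrpc util.agent_inventory # action name assumed from the DDL

printrpcstats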
The mc-inventory script has been rewritten to use the new agent and now shows you some stats in addition to the node’s inventory when invoked:
Server Statistics:
Version: 0.4.8
Start Time: Mon Sep 20 17:29:24 +0100 2010
Config File: /etc/mcollective/server.cfg
Process ID: 20215
Total Messages: 122317
Messages Passed Filters: 120731
Messages Filtered: 1586
Replies Sent: 3701
Total Processor Time: 191.29 seconds
System Time: 109.82 seconds
Full information about the release can be found here.
by R.I. Pienaar | Sep 19, 2010 | Uncategorized
I’ve had a pipe dream of creating something like Ruby’s IRB but tailored for MCollective – something with DDL-assisted completion and all sorts of crazy kewl things.
Having looked into this a few times, I concluded IRB is a black box of undocumented voodoo and always gave up. I had another google this weekend and came across something that set me off in the right direction.
Christopher Burnett has a great little Posterous post up showing how to build custom IRB shells. With a little further digging I came across Bond from Gabriel Horner. These two combined into something that will definitely be one of my favorite toys.
The result is an MCollective IRB shell that you can grab in the ext directory of the mcollective tarball – it brings in some native gem dependencies that I really don’t want in the base deploy.
It’s best to see it in action before looking at the code so you know what the behavior is; see the screencast below.
Getting the basic mc-irb going was pretty much exactly as Christopher’s Posterous post shows, so I won’t go into the detail of that (there’s a small sketch of the general pattern at the end of this post). What I do want to show is the DDL-based command completion with Bond.
require 'bond'
Bond.start

Bond.complete(:method => "rpc") do |e|
  begin
    if e.argument == 1
      if e.arguments.last == "?"
        puts "\n\nActions for #{@agent_name}:\n"
        @agent.ddl.actions.each do |action|
          puts "%20s - %s" % [ ":#{action}", @agent.ddl.action_interface(action)[:description] ]
        end
        print "\n" + e.line
      end

      @agent.ddl.actions
    elsif e.argument > 1
      action = eval(e.arguments[0]).to_s
      ddl = @agent.ddl.action_interface(action)

      if e.arguments.last == "?"
        puts "\n\nArguments for #{action}:\n"
        ddl[:input].keys.each do |input|
          puts "%20s - %s" % [ ":#{input}", ddl[:input][input][:description] ]
        end
        print "\n" + e.line
      end

      ddl[:input].keys
    end
  rescue Exception
    []
  end
end
This code checks which argument is being completed, handles the first and subsequent arguments differently, and supports the ? special case by loading the DDL and displaying the relevant descriptions.
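In use it ends up looking roughly like this – a hypothetical session where the agent and action names are just examples, and completion is triggered with TAB:
rpc <TAB>            # completes the current agent's actions
rpc ?<TAB>           # prints each action with its DDL description
rpc :status, <TAB>   # completes that action's input argument names
rpc :status, ?<TAB>  # prints each argument with its description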
I’ve not found much by way of complex examples on the Bond site or in its own bundled completions; hopefully this helps someone.
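As for the basic IRB embedding I skipped over above, the usual pattern boils down to roughly the following – a minimal sketch of the standard IRB idiom, not the actual mc-irb code, which also sets up the agent, filters and so on.
require 'irb'

# Embed IRB in your own tool: evaluate user input in the context of the
# object you pass in, here simply the top level object.
def start_shell(context)
  IRB.setup(nil)

  workspace = IRB::WorkSpace.new(context)
  shell = IRB::Irb.new(workspace)

  IRB.conf[:MAIN_CONTEXT] = shell.context

  trap("SIGINT") { shell.signal_handle }

  catch(:IRB_EXIT) { shell.eval_input }
end

start_shell(self)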